Authors:Jessica Bader, Leander Girrbach, Stephan Alaniz, Zeynep Akata
Abstract:
Concept Bottleneck Models (CBMs) and other concept-based interpretable models show great promise for making AI applications more transparent, which is essential in fields like medicine. Despite their success, we demonstrate that CBMs struggle to reliably identify the correct concepts under distribution shifts. To assess the robustness of CBMs to concept variations, we introduce SUB: a fine-grained image and concept benchmark containing 38,400 synthetic images based on the CUB dataset. To create SUB, we select a CUB subset of 33 bird classes and 45 concepts to generate images which substitute a specific concept, such as wing color or belly pattern. We introduce a novel Tied Diffusion Guidance (TDG) method to precisely control generated images, where noise sharing for two parallel denoising processes ensures that both the correct bird class and the correct attribute are generated. This novel benchmark enables rigorous evaluation of CBMs and similar interpretable models, contributing to the development of more robust methods. Our code is available at https://github.com/ExplainableML/sub and the dataset at http://huggingface.co/datasets/Jessica-bader/SUB.
Chinese: 概念瓶颈模型(CBMs)在分布变化下难以可靠识别正确概念,为此我们引入了包含38,400张合成图像的SUB基准和捆绑扩散引导方法,以严格评估并推动更稳健可解释模型的发展。
English: Concept Bottleneck Models (CBMs) face challenges in accurately identifying concepts under distribution shifts, prompting the development of the SUB benchmark with 38,400 synthetic images and a Tied Diffusion Guidance method to evaluate and enhance their robustness.
Authors:Zihan Wang, Jeff Tan, Tarasha Khurana, Neehar Peri, Deva Ramanan
Abstract:
We address the problem of dynamic scene reconstruction from sparse-view videos. Prior work often requires dense multi-view captures with hundreds of calibrated cameras (e.g. Panoptic Studio). Such multi-view setups are prohibitively expensive to build and cannot capture diverse scenes in-the-wild. In contrast, we aim to reconstruct dynamic human behaviors, such as repairing a bike or dancing, from a small set of sparse-view cameras with complete scene coverage (e.g. four equidistant inward-facing static cameras). We find that dense multi-view reconstruction methods struggle to adapt to this sparse-view setup due to limited overlap between viewpoints. To address these limitations, we carefully align independent monocular reconstructions of each camera to produce time- and view-consistent dynamic scene reconstructions. Extensive experiments on PanopticStudio and Ego-Exo4D demonstrate that our method achieves higher quality reconstructions than prior art, particularly when rendering novel views. Code, data, and data-processing scripts are available on https://github.com/ImNotPrepared/MonoFusion.
Chinese: 本研究提出一种通过对齐独立单目重建来从稀疏视角视频中重构动态场景的方法,在新视角合成方面相比现有技术取得了更优的效果。
English: This study introduces a method for reconstructing dynamic scenes from sparse-view videos by aligning independent monocular reconstructions, achieving superior results over existing techniques especially in novel view synthesis.
Authors:Justin Kay, Grant Van Horn, Subhransu Maji, Daniel Sheldon, Sara Beery
Abstract:
The widespread availability of off-the-shelf machine learning models poses a challenge: which model, of the many available candidates, should be chosen for a given data analysis task? This question of model selection is traditionally answered by collecting and annotating a validation dataset -- a costly and time-intensive process. We propose a method for active model selection, using predictions from candidate models to prioritize the labeling of test data points that efficiently differentiate the best candidate. Our method, CODA, performs consensus-driven active model selection by modeling relationships between classifiers, categories, and data points within a probabilistic framework. The framework uses the consensus and disagreement between models in the candidate pool to guide the label acquisition process, and Bayesian inference to update beliefs about which model is best as more information is collected. We validate our approach by curating a collection of 26 benchmark tasks capturing a range of model selection scenarios. CODA outperforms existing methods for active model selection significantly, reducing the annotation effort required to discover the best model by upwards of 70% compared to the previous state-of-the-art. Code and data are available at https://github.com/justinkay/coda.
Chinese Summary: CODA提出了一种主动模型选择方法,利用候选模型间的共识与分歧来优先标注数据,相比现有技术将发现最佳模型所需的标注工作量减少了70%以上。
English Summary: CODA introduces an active model selection method that uses consensus and disagreement among candidate models to prioritize data labeling, significantly reducing annotation effort by over 70% compared to existing approaches.
Authors:Rongzhen Zhao, Yi Zhao, Juho Kannala, Joni Pajarinen
Abstract:
Unlike popular solutions based on dense feature maps, Object-Centric Learning (OCL) represents visual scenes as sub-symbolic object-level feature vectors, termed slots, which are highly versatile for tasks involving visual modalities. OCL typically aggregates object superpixels into slots by iteratively applying competitive cross attention, known as Slot Attention, with the slots as the query. However, once initialized, these slots are reused naively, causing redundant slots to compete with informative ones for representing objects. This often results in objects being erroneously segmented into parts. Additionally, mainstream methods derive supervision signals solely from decoding slots into the input's reconstruction, overlooking potential supervision based on internal information. To address these issues, we propose Slot Attention with re-Initialization and self-Distillation (DIAS): $\emph{i)}$ We reduce redundancy in the aggregated slots and re-initialize extra aggregation to update the remaining slots; $\emph{ii)}$ We drive the bad attention map at the first aggregation iteration to approximate the good at the last iteration to enable self-distillation. Experiments demonstrate that DIAS achieves state-of-the-art on OCL tasks like object discovery and recognition, while also improving advanced visual prediction and reasoning. Our source code and model checkpoints are available on https://github.com/Genera1Z/DIAS.
中文: 提出的DIAS方法通过重新初始化冗余槽位并实现注意力图的自蒸馏,增强了以对象为中心的学习,在物体发现和识别任务中达到了最先进的性能。
English: The proposed DIAS method enhances Object-Centric Learning by re-initializing redundant slots and implementing self-distillation of attention maps, achieving state-of-the-art performance in object discovery and recognition tasks.
Authors:Nasim Shirvani-Mahdavi, Devin Wingfield, Amin Ghasemi, Chengkai Li
Abstract:
Knowledge graphs (KGs) often contain sufficient information to support the inference of new facts. Identifying logical rules not only improves the completeness of a knowledge graph but also enables the detection of potential errors, reveals subtle data patterns, and enhances the overall capacity for reasoning and interpretation. However, the complexity of such rules, combined with the unique labeling conventions of each KG, can make them difficult for humans to understand. In this paper, we explore the potential of large language models to generate natural language explanations for logical rules. Specifically, we extract logical rules using the AMIE 3.5.1 rule discovery algorithm from the benchmark dataset FB15k-237 and two large-scale datasets, FB-CVT-REV and FB+CVT-REV. We examine various prompting strategies, including zero- and few-shot prompting, including variable entity types, and chain-of-thought reasoning. We conduct a comprehensive human evaluation of the generated explanations based on correctness, clarity, and hallucination, and also assess the use of large language models as automatic judges. Our results demonstrate promising performance in terms of explanation correctness and clarity, although several challenges remain for future research. All scripts and data used in this study are publicly available at https://github.com/idirlab/KGRule2NL}{https://github.com/idirlab/KGRule2NL.
知识图谱可通过逻辑规则推断新事实,本研究利用大型语言模型为这些规则生成自然语言解释,并通过人工与自动评估检验了其正确性与清晰度。
Knowledge graphs can infer new facts through logical rules, and this study uses large language models to generate natural language explanations for these rules, evaluating their correctness and clarity through human and automated assessments.
Authors:Dongming Wu, Yanping Fu, Saike Huang, Yingfei Liu, Fan Jia, Nian Liu, Feng Dai, Tiancai Wang, Rao Muhammad Anwer, Fahad Shahbaz Khan, Jianbing Shen
Abstract:
General robotic grasping systems require accurate object affordance perception in diverse open-world scenarios following human instructions. However, current studies suffer from the problem of lacking reasoning-based large-scale affordance prediction data, leading to considerable concern about open-world effectiveness. To address this limitation, we build a large-scale grasping-oriented affordance segmentation benchmark with human-like instructions, named RAGNet. It contains 273k images, 180 categories, and 26k reasoning instructions. The images cover diverse embodied data domains, such as wild, robot, ego-centric, and even simulation data. They are carefully annotated with an affordance map, while the difficulty of language instructions is largely increased by removing their category name and only providing functional descriptions. Furthermore, we propose a comprehensive affordance-based grasping framework, named AffordanceNet, which consists of a VLM pre-trained on our massive affordance data and a grasping network that conditions an affordance map to grasp the target. Extensive experiments on affordance segmentation benchmarks and real-robot manipulation tasks show that our model has a powerful open-world generalization ability. Our data and code is available at https://github.com/wudongming97/AffordanceNet.
中文: 研究人员构建了RAGNet大规模功能分割基准和AffordanceNet框架,通过结合视觉语言模型与功能地图,显著提升了机器人在开放环境中的抓取能力,实验证明其具有优异的泛化性能。
English: Researchers developed RAGNet, a large-scale affordance segmentation benchmark with human-like instructions, and AffordanceNet, a framework that enhances open-world robotic grasping by integrating visual-language models with affordance maps, demonstrating strong generalization in experiments.
Authors:Emery Pierson, Lei Li, Angela Dai, Maks Ovsjanikov
Abstract:
Deep functional maps have recently emerged as a powerful tool for solving non-rigid shape correspondence tasks. Methods that use this approach combine the power and flexibility of the functional map framework, with data-driven learning for improved accuracy and generality. However, most existing methods in this area restrict the learning aspect only to the feature functions and still rely on axiomatic modeling for formulating the training loss or for functional map regularization inside the networks. This limits both the accuracy and the applicability of the resulting approaches only to scenarios where assumptions of the axiomatic models hold. In this work, we show, for the first time, that both in-network regularization and functional map training can be replaced with data-driven methods. For this, we first train a generative model of functional maps in the spectral domain using score-based generative modeling, built from a large collection of high-quality maps. We then exploit the resulting model to promote the structural properties of ground truth functional maps on new shape collections. Remarkably, we demonstrate that the learned models are category-agnostic, and can fully replace commonly used strategies such as enforcing Laplacian commutativity or orthogonality of functional maps. Our key technical contribution is a novel distillation strategy from diffusion models in the spectral domain. Experiments demonstrate that our learned regularization leads to better results than axiomatic approaches for zero-shot non-rigid shape matching. Our code is available at: https://github.com/daidedou/diffumatch/
中文摘要:本研究首次提出用数据驱动的谱域扩散生成模型替代传统公理化的功能映射正则化方法,实现了无需类别先验的零样本非刚性形状匹配,其性能优于现有公理化方法。
English Summary: This paper introduces a novel data-driven approach that replaces traditional axiomatic regularization in deep functional maps with a category-agnostic generative model trained via spectral domain diffusion, achieving superior zero-shot shape matching performance.
Authors:Haipeng Liu, Yuxuan Liu, Ting Long
Abstract:
Personalized question recommendation aims to guide individual students through questions to enhance their mastery of learning targets. Most previous methods model this task as a Markov Decision Process and use reinforcement learning to solve, but they struggle with efficient exploration, failing to identify the best questions for each student during training. To address this, we propose Ranking Alignment Recommendation (RAR), which incorporates collaborative ideas into the exploration mechanism, enabling more efficient exploration within limited training episodes. Experiments show that RAR effectively improves recommendation performance, and our framework can be applied to any RL-based question recommender. Our code is available in https://github.com/wuming29/RAR.git.
中文: 提出的排序对齐推荐(RAR)将协同过滤思想融入强化学习的探索机制,通过更高效的训练显著提升了个性化题目推荐的性能。
English: The proposed Ranking Alignment Recommendation (RAR) integrates collaborative filtering into reinforcement learning exploration to enhance personalized question recommendation by enabling more efficient training and improved performance.
Authors:Yang Gao, Po-Chien Luan, Kaouther Messaoud, Lan Feng, Alexandre Alahi
Abstract:
While large-scale pre-training has advanced human trajectory prediction, a critical challenge remains: zero-shot transfer to unseen dataset with varying temporal dynamics. State-of-the-art pre-trained models often require fine-tuning to adapt to new datasets with different frame rates or observation horizons, limiting their scalability and practical utility. In this work, we systematically investigate this limitation and propose a robust solution. We first demonstrate that existing data-aware discrete models struggle when transferred to new scenarios with shifted temporal setups. We then isolate the temporal generalization from dataset shift, revealing that a simple, explicit conditioning mechanism for temporal metadata is a highly effective solution. Based on this insight, we present OmniTraj, a Transformer-based model pre-trained on a large-scale, heterogeneous dataset. Our experiments show that explicitly conditioning on the frame rate enables OmniTraj to achieve state-of-the-art zero-shot transfer performance, reducing prediction error by over 70\% in challenging cross-setup scenarios. After fine-tuning, OmniTraj achieves state-of-the-art results on four datasets, including NBA, JTA, WorldPose, and ETH-UCY. The code is publicly available: https://github.com/vita-epfl/omnitraj
中文: 该研究提出了OmniTraj模型,通过显式的时间元数据调节机制,在人类轨迹预测中实现了卓越的零样本迁移能力,并在多个数据集上取得了最优性能。
English: The study introduces OmniTraj, a Transformer-based model that uses explicit temporal metadata conditioning to achieve superior zero-shot transfer and state-of-the-art performance in human trajectory prediction across diverse datasets.
Authors:Moaad Khamlich, Francesco Romor, Gianluigi Rozza
Abstract:
Semi-discrete optimal transport (SOT), which maps a continuous probability measure to a discrete one, is a fundamental problem with wide-ranging applications. Entropic regularization is often employed to solve the SOT problem, leading to a regularized (RSOT) formulation that can be solved efficiently via its convex dual. However, a significant computational challenge emerges when the continuous source measure is discretized via the finite element (FE) method to handle complex geometries or densities, such as those arising from solutions to Partial Differential Equations (PDEs). The evaluation of the dual objective function requires dense interactions between the numerous source quadrature points and all target points, creating a severe bottleneck for large-scale problems. This paper presents a cohesive framework of numerical strategies to overcome this challenge. We accelerate the dual objective and gradient evaluations by combining distance-based truncation with fast spatial queries using R-trees. For overall convergence, we integrate multilevel techniques based on hierarchies of both the FE source mesh and the discrete target measure, alongside a robust scheduling strategy for the regularization parameter. When unified, these methods drastically reduce the computational cost of RSOT, enabling its practical application to complex, large-scale scenarios. We provide an open-source C++ implementation of this framework, built upon the deal.II finite element library, available at https://github.com/SemiDiscreteOT/SemiDiscreteOT.
中文: 本文提出了一种高效的数值框架,通过结合空间查询优化和多层次技术,加速了基于熵正则化的半离散最优输运计算,使其能够适用于涉及有限元离散化复杂几何的大规模问题。
English: This paper introduces an efficient numerical framework that accelerates semi-discrete optimal transport with entropic regularization by combining spatial query optimization and multilevel techniques, making it applicable to large-scale problems involving complex geometries from finite element discretizations.
Authors:Xin Li, Keren Fu, Qijun Zhao
Abstract:
Existing video camouflaged object detection (VCOD) methods primarily rely on spatial appearance features to perceive motion cues for breaking camouflage. However, the high similarity between foreground and background in VCOD results in limited discriminability of spatial appearance features (e.g., color and texture), restricting detection accuracy and completeness. Recent studies demonstrate that frequency features can not only enhance feature representation to compensate for appearance limitations but also perceive motion through dynamic variations in frequency energy. Furthermore, the emerging state space model called Mamba, enables efficient perception of motion cues in frame sequences due to its linear-time long-sequence modeling capability. Motivated by this, we propose a novel visual camouflage Mamba (Vcamba) based on spatio-frequency motion perception that integrates frequency and spatial features for efficient and accurate VCOD. Specifically, we propose a receptive field visual state space (RFVSS) module to extract multi-scale spatial features after sequence modeling. For frequency learning, we introduce an adaptive frequency component enhancement (AFE) module with a novel frequency-domain sequential scanning strategy to maintain semantic consistency. Then we propose a space-based long-range motion perception (SLMP) module and a frequency-based long-range motion perception (FLMP) module to model spatio-temporal and frequency-temporal sequences in spatial and frequency phase domains. Finally, the space and frequency motion fusion module (SFMF) integrates dual-domain features for unified motion representation. Experimental results show that our Vcamba outperforms state-of-the-art methods across 6 evaluation metrics on 2 datasets with lower computation cost, confirming the superiority of Vcamba. Our code is available at: https://github.com/BoydeLi/Vcamba.
中文摘要:提出的Vcamba模型通过整合空间与频域特征,结合专门的运动感知模块,在保持计算效率的同时,在多项评估指标上实现了卓越的视频伪装目标检测性能。
English Summary: The proposed Vcamba model integrates spatial and frequency features with specialized motion perception modules to achieve superior video camouflaged object detection performance across multiple metrics while maintaining computational efficiency.
Authors:Yu-Tang Chang, Shih-Fang Chen
Abstract:
Signal unmixing analysis decomposes data into basic patterns and is widely applied in chemical and biological research. Multivariate curve resolution (MCR), a branch of signal unmixing, separates mixed signals into components (base patterns) and their concentrations (intensity), playing a key role in understanding composition. Classical MCR is typically framed as matrix factorization (MF) and requires a user-specified number of components, usually unknown in real data. Once data or component number increases, the scalability of these MCR approaches face significant challenges. This study reformulates MCR as a data generative process (gMCR), and introduces an Energy-Based solver, EB-gMCR, that automatically discovers the smallest component set and their concentrations for reconstructing the mixed signals faithfully. On synthetic benchmarks with up to 256 components, EB-gMCR attains high reconstruction fidelity and recovers the component count within 5% at 20dB noise and near-exact at 30dB. On two public spectral datasets, it identifies the correct component count and improves component separation over MF-based MCR approaches (NMF variants, ICA, MCR-ALS). EB-gMCR is a general solver for fixed-pattern signal unmixing (components remain invariant across mixtures). Domain priors (non-negativity, nonlinear mixing) enter as plug-in modules, enabling adaptation to new instruments or domains without altering the core selection learning step. The source code is available at https://github.com/b05611038/ebgmcr_solver.
中文摘要:本研究提出EB-gMCR能量基求解器,将多元曲线分辨率重构为数据生成过程,能自动确定最小组分集及其浓度以实现精确信号重建,在合成和真实光谱数据集中均展现出优于传统方法的性能。
English Summary: This study introduces EB-gMCR, an energy-based solver that reformulates multivariate curve resolution as a generative process to automatically determine the smallest component set and their concentrations for accurate signal reconstruction, demonstrating superior performance over traditional methods in both synthetic and real spectral datasets.
Authors:Yaoye Zhu, Zhe Wang, Yan Wang
Abstract:
As cooperative systems that leverage roadside cameras to assist autonomous vehicle perception become increasingly widespread, large-scale precise calibration of infrastructure cameras has become a critical issue. Traditional manual calibration methods are often time-consuming, labor-intensive, and may require road closures. This paper proposes MamV2XCalib, the first V2X-based infrastructure camera calibration method with the assistance of vehicle-side LiDAR. MamV2XCalib only requires autonomous vehicles equipped with LiDAR to drive near the cameras to be calibrated in the infrastructure, without the need for specific reference objects or manual intervention. We also introduce a new targetless LiDAR-camera calibration method, which combines multi-scale features and a 4D correlation volume to estimate the correlation between vehicle-side point clouds and roadside images. We model the temporal information and estimate the rotation angles with Mamba, effectively addressing calibration failures in V2X scenarios caused by defects in the vehicle-side data (such as occlusions) and large differences in viewpoint. We evaluate MamV2XCalib on the V2X-Seq and TUMTraf-V2X real-world datasets, demonstrating the effectiveness and robustness of our V2X-based automatic calibration approach. Compared to previous LiDAR-camera methods designed for calibration on one car, our approach achieves better and more stable calibration performance in V2X scenarios with fewer parameters. The code is available at https://github.com/zhuyaoye/MamV2XCalib.
中文: 本文提出MamV2XCalib,首个基于车联网技术的路侧摄像头自动校准方法,通过车载激光雷达数据无需人工干预或特定参照物,在实际车联网场景中展现出更优的校准性能和稳定性。
English: This paper introduces MamV2XCalib, a novel V2X-based automatic calibration method for roadside cameras that utilizes vehicle LiDAR data without requiring manual intervention or reference objects, demonstrating superior performance and robustness in real-world V2X scenarios.
Authors:Alva West, Luodan Zhang, Liuliu Zhang, Minjun Zhu, Yixuan Weng, Yue Zhang
Abstract:
Large language models (LLMs) have shown the capability to generate fluent and logical content, presenting significant challenges to machine-generated text detection, particularly text polished by adversarial perturbations such as paraphrasing. Current zero-shot detectors often employ Gaussian distributions as statistical measure for computing detection thresholds, which falters when confronted with the heavy-tailed statistical artifacts characteristic of adversarial or non-native English texts. In this paper, we introduce T-Detect, a novel detection method that fundamentally redesigns the curvature-based detectors. Our primary innovation is the replacement of standard Gaussian normalization with a heavy-tailed discrepancy score derived from the Student's t-distribution. This approach is theoretically grounded in the empirical observation that adversarial texts exhibit significant leptokurtosis, rendering traditional statistical assumptions inadequate. T-Detect computes a detection score by normalizing the log-likelihood of a passage against the expected moments of a t-distribution, providing superior resilience to statistical outliers. We validate our approach on the challenging RAID benchmark for adversarial text and the comprehensive HART dataset. Experiments show that T-Detect provides a consistent performance uplift over strong baselines, improving AUROC by up to 3.9\% in targeted domains. When integrated into a two-dimensional detection framework (CT), our method achieves state-of-the-art performance, with an AUROC of 0.926 on the Books domain of RAID. Our contributions are a new, theoretically-justified statistical foundation for text detection, an ablation-validated method that demonstrates superior robustness, and a comprehensive analysis of its performance under adversarial conditions. Ours code are released at https://github.com/ResearAI/t-detect.
中文: T-Detect采用基于学生t分布的重尾差异评分方法,有效提升对抗性文本的检测鲁棒性,在RAID基准测试中实现了最先进的性能,检测准确率显著提高。
English: T-Detect introduces a novel detection method using a heavy-tailed discrepancy score from the Student's t-distribution to enhance resilience against adversarial texts, achieving state-of-the-art performance with significant improvements in detection accuracy.
Authors:Woo Kyoung Han, Yongjun Lee, Byeonghun Lee, Sang Hyun Park, Sunghoon Im, Kyong Hwan Jin
Abstract:
Despite significant advances in learning-based lossy compression algorithms, standardizing codecs remains a critical challenge. In this paper, we present the JPEG Processing Neural Operator (JPNeO), a next-generation JPEG algorithm that maintains full backward compatibility with the current JPEG format. Our JPNeO improves chroma component preservation and enhances reconstruction fidelity compared to existing artifact removal methods by incorporating neural operators in both the encoding and decoding stages. JPNeO achieves practical benefits in terms of reduced memory usage and parameter count. We further validate our hypothesis about the existence of a space with high mutual information through empirical evidence. In summary, the JPNeO functions as a high-performance out-of-the-box image compression pipeline without changing source coding's protocol. Our source code is available at https://github.com/WooKyoungHan/JPNeO.
中文: JPNeO是一种新一代JPEG算法,在完全向后兼容的基础上,通过神经算子提升色度保留和重建保真度,同时减少内存占用和参数数量,且无需更改源编码协议。
English: JPNeO is a next-generation JPEG algorithm that ensures full backward compatibility while enhancing chroma preservation and reconstruction fidelity through neural operators, offering reduced memory usage and parameter counts without altering source coding protocols.
Authors:Yadong Niu, Tianzi Wang, Heinrich Dinkel, Xingwei Sun, Jiahao Zhou, Gang Li, Jizhong Liu, Xunying Liu, Junbo Zhang, Jian Luan
Abstract:
While large audio-language models have advanced open-ended audio understanding, they still fall short of nuanced human-level comprehension. This gap persists largely because current benchmarks, limited by data annotations and evaluation metrics, fail to reliably distinguish between generic and highly detailed model outputs. To this end, this work introduces MECAT, a Multi-Expert Constructed Benchmark for Fine-Grained Audio Understanding Tasks. Generated via a pipeline that integrates analysis from specialized expert models with Chain-of-Thought large language model reasoning, MECAT provides multi-perspective, fine-grained captions and open-set question-answering pairs. The benchmark is complemented by a novel metric: DATE (Discriminative-Enhanced Audio Text Evaluation). This metric penalizes generic terms and rewards detailed descriptions by combining single-sample semantic similarity with cross-sample discriminability. A comprehensive evaluation of state-of-the-art audio models is also presented, providing new insights into their current capabilities and limitations. The data and code are available at https://github.com/xiaomi-research/mecat
中文摘要:本研究提出了MECAT多专家构建基准及DATE新型评估指标,通过细粒度音频理解任务解决现有基准的不足,为音频模型的性能评估提供了新视角。
English Summary: This study introduces MECAT, a multi-expert constructed benchmark with a novel DATE metric, to address the limitations of current audio benchmarks by enabling fine-grained evaluation and revealing new insights into audio models' capabilities.
Authors:Ting Huang, Zeyu Zhang, Hao Tang
Abstract:
Large vision-language models (VLMs) have made significant strides in 2D visual understanding tasks, sparking interest in extending these capabilities to 3D scene understanding. However, current 3D VLMs often struggle with robust reasoning and generalization due to limitations in high-quality spatial data and the static nature of viewpoint assumptions. To address these challenges, we propose 3D-R1, a foundation model that enhances the reasoning capabilities of 3D VLMs. Specifically, we first construct a high-quality synthetic dataset with CoT, named Scene-30K, leveraging existing 3D-VL datasets and a data engine based on Gemini 2.5 Pro. It serves as cold-start initialization data for 3D-R1. Moreover, we leverage RLHF policy such as GRPO in the reinforcement learning training process to enhance reasoning capabilities and introduce three reward functions: a perception reward, a semantic similarity reward and a format reward to maintain detection accuracy and answer semantic precision. Furthermore, we introduce a dynamic view selection strategy that adaptively chooses the most informative perspectives for 3D scene understanding. Extensive experiments demonstrate that 3D-R1 delivers an average improvement of 10% across various 3D scene benchmarks, highlighting its effectiveness in enhancing reasoning and generalization in 3D scene understanding. Code: https://github.com/AIGeeksGroup/3D-R1. Website: https://aigeeksgroup.github.io/3D-R1.
中文: 3D-R1基础模型通过集成思维链合成数据集、带人类反馈的强化学习和动态视角选择策略,显著提升了三维视觉语言模型的推理与泛化能力,在多个三维场景基准测试中平均性能提高10%。
English: The 3D-R1 foundation model enhances 3D vision-language models by incorporating a synthetic dataset with Chain-of-Thought reasoning, reinforcement learning with human feedback, and dynamic view selection, achieving a 10% average improvement across 3D scene benchmarks.
Authors:Mingzhe Li, Xin Lu, Yanyan Zhao
Abstract:
Large language models (LLMs) with instruction following capabilities have demonstrated impressive problem-solving abilities. While synthesizing instructional data from unsupervised text has become a common approach for training such models, conventional methods rely heavily on human effort for data annotation. Although existing automated synthesis paradigms have alleviated this constraint, they still exhibit significant limitations in ensuring adequate diversity and difficulty of synthesized instructions. To address these challenges, we propose Self-Foveate, an innovative LLM-driven method for instruction synthesis. This approach introduces a "Micro-Scatter-Macro" multi-level foveation methodology that effectively guides the LLM to deeply excavate fine-grained information embedded in unsupervised text, thereby enhancing both the diversity and difficulty of synthesized instructions. Comprehensive experiments across multiple unsupervised corpora and diverse model architectures validate the effectiveness and superiority of our proposed method. We publicly release our data and codes: https://github.com/Mubuky/Self-Foveate
中文摘要:本文提出Self-Foveate方法,通过多级注视机制引导大语言模型深度挖掘无监督文本中的细粒度信息,有效提升合成指令的多样性与难度,并经多组实验验证其优越性。
English Summary: The paper introduces Self-Foveate, an LLM-driven method using a multi-level foveation approach to enhance the diversity and difficulty of synthesized instructions from unsupervised text, validated through comprehensive experiments.
Authors:Salah Eddine Bekhouche, Azeddine Benlamoudi, Yazid Bounab, Fadi Dornaika, Abdenour Hadid
Abstract:
Arabic poses a particular challenge for natural language processing (NLP) and information retrieval (IR) due to its complex morphology, optional diacritics and the coexistence of Modern Standard Arabic (MSA) and various dialects. Despite the growing global significance of Arabic, it is still underrepresented in NLP research and benchmark resources. In this paper, we present an enhanced Dense Passage Retrieval (DPR) framework developed specifically for Arabic. At the core of our approach is a novel Attentive Relevance Scoring (ARS) that replaces standard interaction mechanisms with an adaptive scoring function that more effectively models the semantic relevance between questions and passages. Our method integrates pre-trained Arabic language models and architectural refinements to improve retrieval performance and significantly increase ranking accuracy when answering Arabic questions. The code is made publicly available at \href{https://github.com/Bekhouche/APR}{GitHub}.
Chinese: 本文针对阿拉伯语提出了一种增强型密集段落检索框架,通过创新的注意力相关性评分机制,有效提升了阿拉伯语问答的语义相关性建模和检索性能。
English: This paper introduces an enhanced Dense Passage Retrieval framework for Arabic, featuring a novel Attentive Relevance Scoring mechanism that improves semantic relevance modeling and retrieval performance for Arabic question answering.
Authors:Yijie Zhu, Lingsen Zhang, Zitong Yu, Rui Shao, Tao Tan, Liqiang Nie
Abstract:
Emotional understanding and generation are often treated as separate tasks, yet they are inherently complementary and can mutually enhance each other. In this paper, we propose the UniEmo, a unified framework that seamlessly integrates these two tasks. The key challenge lies in the abstract nature of emotions, necessitating the extraction of visual representations beneficial for both tasks. To address this, we propose a hierarchical emotional understanding chain with learnable expert queries that progressively extracts multi-scale emotional features, thereby serving as a foundational step for unification. Simultaneously, we fuse these expert queries and emotional representations to guide the diffusion model in generating emotion-evoking images. To enhance the diversity and fidelity of the generated emotional images, we further introduce the emotional correlation coefficient and emotional condition loss into the fusion process. This step facilitates fusion and alignment for emotional generation guided by the understanding. In turn, we demonstrate that joint training allows the generation component to provide implicit feedback to the understanding part. Furthermore, we propose a novel data filtering algorithm to select high-quality and diverse emotional images generated by the well-trained model, which explicitly feedback into the understanding part. Together, these generation-driven dual feedback processes enhance the model's understanding capacity. Extensive experiments show that UniEmo significantly outperforms state-of-the-art methods in both emotional understanding and generation tasks. The code for the proposed method is available at https://github.com/JiuTian-VL/UniEmo.
Chinese: UniEmo框架通过分层情感理解链提取多尺度情感特征,并指导扩散模型生成情感图像,结合双重反馈机制提升理解与生成能力,在实验中显著优于现有先进方法。
English: The UniEmo framework unifies emotional understanding and generation by using a hierarchical chain to extract multi-scale emotional features and guide a diffusion model for image generation, with dual feedback mechanisms enhancing both tasks and achieving state-of-the-art results.
Authors:Ali Youssef
Abstract:
This paper introduces VMatcher, a hybrid Mamba-Transformer network for semi-dense feature matching between image pairs. Learning-based feature matching methods, whether detector-based or detector-free, achieve state-of-the-art performance but depend heavily on the Transformer's attention mechanism, which, while effective, incurs high computational costs due to its quadratic complexity. In contrast, Mamba introduces a Selective State-Space Model (SSM) that achieves comparable or superior performance with linear complexity, offering significant efficiency gains. VMatcher leverages a hybrid approach, integrating Mamba's highly efficient long-sequence processing with the Transformer's attention mechanism. Multiple VMatcher configurations are proposed, including hierarchical architectures, demonstrating their effectiveness in setting new benchmarks efficiently while ensuring robustness and practicality for real-time applications where rapid inference is crucial. Source Code is available at: https://github.com/ayoussf/VMatcher
中文: VMatcher是一种混合Mamba-Transformer网络,通过结合Mamba的线性复杂度高效性和Transformer的注意力机制,实现了半稠密特征匹配,以卓越的效率和鲁棒性为实时应用树立了新标杆。
English: VMatcher is a hybrid Mamba-Transformer network that combines the efficiency of Mamba's linear complexity with Transformer's attention for semi-dense feature matching, achieving state-of-the-art performance with enhanced speed and robustness for real-time applications.
Authors:Trae Research Team, Pengfei Gao, Zhao Tian, Xiangxin Meng, Xinchen Wang, Ruida Hu, Yuanan Xiao, Yizhou Liu, Zhao Zhang, Junjie Chen, Cuiyun Gao, Yun Lin, Yingfei Xiong, Chao Peng, Xia Liu
Abstract:
Software issue resolution is a critical challenge in software engineering and has garnered increasing attention in recent years. With the rapid advancement of large language models (LLMs), substantial progress has been made in addressing real-world software engineering tasks. Recent studies have introduced ensemble reasoning techniques to enhance the performance of LLM-based issue resolution. However, existing prompting-based methods still face limitations in effectively exploring large ensemble spaces and lack the capacity for repository-level understanding, both of which constrain their overall effectiveness. In this paper, we propose Trae Agent, the first agent-based ensemble reasoning approach for repository-level issue resolution. Trae Agent formulates our goal as an optimal solution search problem and addresses two key challenges, i.e., large ensemble spaces and repository-level understanding, through modular agents for generation, pruning, and selection. We conduct extensive experiments using three leading LLMs on the widely-adopted SWE-bench benchmark, comparing Trae Agent against four state-of-the-art ensemble reasoning techniques. Experimental results demonstrate that Trae Agent consistently achieves superior performance, with an average improvement of 10.22% over all baselines in terms of Pass@1. Trae Agent has achieved first place on the SWE-bench Verified leaderboard, with a notable Pass@1 score of 75.20%. We are pleased to release Trae Agent as an open-source project to support the research community, with all resources available at https://github.com/bytedance/trae-agent.
中文: 本文提出了首个基于智能体的集成推理方法Trae Agent,通过生成、剪枝和选择模块化智能体解决大型集成空间探索和仓库级理解两大挑战,在SWE-bench基准测试中以75.20%的Pass@1得分实现最优性能,较基线方法平均提升10.22%。
English: This paper introduces Trae Agent, the first agent-based ensemble reasoning approach that addresses software issue resolution by overcoming limitations in exploring large ensemble spaces and achieving repository-level understanding through modular agents, demonstrating superior performance with a 10.22% average improvement and a leading 75.20% Pass@1 score on the SWE-bench benchmark.
Authors:Ji Ma, Wei Suo, Peng Wang, Yanning Zhang
Abstract:
Although large vision-language models (LVLMs) have demonstrated impressive capabilities in multi-modal understanding and reasoning, their practical applications are still limited by massive model parameters and high computational costs. Recent efforts from natural language processing (NLP) have shown the effectiveness of layer pruning, offering a plausible training-free compression solution. However, due to the modality divergence between vision and language, it is unclear whether these NLP techniques are still effective in LVLMs. In this paper, we empirically prove that directly applying these layer pruning methods to LVLMs is ineffective. Through extensive experiments, we find that non-essential vision-language (VL) tokens and inter-layer feature gaps pose critical challenges to pruning layers in LVLMs. Based on these insights, we propose a novel framework Short-LVLM (SVL) that can utilize important VL tokens and mitigate the layer-wise feature gaps. Notably, Short-LVLM not only achieves a superior trade-off between performance and efficiency but also exhibits several potential advantages, i.e., training-free, model-agnostic, and highly compatible. The code for this work is publicly available at https://github.com/ASGO-MM/Short-LVLM.
Chinese: 大型视觉语言模型直接应用自然语言处理的层剪枝方法效果不佳,而Short-LVLM通过聚焦关键视觉语言标记和减小层间特征差异,实现了无需训练的高效性能提升。
English: Large vision-language models face inefficiency with NLP-based layer pruning, but Short-LVLM overcomes this by targeting key tokens and feature gaps to boost performance and efficiency without training.
Authors:Silin Chen, Shaoxin Lin, Xiaodong Gu, Yuling Shi, Heng Lian, Longfei Yun, Dong Chen, Weiguo Sun, Lin Cao, Qianxiang Wang
Abstract:
Recent advances in large language model (LLM) agents have shown remarkable progress in software issue resolution, leveraging advanced techniques such as multi-agent collaboration and Monte Carlo Tree Search (MCTS). However, current agents act as memoryless explorers - treating each problem separately without retaining or reusing knowledge from previous repair experiences. This leads to redundant exploration of failed trajectories and missed chances to adapt successful issue resolution methods to similar problems. To address this problem, we introduce SWE-Exp, an experience - enhanced approach that distills concise and actionable experience from prior agent trajectories, enabling continuous learning across issues. Our method introduces a multi-faceted experience bank that captures both successful and failed repair attempts. Specifically, it extracts reusable issue resolution knowledge at different levels - from high-level problem comprehension to specific code changes. Experiments show that SWE-Exp achieves state-of-the-art resolution rate (41.6% Pass@1) on SWE-bench-Verified under open-source agent frameworks. Our approach establishes a new paradigm in which automated software engineering agents systematically accumulate and leverage repair expertise, fundamentally shifting from trial-and-error exploration to strategic, experience-driven issue resolution.
Chinese: SWE-Exp提出了一种经验增强方法,通过从过往修复轨迹中提炼可操作知识实现持续学习,在SWE-bench上达到41.6%的最优解决率,将软件修复从试错探索转变为基于经验的战略解决模式。
English: SWE-Exp introduces an experience-enhanced approach that distills actionable knowledge from prior repair trajectories, enabling continuous learning and achieving a state-of-the-art 41.6% resolution rate on SWE-bench by shifting from trial-and-error to strategic problem-solving.
Authors:Han Li, Yuling Shi, Shaoxin Lin, Xiaodong Gu, Heng Lian, Xin Wang, Yantao Jia, Tao Huang, Qianxiang Wang
Abstract:
Issue resolution has made remarkable progress thanks to the advanced reasoning capabilities of large language models (LLMs). Recently, agent-based frameworks such as SWE-agent have further advanced this progress by enabling autonomous, tool-using agents to tackle complex software engineering tasks. While existing agent-based issue resolution approaches are primarily based on agents' independent explorations, they often get stuck in local solutions and fail to identify issue patterns that span across different parts of the codebase. To address this limitation, we propose SWE-Debate, a competitive multi-agent debate framework that encourages diverse reasoning paths and achieves more consolidated issue localization. SWE-Debate first creates multiple fault propagation traces as localization proposals by traversing a code dependency graph. Then, it organizes a three-round debate among specialized agents, each embodying distinct reasoning perspectives along the fault propagation trace. This structured competition enables agents to collaboratively converge on a consolidated fix plan. Finally, this consolidated fix plan is integrated into an MCTS-based code modification agent for patch generation. Experiments on the SWE-bench benchmark show that SWE-Debate achieves new state-of-the-art results in open-source agent frameworks and outperforms baselines by a large margin.
Chinese: SWE-Debate提出了一种竞争性多智能体辩论框架,通过生成多样化故障传播路径并组织专业智能体进行结构化辩论,克服了局部解决方案局限,在SWE-bench基准测试中实现了软件问题修复的最先进性能。
English: SWE-Debate introduces a competitive multi-agent debate framework that overcomes local solution traps by generating diverse fault propagation traces and enabling structured debates among specialized agents, achieving state-of-the-art performance in software issue resolution on the SWE-bench benchmark.
Authors:Yingjie Zhou, Jiezhang Cao, Zicheng Zhang, Farong Wen, Yanwei Jiang, Jun Jia, Xiaohong Liu, Xiongkuo Min, Guangtao Zhai
Abstract:
Speech-driven methods for portraits are figuratively known as "Talkers" because of their capability to synthesize speaking mouth shapes and facial movements. Especially with the rapid development of the Text-to-Image (T2I) models, AI-Generated Talking Heads (AGTHs) have gradually become an emerging digital human media. However, challenges persist regarding the quality of these talkers and AGTHs they generate, and comprehensive studies addressing these issues remain limited. To address this gap, this paper presents the largest AGTH quality assessment dataset THQA-10K to date, which selects 12 prominent T2I models and 14 advanced talkers to generate AGTHs for 14 prompts. After excluding instances where AGTH generation is unsuccessful, the THQA-10K dataset contains 10,457 AGTHs. Then, volunteers are recruited to subjectively rate the AGTHs and give the corresponding distortion categories. In our analysis for subjective experimental results, we evaluate the performance of talkers in terms of generalizability and quality, and also expose the distortions of existing AGTHs. Finally, an objective quality assessment method based on the first frame, Y-T slice and tone-lip consistency is proposed. Experimental results show that this method can achieve state-of-the-art (SOTA) performance in AGTH quality assessment. The work is released at https://github.com/zyj-2000/Talker.
中文摘要:本文提出了迄今最大的AI生成说话头部质量评估数据集THQA-10K,通过主观评估分析了12种文本生成图像模型和14种说话人生成器的表现,并提出基于首帧、Y-T切片和音唇一致性的客观评估方法,实现了最优性能。
English Summary: This paper introduces THQA-10K, the largest quality assessment dataset for AI-generated talking heads, which evaluates 12 text-to-image models and 14 talkers through subjective ratings and proposes an objective assessment method achieving state-of-the-art performance.
Authors:Tao He, Rongchuan Mu, Lizi Liao, Yixin Cao, Ming Liu, Bing Qin
Abstract:
Large reasoning models (LRMs) have recently shown promise in solving complex math problems when optimized with Reinforcement Learning (RL). But conventional approaches rely on outcome-only rewards that provide sparse feedback, resulting in inefficient optimization process. In this work, we investigate the function of process reward models (PRMs) to accelerate the RL training for LRMs. We propose a novel intrinsic signal-driven generative process evaluation mechanism operating at the thought level to address major bottlenecks in RL-based training. Specifically, instead of requiring PRMs to know how to solve problems, our method uses intrinsic signals in solutions to judge stepwise correctness and aggregate contiguous correct/incorrect steps into coherent 'thought' units. This structured, thought-level rewards enable more reliable credit assignment by reducing ambiguity in step segmentation and alleviating reward hacking. We further introduce a capability-adaptive reward mechanism that dynamically balances exploration and exploitation based on the LRM's current proficiency, guiding learning without stifling creative trial-and-error. These innovations are integrated into a new off-policy RL algorithm, TP-GRPO, which extends grouped proximal optimization with process-based rewards and improves training efficiency. Experiments on 1.5B and 7B parameter LRMs demonstrate that our method achieves higher problem-solving accuracy with significantly fewer training samples than outcome-only reward baselines. The results validate that well-structured process rewards can substantially accelerate LRM optimization in math reasoning tasks. Code is available at https://github.com/cs-holder/tp_grpo.
中文: 大型推理模型通过强化学习优化,采用思维层面的过程奖励模型评估逐步正确性,并根据模型能力自适应调整奖励,从而以更少训练样本实现更高解题准确率。
English: Large reasoning models optimized with reinforcement learning can solve complex math problems more efficiently using process reward models that evaluate stepwise correctness at the thought level and adapt rewards based on the model's capability, leading to higher accuracy with fewer training samples.
Authors:Vineet Kumar Rakesh, Soumya Mazumdar, Tapas Samanta, Sarbajit Pal, Amitabha Das
Abstract:
Lightweight convolutional and transformer-based models have become vital for real-time image classification in resource-constrained applications, such as embedded systems and edge devices. This work analyzes the influence of hyperparameter adjustment on the accuracy and convergence behavior of seven efficient deep learning architectures: EfficientNetV2-S, ConvNeXt-T, MobileViT v2 (XXS/XS/S), MobileNetV3-L, TinyViT-21M, and RepVGG-A2. All models are trained on the ImageNet-1K dataset under consistent training settings, with an emphasis on real-time practicality. An comprehensive ablation study is undertaken to separate the effect of critical hyperparameters, including learning rate schedules, batch sizes, input resolution, data augmentation, regularization approaches, and optimizer choice. To assess appropriateness for real-time applications, each model is assessed not only in terms of Top-1 and Top-5 classification accuracy, but also in terms of inference time, parameter count, model size, and frames-per-second (FPS) on a GPU-accelerated edge deployment simulation. Results demonstrate that cosine learning rate decay and adjustable batch size may greatly boost both accuracy and convergence speed, while keeping low latency and memory cost. Notably, RepVGG-A2 achieves over 80% Top-1 accuracy with efficient inference performance, offering a compelling balance between accuracy and deployment cost for VGG-style models. The results give practical guidance for constructing resource-efficient deep learning models appropriate for real-time image processing pipelines. All code and training logs are publicly accessible at https://github.com/VineetKumarRakesh/lcnn-opt.
中文: 本研究评估了超参数调整对七种高效深度学习模型在实时图像分类中性能的影响,发现如余弦学习率衰减等策略可在保持低资源消耗的同时提升准确性和收敛速度。
English: This study evaluates how hyperparameter tuning affects the performance of seven efficient deep learning models for real-time image classification, finding that strategies like cosine learning rate decay enhance accuracy and speed while maintaining low resource use.
Authors:Alfio Ferrara, Sergio Picascia, Elisabetta Rocchetti
Abstract:
Text-to-image diffusion models have demonstrated remarkable capabilities in generating artistic content by learning from billions of images, including popular artworks. However, the fundamental question of how these models internally represent concepts, such as content and style in paintings, remains unexplored. Traditional computer vision assumes content and style are orthogonal, but diffusion models receive no explicit guidance about this distinction during training. In this work, we investigate how transformer-based text-to-image diffusion models encode content and style concepts when generating artworks. We leverage cross-attention heatmaps to attribute pixels in generated images to specific prompt tokens, enabling us to isolate image regions influenced by content-describing versus style-describing tokens. Our findings reveal that diffusion models demonstrate varying degrees of content-style separation depending on the specific artistic prompt and style requested. In many cases, content tokens primarily influence object-related regions while style tokens affect background and texture areas, suggesting an emergent understanding of the content-style distinction. These insights contribute to our understanding of how large-scale generative models internally represent complex artistic concepts without explicit supervision. We share the code and dataset, together with an exploratory tool for visualizing attention maps at https://github.com/umilISLab/artistic-prompt-interpretation.
中文: 本研究通过交叉注意力分析发现,基于Transformer的文生图扩散模型在没有明确训练指导的情况下,能自发区分艺术创作中的内容与风格——内容词主要聚焦物体区域,而风格词则影响背景纹理,揭示了模型对艺术概念的内在表征机制。
English: This study explores how transformer-based text-to-image diffusion models internally represent artistic concepts like content and style, revealing through cross-attention analysis that they exhibit emergent separation between object-focused content tokens and texture-affecting style tokens without explicit training guidance.
Authors:Xihang Hu, Fuming Sun, Jiazhe Liu, Feilong Xu, Xiaoli Zhang
Abstract:
Semi-supervised Camouflaged Object Detection (SSCOD) aims to reduce reliance on costly pixel-level annotations by leveraging limited annotated data and abundant unlabeled data. However, existing SSCOD methods based on Teacher-Student frameworks suffer from severe prediction bias and error propagation under scarce supervision, while their multi-network architectures incur high computational overhead and limited scalability. To overcome these limitations, we propose ST-SAM, a highly annotation-efficient yet concise framework that breaks away from conventional SSCOD constraints. Specifically, ST-SAM employs Self-Training strategy that dynamically filters and expands high-confidence pseudo-labels to enhance a single-model architecture, thereby fundamentally circumventing inter-model prediction bias. Furthermore, by transforming pseudo-labels into hybrid prompts containing domain-specific knowledge, ST-SAM effectively harnesses the Segment Anything Model's potential for specialized tasks to mitigate error accumulation in self-training. Experiments on COD benchmark datasets demonstrate that ST-SAM achieves state-of-the-art performance with only 1\% labeled data, outperforming existing SSCOD methods and even matching fully supervised methods. Remarkably, ST-SAM requires training only a single network, without relying on specific models or loss functions. This work establishes a new paradigm for annotation-efficient SSCOD. Codes will be available at https://github.com/hu-xh/ST-SAM.
Chinese: ST-SAM提出了一种自训练框架,通过高置信度伪标签和混合提示利用Segment Anything模型,仅需少量标注数据和单一网络即可实现最先进的半监督伪装目标检测。
English: ST-SAM introduces a self-training framework that uses high-confidence pseudo-labels and hybrid prompts to leverage the Segment Anything Model, achieving state-of-the-art semi-supervised camouflaged object detection with minimal labeled data and a single network.
Authors:Hanshen Zhu, Zhen Zhu, Kaile Zhang, Yiming Gong, Yuliang Liu, Xiang Bai
Abstract:
We tackle the task of geometric image editing, where an object within an image is repositioned, reoriented, or reshaped while preserving overall scene coherence. Previous diffusion-based editing methods often attempt to handle all relevant subtasks in a single step, proving difficult when transformations become large or structurally complex. We address this by proposing a decoupled pipeline that separates object transformation, source region inpainting, and target region refinement. Both inpainting and refinement are implemented using a training-free diffusion approach, FreeFine. In experiments on our new GeoBench benchmark, which contains both 2D and 3D editing scenarios, FreeFine outperforms state-of-the-art alternatives in image fidelity, and edit precision, especially under demanding transformations. Code and benchmark are available at: https://github.com/CIawevy/FreeFine
中文摘要:本文提出一种解耦的几何图像编辑流程,通过无需训练的扩散方法FreeFine分别处理对象变换、源区域修复和目标区域优化,在新型GeoBench基准测试中展现出卓越的图像保真度和编辑精度。
English Summary: This paper introduces a decoupled pipeline for geometric image editing that separates object transformation, inpainting, and refinement using a training-free diffusion method called FreeFine, which demonstrates superior performance in image fidelity and edit precision on the new GeoBench benchmark.
Authors:Kazushi Kato, Koji Inoue, Divesh Lala, Keiko Ochi, Tatsuya Kawahara
Abstract:
In human dialogue, nonverbal information such as nodding and facial expressions is as crucial as verbal information, and spoken dialogue systems are also expected to express such nonverbal behaviors. We focus on nodding, which is critical in an attentive listening system, and propose a model that predicts both its timing and type in real time. The proposed model builds on the voice activity projection (VAP) model, which predicts voice activity from both listener and speaker audio. We extend it to prediction of various types of nodding in a continuous and real-time manner unlike conventional models. In addition, the proposed model incorporates multi-task learning with verbal backchannel prediction and pretraining on general dialogue data. In the timing and type prediction task, the effectiveness of multi-task learning was significantly demonstrated. We confirmed that reducing the processing rate enables real-time operation without a substantial drop in accuracy, and integrated the model into an avatar attentive listening system. Subjective evaluations showed that it outperformed the conventional method, which always does nodding in sync with verbal backchannel. The code and trained models are available at https://github.com/MaAI-Kyoto/MaAI.
中文摘要:本研究提出了一种实时点头预测模型,通过扩展VAP模型并采用多任务学习,能连续预测点头时机与类型,在专注倾听系统中表现优于传统同步点头方法。
English Summary: This study introduces a real-time nodding prediction model for attentive listening systems, which extends the VAP model to continuously predict nodding timing and types using multi-task learning and achieves better performance than conventional methods.
Authors:RJ Skerry-Ryan, Julian Salazar, Soroosh Mariooryad, David Kao, Daisy Stanton, Eric Battenberg, Matt Shannon, Ron J. Weiss, Robin Scheibler, Jonas Rothfuss, Tom Bagby
Abstract:
We introduce a neural network layer API and library for sequence modeling, designed for easy creation of sequence models that can be executed both layer-by-layer (e.g., teacher-forced training) and step-by-step (e.g., autoregressive sampling). To achieve this, layers define an explicit representation of their state over time (e.g., a Transformer KV cache, a convolution buffer, an RNN hidden state), and a step method that evolves that state, tested to give identical results to a stateless layer-wise invocation. This and other aspects of the SequenceLayers contract enables complex models to be immediately streamable, mitigates a wide range of common bugs arising in both streaming and parallel sequence processing, and can be implemented in any deep learning library. A composable and declarative API, along with a comprehensive suite of layers and combinators, streamlines the construction of production-scale models from simple streamable components while preserving strong correctness guarantees. Our current implementations of SequenceLayers (JAX, TensorFlow 2) are available at https://github.com/google/sequence-layers.
中文: 本文介绍了一种用于序列建模的神经网络层API和库,通过定义明确的状态表示和逐步更新方法,支持逐层和逐步两种执行模式,确保结果一致并减少流式与并行处理中的常见错误。
English: This paper presents a neural network layer API and library for sequence modeling that enables both layer-by-layer and step-by-step execution by defining explicit state representations and step methods, ensuring identical results and mitigating common bugs in streaming and parallel processing.
Authors:Dohwan Ko, Ji Soo Lee, Minhyuk Choi, Zihang Meng, Hyunwoo J. Kim
Abstract:
Text-Video Retrieval aims to find the most relevant text (or video) candidate given a video (or text) query from large-scale online databases. Recent work leverages multi-modal large language models (MLLMs) to improve retrieval, especially for long or complex query-candidate pairs. However, we observe that the naive application of MLLMs, i.e., retrieval based on candidate likelihood, introduces candidate prior bias, favoring candidates with inherently higher priors over those more relevant to the query. To this end, we propose a novel retrieval framework, Bidirectional Likelihood Estimation with MLLM (BLiM), which leverages both query and candidate likelihoods by training the model to generate text from a given video as well as video features from a given text. Furthermore, we introduce Candidate Prior Normalization (CPN), a simple yet effective training-free score calibration module designed to mitigate candidate prior bias in candidate likelihood. On four Text-Video Retrieval benchmarks, our BLiM equipped with CPN outperforms previous state-of-the-art models by 6.4 R@1 on average, effectively alleviating candidate prior bias and emphasizing query-candidate relevance. Our in-depth analysis across various multi-modal tasks beyond retrieval highlights the broad applicability of CPN which enhances visual understanding by reducing reliance on textual priors. Code is available at https://github.com/mlvlab/BLiM.
中文摘要:本研究提出的BLiM框架结合候选先验归一化(CPN),通过双向似然估计解决文本-视频检索中的候选先验偏差问题,在多个基准测试中实现了最优性能。
English Summary: The proposed BLiM framework with Candidate Prior Normalization (CPN) addresses candidate prior bias in text-video retrieval by leveraging bidirectional likelihood estimation, achieving state-of-the-art performance across multiple benchmarks.
Authors:Dohwan Ko, Ji Soo Lee, Minhyuk Choi, Zihang Meng, Hyunwoo J. Kim
Abstract:
Text-Video Retrieval aims to find the most relevant text (or video) candidate given a video (or text) query from large-scale online databases. Recent work leverages multi-modal large language models (MLLMs) to improve retrieval, especially for long or complex query-candidate pairs. However, we observe that the naive application of MLLMs, i.e., retrieval based on candidate likelihood, introduces candidate prior bias, favoring candidates with inherently higher priors over those more relevant to the query. To this end, we propose a novel retrieval framework, Bidirectional Likelihood Estimation with MLLM (BLiM), which leverages both query and candidate likelihoods by training the model to generate text from a given video as well as video features from a given text. Furthermore, we introduce Candidate Prior Normalization (CPN), a simple yet effective training-free score calibration module designed to mitigate candidate prior bias in candidate likelihood. On four Text-Video Retrieval benchmarks, our BLiM equipped with CPN outperforms previous state-of-the-art models by 6.4 R@1 on average, effectively alleviating candidate prior bias and emphasizing query-candidate relevance. Our in-depth analysis across various multi-modal tasks beyond retrieval highlights the broad applicability of CPN which enhances visual understanding by reducing reliance on textual priors. Code is available at https://github.com/mlvlab/BLiM.
中文摘要:本研究提出的BLiM框架结合候选先验归一化(CPN),通过双向似然估计解决文本-视频检索中的候选先验偏差问题,在多个基准测试中实现了最优性能。
English Summary: The proposed BLiM framework with Candidate Prior Normalization (CPN) addresses candidate prior bias in text-video retrieval by leveraging bidirectional likelihood estimation, achieving state-of-the-art performance across multiple benchmarks.
Authors:Zunhai Su, Qingyuan Li, Hao Zhang, YuLei Qian, Yuchen Xie, Kehong Yuan
Abstract:
Sparsely activated Mixture-of-Experts (MoE) models have shown promise in enhancing the learning capacity of large language models (LLMs). Leveraging the intrinsic importance differences among experts, recent research has explored expert-level compression techniques to improve the efficiency of MoE LLMs. However, existing approaches often rely on empirical criteria to identify critical experts, lacking a deeper exploration and understanding of the heterogeneous importance of experts. In this study, we present the first discovery and investigation of a distinct subset of experts that play a crucial role in the underlying mechanisms during the model's forward inference. These experts are prevalent in open-source MoE LLMs, and despite their limited number, pruning them leads to a significant decline in model performance (e.g., pruning three causes Qwen3-30B-A3B to produce repetitive and uninformative outputs). We refer to these experts as Super Experts (SEs). Our comprehensive analysis provides progressively deeper insights into SEs. (i) SEs are characterized by rare but extreme activation outliers in the output of the down_proj, which give rise to massive activations in the hidden states between decoder layers. Moreover, the distribution of SEs remains model-specific and is unaffected by post-training processes. (ii) By pruning SEs, we assess their significance across a variety of tasks, revealing their considerable impact on the model's overall performance, particularly in mathematical reasoning. (iii) We further enhance our understanding of the influence of SEs compression. Our findings confirm that MoE LLMs rely on SEs to induce attention sinks, which are crucial for the distribution of attention scores but are significantly disrupted by SE pruning. The code is available at https://github.com/ZunhaiSu/Super-Experts-Profilling.
中文摘要:本研究在稀疏激活的专家混合模型中首次发现并分析了一类关键子集——超级专家,其剪除会通过破坏注意力机制显著降低模型性能,尤其在数学推理任务中表现突出。
English Summary: This study identifies a critical subset of experts called Super Experts (SEs) in Mixture-of-Experts models, whose pruning severely degrades model performance by disrupting attention mechanisms, particularly in mathematical reasoning tasks.
Authors:Jiawei Liu, Chenwang Wu, Defu Lian, Enhong Chen
Abstract:
Due to growing privacy concerns, machine unlearning, which aims at enabling machine learning models to ``forget" specific training data, has received increasing attention. Among existing methods, influence-based unlearning has emerged as a prominent approach due to its ability to estimate the impact of individual training samples on model parameters without retraining. However, this approach suffers from prohibitive computational overhead arising from the necessity to compute the Hessian matrix and its inverse across all training samples and parameters, rendering it impractical for large-scale models and scenarios involving frequent data deletion requests. This highlights the difficulty of forgetting. Inspired by cognitive science, which suggests that memorizing is easier than forgetting, this paper establishes a theoretical link between memorizing (incremental learning) and forgetting (unlearning). This connection allows machine unlearning to be addressed from the perspective of incremental learning. Unlike the time-consuming Hessian computations in unlearning (forgetting), incremental learning (memorizing) typically relies on more efficient gradient optimization, which supports the aforementioned cognitive theory. Based on this connection, we introduce the Influence Approximation Unlearning (IAU) algorithm for efficient machine unlearning from the incremental perspective. Extensive empirical evaluations demonstrate that IAU achieves a superior balance among removal guarantee, unlearning efficiency, and comparable model utility, while outperforming state-of-the-art methods across diverse datasets and model architectures. Our code is available at https://github.com/Lolo1222/IAU.
Chinese: 本文提出了影响近似遗忘(IAU)算法,通过建立增量学习与遗忘之间的理论联系,在保持模型性能的同时高效移除特定训练数据,克服了传统基于影响的遗忘方法存在的计算瓶颈。
English: This paper introduces the Influence Approximation Unlearning (IAU) algorithm, which leverages the connection between incremental learning and unlearning to efficiently remove specific training data from machine learning models while maintaining performance, overcoming the computational challenges of traditional influence-based methods.
Authors:Shimanto Bhowmik, Tawsif Tashwar Dipto, Md Sazzad Islam, Sheryl Hsu, Tahsin Reasat
Abstract:
Bengali is an underrepresented language in NLP research. However, it remains a challenge due to its unique linguistic structure and computational constraints. In this work, we systematically investigate the challenges that hinder Bengali NLP performance by focusing on the absence of standardized evaluation benchmarks. We then evaluated 10 recent open source Large Language Models (LLMs) in 8 of the translated datasets and performed a comprehensive error analysis to pinpoint their primary failure modes. Our findings reveal consistent performance gaps for Bengali compared to English, particularly for smaller models and specific model families like Mistral. We also identified promising robustness in certain architectures, such as DeepSeek, that maintain more stable performance across languages. Our analysis reveals an inverse relationship between tokenization efficiency and LLM accuracy where models tend to perform worse when inputs are excessively tokenized, whereas more efficient \& concise tokenization results in improved performance. These findings highlight critical areas where current models fall short and underscore the need for improved dataset quality and evaluation methodologies tailored to multilingual contexts. This work will catalyze further research on NLP for underrepresented languages, helping to democratize access to advanced language technologies worldwide. The code and dataset used in this research is publicly available at https://github.com/BengaliAI/bn-llm-benchmark.
中文:本研究揭示了孟加拉语自然语言处理因缺乏标准化基准和过度分词导致的性能差距,发现DeepSeek等架构表现稳健而小型模型效果欠佳,强调了改进多语言评估方法的必要性。
English: This study identifies the performance gaps in Bengali NLP due to a lack of standardized benchmarks and excessive tokenization, revealing that models like DeepSeek show robustness while smaller models struggle, underscoring the need for improved multilingual evaluation methods.
Authors:Yang Ren, Hai Jiang, Wei Li, Menglong Yang, Heng Zhang, Zehua Sheng, Qingsheng Ye, Shuaicheng Liu
Abstract:
Image downscaling is critical for efficient storage and transmission of high-resolution (HR) images. Existing learning-based methods focus on performing downscaling within the sRGB domain, which typically suffers from blurred details and unexpected artifacts. RAW images, with their unprocessed photonic information, offer greater flexibility but lack specialized downscaling frameworks. In this paper, we propose a wavelet-based recurrent reconstruction framework that leverages the information lossless attribute of wavelet transformation to fulfill the arbitrary-scale RAW image downscaling in a coarse-to-fine manner, in which the Low-Frequency Arbitrary-Scale Downscaling Module (LASDM) and the High-Frequency Prediction Module (HFPM) are proposed to preserve structural and textural integrity of the reconstructed low-resolution (LR) RAW images, alongside an energy-maximization loss to align high-frequency energy between HR and LR domain. Furthermore, we introduce the Realistic Non-Integer RAW Downscaling (Real-NIRD) dataset, featuring a non-integer downscaling factor of 1.3$\times$, and incorporate it with publicly available datasets with integer factors (2$\times$, 3$\times$, 4$\times$) for comprehensive benchmarking arbitrary-scale image downscaling purposes. Extensive experiments demonstrate that our method outperforms existing state-of-the-art competitors both quantitatively and visually. The code and dataset will be released at https://github.com/RenYangSCU/ASRD.
中文摘要:本文提出了一种基于小波变换的循环重建框架,通过低频任意尺度下采样模块和高频预测模块实现RAW图像的无损下采样,并在包含非整数缩放因子的新数据集上验证了该方法在定量和视觉评估中均优于现有技术。
English Summary: This paper introduces a wavelet-based recurrent reconstruction framework for arbitrary-scale RAW image downscaling that preserves structural and textural integrity through specialized modules and an energy-maximization loss, outperforming existing methods as demonstrated on the newly created Real-NIRD dataset.
Authors:Wei-Wei Du, Takuma Udagawa, Kei Tateno
Abstract:
Time intervals between purchasing items are a crucial factor in sequential recommendation tasks, whereas existing approaches focus on item sequences and often overlook by assuming the intervals between items are static. However, dynamic intervals serve as a dimension that describes user profiling on not only the history within a user but also different users with the same item history. In this work, we propose IntervalLLM, a novel framework that integrates interval information into LLM and incorporates the novel interval-infused attention to jointly consider information of items and intervals. Furthermore, unlike prior studies that address the cold-start scenario only from the perspectives of users and items, we introduce a new viewpoint: the interval perspective to serve as an additional metric for evaluating recommendation methods on the warm and cold scenarios. Extensive experiments on 3 benchmarks with both traditional- and LLM-based baselines demonstrate that our IntervalLLM achieves not only 4.4% improvements in average but also the best-performing warm and cold scenarios across all users, items, and the proposed interval perspectives. In addition, we observe that the cold scenario from the interval perspective experiences the most significant performance drop among all recommendation methods. This finding underscores the necessity of further research on interval-based cold challenges and our integration of interval information in the realm of sequential recommendation tasks. Our code is available here: https://github.com/sony/ds-research-code/tree/master/recsys25-IntervalLLM.
中文摘要:IntervalLLM是一种创新框架,将购买间的动态时间间隔融入序列推荐,通过间隔增强注意力机制提升用户画像构建,在常规和冷启动场景下均实现最优性能表现。
English Summary: IntervalLLM is a novel framework that integrates dynamic time intervals between purchases into sequential recommendations, enhancing user profiling and achieving superior performance in both warm and cold scenarios through interval-infused attention.
Authors:Youngsun Jang, Dongyoun Kim, Chulwoo Pack, Kwanghee Won
Abstract:
This study introduces a novel dataset for segmenting flooded areas in satellite images. After reviewing 77 existing benchmarks utilizing satellite imagery, we identified a shortage of suitable datasets for this specific task. To fill this gap, we collected satellite imagery of the 2019 Midwestern USA floods from Planet Explorer by Planet Labs (Image \c{opyright} 2024 Planet Labs PBC). The dataset consists of 10 satellite images per location, each containing both flooded and non-flooded areas. We selected ten locations from each of the five states: Iowa, Kansas, Montana, Nebraska, and South Dakota. The dataset ensures uniform resolution and resizing during data processing. For evaluating semantic segmentation performance, we tested state-of-the-art models in computer vision and remote sensing on our dataset. Additionally, we conducted an ablation study varying window sizes to capture temporal characteristics. Overall, the models demonstrated modest results, suggesting a requirement for future multimodal and temporal learning strategies. The dataset will be publicly available on .
Chinese: 本研究针对卫星图像中洪水区域分割任务,创建了一个基于2019年美国中西部洪水的新数据集,并通过评估先进模型发现结果一般,表明未来需发展多模态和时间序列学习策略。
English: This study introduces a new dataset for flood area segmentation in satellite imagery, addressing a gap in existing resources by compiling images from the 2019 Midwestern USA floods and evaluating state-of-the-art models, which showed modest results indicating a need for improved multimodal and temporal learning approaches.
Authors:Viraj Joshi, Zifan Xu, Bo Liu, Peter Stone, Amy Zhang
Abstract:
Multi-task Reinforcement Learning (MTRL) has emerged as a critical training paradigm for applying reinforcement learning (RL) to a set of complex real-world robotic tasks, which demands a generalizable and robust policy. At the same time, \emph{massively parallelized training} has gained popularity, not only for significantly accelerating data collection through GPU-accelerated simulation but also for enabling diverse data collection across multiple tasks by simulating heterogeneous scenes in parallel. However, existing MTRL research has largely been limited to off-policy methods like SAC in the low-parallelization regime. MTRL could capitalize on the higher asymptotic performance of on-policy algorithms, whose batches require data from the current policy, and as a result, take advantage of massive parallelization offered by GPU-accelerated simulation. To bridge this gap, we introduce a massively parallelized $\textbf{M}$ulti-$\textbf{T}$ask $\textbf{Bench}$mark for robotics (MTBench), an open-sourced benchmark featuring a broad distribution of 50 manipulation tasks and 20 locomotion tasks, implemented using the GPU-accelerated simulator IsaacGym. MTBench also includes four base RL algorithms combined with seven state-of-the-art MTRL algorithms and architectures, providing a unified framework for evaluating their performance. Our extensive experiments highlight the superior speed of evaluating MTRL approaches using MTBench, while also uncovering unique challenges that arise from combining massive parallelism with MTRL. Code is available at https://github.com/Viraj-Joshi/MTBench
多任务强化学习可通过在线策略算法与GPU加速的大规模并行化显著提升性能,为此我们开发了开源基准平台MTBench,用于在多样化机器人任务中评估各类MTRL方法的综合表现。
Multi-task reinforcement learning can significantly benefit from on-policy algorithms and GPU-accelerated massive parallelization, leading to the development of MTBench, an open-source benchmark that evaluates various MTRL methods across diverse robotic tasks.
Authors:Xiaochen Zhao, Hongyi Xu, Guoxian Song, You Xie, Chenxu Zhang, Xiu Li, Linjie Luo, Jinli Suo, Yebin Liu
Abstract:
We propose X-NeMo, a novel zero-shot diffusion-based portrait animation pipeline that animates a static portrait using facial movements from a driving video of a different individual. Our work first identifies the root causes of the key issues in prior approaches, such as identity leakage and difficulty in capturing subtle and extreme expressions. To address these challenges, we introduce a fully end-to-end training framework that distills a 1D identity-agnostic latent motion descriptor from driving image, effectively controlling motion through cross-attention during image generation. Our implicit motion descriptor captures expressive facial motion in fine detail, learned end-to-end from a diverse video dataset without reliance on pretrained motion detectors. We further enhance expressiveness and disentangle motion latents from identity cues by supervising their learning with a dual GAN decoder, alongside spatial and color augmentations. By embedding the driving motion into a 1D latent vector and controlling motion via cross-attention rather than additive spatial guidance, our design eliminates the transmission of spatial-aligned structural clues from the driving condition to the diffusion backbone, substantially mitigating identity leakage. Extensive experiments demonstrate that X-NeMo surpasses state-of-the-art baselines, producing highly expressive animations with superior identity resemblance. Our code and models are available for research.
中文:X-NeMo是一种基于零样本扩散的肖像动画流程,通过提取驱动视频中的一维身份无关潜在运动描述符,能够以不同个体的面部动作驱动静态肖像动画,有效解决了身份泄露问题并精细捕捉各种表情。
English: X-NeMo is a zero-shot diffusion-based portrait animation pipeline that uses a 1D identity-agnostic latent motion descriptor to animate static portraits with facial movements from different individuals, effectively addressing identity leakage and capturing subtle expressions through an end-to-end training framework.
Authors:Xinwei Wu, Haojie Li, Hongyu Liu, Xinyu Ji, Ruohan Li, Yule Chen, Yigeng Zhang
Abstract:
In this work, we study a critical research problem regarding the trustworthiness of large language models (LLMs): how LLMs behave when encountering ambiguous narrative text, with a particular focus on Chinese textual ambiguity. We created a benchmark dataset by collecting and generating ambiguous sentences with context and their corresponding disambiguated pairs, representing multiple possible interpretations. These annotated examples are systematically categorized into 3 main categories and 9 subcategories. Through experiments, we discovered significant fragility in LLMs when handling ambiguity, revealing behavior that differs substantially from humans. Specifically, LLMs cannot reliably distinguish ambiguous text from unambiguous text, show overconfidence in interpreting ambiguous text as having a single meaning rather than multiple meanings, and exhibit overthinking when attempting to understand the various possible meanings. Our findings highlight a fundamental limitation in current LLMs that has significant implications for their deployment in real-world applications where linguistic ambiguity is common, calling for improved approaches to handle uncertainty in language understanding. The dataset and code are publicly available at this GitHub repository: https://github.com/ictup/LLM-Chinese-Textual-Disambiguation.
中文摘要:该研究发现大型语言模型在处理中文文本歧义时表现出显著脆弱性,难以区分歧义、过度自信于单一解释且对多种含义过度思考,揭示了其在现实应用中的关键局限性。
English Summary: This research reveals that large language models exhibit significant fragility when processing ambiguous Chinese text, struggling with distinguishing ambiguity, showing overconfidence in single interpretations, and overthinking multiple meanings, highlighting a critical limitation for real-world applications.
Authors:Richard Williams, Eric Nalisnick, Andrew Holbrook
Abstract:
Weighted graphs are ubiquitous throughout biology, chemistry, and the social sciences, motivating the development of generative models for abstract weighted graph data using deep neural networks. However, most current deep generative models are either designed for unweighted graphs and are not easily extended to weighted topologies or incorporate edge weights without consideration of a joint distribution with topology. Furthermore, learning a distribution over weighted graphs must account for complex nonlocal dependencies between both the edges of the graph and corresponding weights of each edge. We develop an autoregressive model BiGG-E, a nontrivial extension of the BiGG model, that learns a joint distribution over weighted graphs while still exploiting sparsity to generate a weighted graph with $n$ nodes and $m$ edges in $O((n + m)\log n)$ time. Simulation studies and experiments on a variety of benchmark datasets demonstrate that BiGG-E best captures distributions over weighted graphs while remaining scalable and computationally efficient.
Chinese: BiGG-E 是一种自回归模型,通过学习加权图的联合分布并利用稀疏性,在 O((n + m) log n) 时间内高效生成图结构,同时在捕捉复杂图分布方面优于现有方法。
English: BiGG-E is an autoregressive model that learns the joint distribution of weighted graphs by exploiting sparsity, enabling efficient generation in O((n + m) log n) time while outperforming existing methods in capturing complex graph distributions.
Authors:Dmitry Demidov, Zaigham Zaheer, Omkar Thawakar, Salman Khan, Fahad Shahbaz Khan
Abstract:
Fine-grained image classification, the task of distinguishing between visually similar subcategories within a broader category (e.g., bird species, car models, flower types), is a challenging computer vision problem. Traditional approaches rely heavily on fixed vocabularies and closed-set classification paradigms, limiting their scalability and adaptability in real-world settings where novel classes frequently emerge. Recent research has demonstrated that combining large language models (LLMs) with vision-language models (VLMs) makes open-set recognition possible without the need for predefined class labels. However, the existing methods are often limited in harnessing the power of LLMs at the classification phase, and also rely heavily on the guessed class names provided by an LLM without thorough analysis and refinement. To address these bottlenecks, we propose our training-free method, Enriched-FineR (or E-FineR for short), which demonstrates state-of-the-art results in fine-grained visual recognition while also offering greater interpretability, highlighting its strong potential in real-world scenarios and new domains where expert annotations are difficult to obtain. Additionally, we demonstrate the application of our proposed approach to zero-shot and few-shot classification, where it demonstrated performance on par with the existing SOTA while being training-free and not requiring human interventions. Overall, our vocabulary-free framework supports the shift in image classification from rigid label prediction to flexible, language-driven understanding, enabling scalable and generalizable systems for real-world applications. Well-documented code is available on https://github.com/demidovd98/e-finer.
中文摘要:本文提出的免训练E-FineR方法通过结合大语言模型与视觉语言模型,无需预定义标签即可实现最先进的细粒度图像分类,在零样本和少样本场景中展现出卓越的 interpretability 与性能表现。
English Summary: The proposed training-free E-FineR method combines large language and vision-language models to achieve state-of-the-art fine-grained image classification without predefined labels, offering superior interpretability and performance in zero-shot and few-shot scenarios.
Authors:Ruslan Khrulev
Abstract:
This paper introduces a novel benchmark, EGE-Math Solutions Assessment Benchmark, for evaluating Vision-Language Models (VLMs) on their ability to assess hand-written mathematical solutions. Unlike existing benchmarks that focus on problem solving, our approach centres on understanding student solutions, identifying mistakes, and assigning grades according to fixed criteria. We compile 122 scanned solutions from the Russian Unified State Exam (EGE) together with official expert grades, and evaluate seven modern VLMs from Google, OpenAI, Arcee AI, and Alibaba Cloud in three inference modes. The results reveal current limitations in mathematical reasoning and human-rubric alignment, opening new research avenues in AI-assisted assessment. You can find code in https://github.com/Karifannaa/Auto-check-EGE-math
中文: 本文提出EGE-Math解题评估基准,这一新型评估工具专注于对手写数学解题过程进行评分而非解题本身,通过测试七个先进视觉语言模型揭示了当前在数学推理能力方面的局限。
English: This paper presents the EGE-Math Solutions Assessment Benchmark, a novel evaluation tool for Vision-Language Models that focuses on grading handwritten math solutions rather than solving problems, revealing current limitations in mathematical reasoning through testing seven modern VLMs.
Authors:Harry Shomer, Jiejun Xu
Abstract:
Label placement is a critical aspect of map design, serving as a form of spatial annotation that directly impacts clarity and interpretability. Despite its importance, label placement remains largely manual and difficult to scale, as existing automated systems struggle to integrate cartographic conventions, adapt to context, or interpret labeling instructions. In this work, we introduce a new paradigm for automatic label placement (ALP) that formulates the task as a data editing problem and leverages large language models (LLMs) for context-aware spatial annotation. To support this direction, we curate MAPLE, the first known benchmarking dataset for evaluating ALP on real-world maps, encompassing diverse landmark types and label placement annotations from open-source data. Our method retrieves labeling guidelines relevant to each landmark type leveraging retrieval-augmented generation (RAG), integrates them into prompts, and employs instruction-tuned LLMs to generate ideal label coordinates. We evaluate four open-source LLMs on MAPLE, analyzing both overall performance and generalization across different types of landmarks. This includes both zero-shot and instruction-tuned performance. Our results demonstrate that LLMs, when guided by structured prompts and domain-specific retrieval, can learn to perform accurate spatial edits, aligning the generated outputs with expert cartographic standards. Overall, our work presents a scalable framework for AI-assisted map finishing and demonstrates the potential of foundation models in structured data editing tasks. The code and data can be found at https://github.com/HarryShomer/MAPLE.
中文摘要:本研究提出了一种利用大型语言模型结合制图规范的新型自动标签放置方法,通过构建MAPLE基准数据集实现了符合专业标准的地图空间标注。
English Summary: This paper introduces a novel automatic label placement method using large language models guided by cartographic rules, achieving expert-level spatial annotations through a curated benchmark dataset called MAPLE.
Authors:Shou'ang Wei, Xinyun Wang, Shuzhen Bi, Jian Chen, Ruijia Li, Bo Jiang, Xin Lin, Min Zhang, Yu Song, BingDong Li, Aimin Zhou, Hao Hao
Abstract:
The emergence of Large Language Models (LLMs) presents transformative opportunities for education, generating numerous novel application scenarios. However, significant challenges remain: evaluation metrics vary substantially across different educational scenarios, while many emerging scenarios lack appropriate assessment metrics. Current benchmarks predominantly measure general intelligence rather than pedagogical capabilities. To address this gap, we introduce ELMES, an open-source automated evaluation framework specifically designed for assessing LLMs in educational settings. ELMES features a modular architecture that enables researchers to create dynamic, multi-agent dialogues through simple configuration files, facilitating flexible scenario design without requiring extensive programming expertise. The framework incorporates a hybrid evaluation engine that objectively quantifies traditionally subjective pedagogical metrics using an LLM-as-a-Judge methodology. We conduct systematic benchmarking of state-of-the-art LLMs across four critical educational scenarios: Knowledge Point Explanation, Guided Problem-Solving Teaching, Interdisciplinary Lesson Plan Generation, and Contextualized Question Generation, employing fine-grained metrics developed in collaboration with education specialists. Our results demonstrate distinct capability distributions among models, revealing context-specific strengths and limitations. ELMES provides educators and researchers with an accessible evaluation framework that significantly reduces adaptation barriers for diverse educational applications while advancing the practical implementation of LLMs in pedagogy. The framework is publicly available at \emph{https://github.com/sii-research/elmes.git}.
大型语言模型为教育带来变革机遇但存在评估挑战,为此开发了ELMES开源框架,通过模块化设计和混合指标实现灵活的多智能体教育场景评估。
Large Language Models (LLMs) offer transformative potential for education but face evaluation challenges, leading to the development of ELMES, an open-source framework that enables flexible, multi-agent educational assessments through modular design and hybrid metrics.
Authors:Yixuan Mi, Yiduo Yu, Yiyi Zhao
Abstract:
We present SmartCourse, an integrated course management and AI-driven advising system for undergraduate students (specifically tailored to the Computer Science (CPS) major). SmartCourse addresses the limitations of traditional advising tools by integrating transcript and plan information for student-specific context. The system combines a command-line interface (CLI) and a Gradio web GUI for instructors and students, manages user accounts, course enrollment, grading, and four-year degree plans, and integrates a locally hosted large language model (via Ollama) for personalized course recommendations. It leverages transcript and major plan to offer contextual advice (e.g., prioritizing requirements or retakes). We evaluated the system on 25 representative advising queries and introduced custom metrics: PlanScore, PersonalScore, Lift, and Recall to assess recommendation quality across different context conditions. Experiments show that using full context yields substantially more relevant recommendations than context-omitted modes, confirming the necessity of transcript and plan information for personalized academic advising. SmartCourse thus demonstrates how transcript-aware AI can enhance academic planning.
中文: SmartCourse是一个集成课程管理与AI驱动的学业指导系统,通过结合学生成绩单和专业计划提供个性化课程推荐,评估表明完整上下文信息比无上下文模式能显著提升推荐相关性。
English: SmartCourse is an AI-powered academic advising system that integrates student transcripts and degree plans to deliver personalized course recommendations, with evaluations confirming that full contextual data significantly improves recommendation relevance over context-free approaches.
Authors:Xiaoyu Pan, Yang Bai, Ke Zou, Yang Zhou, Jun Zhou, Huazhu Fu, Yih-Chung Tham, Yong Liu
Abstract:
Medical Large Language Models (MLLMs) play a crucial role in ophthalmic diagnosis, holding significant potential to address vision-threatening diseases. However, their accuracy is constrained by hallucinations stemming from limited ophthalmic knowledge, insufficient visual localization and reasoning capabilities, and a scarcity of multimodal ophthalmic data, which collectively impede precise lesion detection and disease diagnosis. Furthermore, existing medical benchmarks fail to effectively evaluate various types of hallucinations or provide actionable solutions to mitigate them. To address the above challenges, we introduce EH-Benchmark, a novel ophthalmology benchmark designed to evaluate hallucinations in MLLMs. We categorize MLLMs' hallucinations based on specific tasks and error types into two primary classes: Visual Understanding and Logical Composition, each comprising multiple subclasses. Given that MLLMs predominantly rely on language-based reasoning rather than visual processing, we propose an agent-centric, three-phase framework, including the Knowledge-Level Retrieval stage, the Task-Level Case Studies stage, and the Result-Level Validation stage. Experimental results show that our multi-agent framework significantly mitigates both types of hallucinations, enhancing accuracy, interpretability, and reliability. Our project is available at https://github.com/ppxy1/EH-Benchmark.
中文摘要:医疗大语言模型在眼科诊断中存在因知识不足和多模态数据缺乏导致的幻觉问题,EH-Benchmark通过将幻觉分类为视觉理解与逻辑组合两大类型,并采用包含知识检索与结果验证的三阶段多智能体框架,显著提升了诊断准确性与可靠性。
English Summary: Medical Large Language Models (MLLMs) face accuracy limitations in ophthalmology due to hallucinations from insufficient knowledge and multimodal data, which the proposed EH-Benchmark and multi-agent framework effectively mitigate by categorizing and addressing these errors through knowledge retrieval and validation stages.
Authors:Xiaoyu Pan, Yang Bai, Ke Zou, Yang Zhou, Jun Zhou, Huazhu Fu, Yih-Chung Tham, Yong Liu
Abstract:
Medical Large Language Models (MLLMs) play a crucial role in ophthalmic diagnosis, holding significant potential to address vision-threatening diseases. However, their accuracy is constrained by hallucinations stemming from limited ophthalmic knowledge, insufficient visual localization and reasoning capabilities, and a scarcity of multimodal ophthalmic data, which collectively impede precise lesion detection and disease diagnosis. Furthermore, existing medical benchmarks fail to effectively evaluate various types of hallucinations or provide actionable solutions to mitigate them. To address the above challenges, we introduce EH-Benchmark, a novel ophthalmology benchmark designed to evaluate hallucinations in MLLMs. We categorize MLLMs' hallucinations based on specific tasks and error types into two primary classes: Visual Understanding and Logical Composition, each comprising multiple subclasses. Given that MLLMs predominantly rely on language-based reasoning rather than visual processing, we propose an agent-centric, three-phase framework, including the Knowledge-Level Retrieval stage, the Task-Level Case Studies stage, and the Result-Level Validation stage. Experimental results show that our multi-agent framework significantly mitigates both types of hallucinations, enhancing accuracy, interpretability, and reliability. Our project is available at https://github.com/ppxy1/EH-Benchmark.
中文摘要:医疗大语言模型在眼科诊断中存在因知识不足和多模态数据缺乏导致的幻觉问题,EH-Benchmark通过将幻觉分类为视觉理解与逻辑组合两大类型,并采用包含知识检索与结果验证的三阶段多智能体框架,显著提升了诊断准确性与可靠性。
English Summary: Medical Large Language Models (MLLMs) face accuracy limitations in ophthalmology due to hallucinations from insufficient knowledge and multimodal data, which the proposed EH-Benchmark and multi-agent framework effectively mitigate by categorizing and addressing these errors through knowledge retrieval and validation stages.
Authors:Zhehao Tan, Yihan Jiao, Dan Yang, Lei Liu, Jie Feng, Duolin Sun, Yue Shen, Jian Wang, Peng Wei, Jinjie Gu
Abstract:
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating external knowledge, where the LLM's ability to generate responses based on the combination of a given query and retrieved documents is crucial. However, most benchmarks focus on overall RAG system performance, rarely assessing LLM-specific capabilities. Current benchmarks emphasize broad aspects such as noise robustness, but lack a systematic and granular evaluation framework on document utilization. To this end, we introduce \textit{Placeholder-RAG-Benchmark}, a multi-level fine-grained benchmark, emphasizing the following progressive dimensions: (1) multi-level filtering abilities, (2) combination abilities, and (3) reference reasoning. To provide a more nuanced understanding of LLMs' roles in RAG systems, we formulate an innovative placeholder-based approach to decouple the contributions of the LLM's parametric knowledge and the external knowledge. Experiments demonstrate the limitations of representative LLMs in the RAG system's generation capabilities, particularly in error resilience and context faithfulness. Our benchmark provides a reproducible framework for developing more reliable and efficient RAG systems. Our code is available in https://github.com/Alipay-Med/PRGB.
中文: 该摘要介绍了Placeholder-RAG-Benchmark这一多层级评估基准,通过过滤、整合和推理三个递进维度系统评估大语言模型在RAG系统中的文档利用能力,同时采用占位符方法分离参数知识与外部知识贡献,揭示了模型在错误恢复和上下文忠实度方面的局限性。
English: The abstract introduces Placeholder-RAG-Benchmark, a multi-level evaluation framework that systematically assesses LLMs' document utilization in RAG systems through filtering, combination, and reasoning dimensions, while decoupling parametric and external knowledge contributions to reveal limitations in error resilience and context faithfulness.
Authors:Jindong Li, Yali Fu, Jiahong Liu, Linxiao Cao, Wei Ji, Menglin Yang, Irwin King, Ming-Hsuan Yang
Abstract:
The rapid advancement of large language models (LLMs) has intensified the need for effective mechanisms to transform continuous multimodal data into discrete representations suitable for language-based processing. Discrete tokenization, with vector quantization (VQ) as a central approach, offers both computational efficiency and compatibility with LLM architectures. Despite its growing importance, there is a lack of a comprehensive survey that systematically examines VQ techniques in the context of LLM-based systems. This work fills this gap by presenting the first structured taxonomy and analysis of discrete tokenization methods designed for LLMs. We categorize 8 representative VQ variants that span classical and modern paradigms and analyze their algorithmic principles, training dynamics, and integration challenges with LLM pipelines. Beyond algorithm-level investigation, we discuss existing research in terms of classical applications without LLMs, LLM-based single-modality systems, and LLM-based multimodal systems, highlighting how quantization strategies influence alignment, reasoning, and generation performance. In addition, we identify key challenges including codebook collapse, unstable gradient estimation, and modality-specific encoding constraints. Finally, we discuss emerging research directions such as dynamic and task-adaptive quantization, unified tokenization frameworks, and biologically inspired codebook learning. This survey bridges the gap between traditional vector quantization and modern LLM applications, serving as a foundational reference for the development of efficient and generalizable multimodal systems. A continuously updated version is available at: https://github.com/jindongli-Ai/LLM-Discrete-Tokenization-Survey.
中文: 本综述首次系统分析了面向大语言模型的离散标记化向量量化技术,对方法进行分类并探讨融合挑战,同时指明了未来研究方向。
English: This survey provides the first comprehensive analysis of vector quantization techniques for discrete tokenization in large language models, categorizing methods and addressing integration challenges while outlining future research directions.
Authors:Kwun Hang Lau, Ruiyuan Zhang, Weijie Shi, Xiaofang Zhou, Xiaojun Cheng
Abstract:
While Retrieval-Augmented Generation (RAG) excels at injecting static, factual knowledge into Large Language Models (LLMs), it exhibits a critical deficit in handling longitudinal queries that require tracking entities and phenomena across time. This blind spot arises because conventional, semantically-driven retrieval methods are not equipped to gather evidence that is both topically relevant and temporally coherent for a specified duration. We address this challenge by proposing a new framework that fundamentally redesigns the RAG pipeline to infuse temporal logic. Our methodology begins by disentangling a user's query into its core subject and its temporal window. It then employs a specialized retriever that calibrates semantic matching against temporal relevance, ensuring the collection of a contiguous evidence set that spans the entire queried period. To enable rigorous evaluation of this capability, we also introduce the Analytical Diachronic Question Answering Benchmark (ADQAB), a challenging evaluation suite grounded in a hybrid corpus of real and synthetic financial news. Empirical results on ADQAB show that our approach yields substantial gains in answer accuracy, surpassing standard RAG implementations by 13% to 27%. This work provides a validated pathway toward RAG systems capable of performing the nuanced, evolutionary analysis required for complex, real-world questions. The dataset and code for this study are publicly available at https://github.com/kwunhang/TA-RAG.
Chinese Summary: 本文提出了一种新颖框架,通过引入时间逻辑增强检索增强生成(RAG)系统处理纵向查询的能力,在基准测试中相比标准RAG实现准确率提升13%至27%。
English Summary: This paper introduces a novel framework that enhances Retrieval-Augmented Generation (RAG) by incorporating temporal logic to effectively address longitudinal queries, achieving significant accuracy improvements of 13% to 27% over standard RAG systems.
Authors:Siqi Luo, Haoran Yang, Yi Xin, Mingyang Yi, Guangyang Wu, Guangtao Zhai, Xiaohong Liu
Abstract:
Large pre-trained models achieve remarkable performance in vision tasks but are impractical for fine-tuning due to high computational and storage costs. Parameter-Efficient Fine-Tuning (PEFT) methods mitigate this issue by updating only a subset of parameters; however, most existing approaches are task-agnostic, failing to fully exploit task-specific adaptations, which leads to suboptimal efficiency and performance. To address this limitation, we propose Task-Relevant Parameter and Token Selection (TR-PTS), a task-driven framework that enhances both computational efficiency and accuracy. Specifically, we introduce Task-Relevant Parameter Selection, which utilizes the Fisher Information Matrix (FIM) to identify and fine-tune only the most informative parameters in a layer-wise manner, while keeping the remaining parameters frozen. Simultaneously, Task-Relevant Token Selection dynamically preserves the most informative tokens and merges redundant ones, reducing computational overhead. By jointly optimizing parameters and tokens, TR-PTS enables the model to concentrate on task-discriminative information. We evaluate TR-PTS on benchmark, including FGVC and VTAB-1k, where it achieves state-of-the-art performance, surpassing full fine-tuning by 3.40% and 10.35%, respectively. The code are available at https://github.com/synbol/TR-PTS.
中文: 提出的TR-PTS框架通过选择性更新任务相关参数和标记来增强视觉模型微调,在基准测试中以更高效率实现了最先进的性能。
English: The proposed TR-PTS framework enhances vision model fine-tuning by selectively updating task-relevant parameters and tokens, achieving state-of-the-art performance with improved efficiency on benchmark datasets.
Authors:Haichuan Hu, Xiaochen Xie, Quanjun Zhang
Abstract:
APR (Automated Program Repair) aims to automatically locate program defects, generate patches and validate the repairs. Existing techniques for APR are often combined with LLMs (Large Language Models), which leverages the code-related knowledge of LLMs to improve repair effectiveness. Current LLM-based APR methods typically utilize test cases only during the inference stage, adopting an iterative approach that performs repair first and validates it through test execution afterward. This conventional paradigm neglects two important aspects: the potential contribution of test cases in the training phase, and the possibility of leveraging testing prior to repair. To address this, we propose Repair-R1, which introduces test cases into the model's training phase and shifts test generation to precede repair. The model is required to first generate discriminative test cases that can distinguish defective behaviors, and then perform repair based on these tests. This enables the model to better locate defects and understand the underlying causes of defects, thereby improving repair effectiveness. We implement Repair-R1 with three different backbone models, using RL (reinforcement learning) to co-optimize test generation and bug repair. Experimental results on four widely adopted benchmarks demonstrate the superiority of Repair-R1. Specially, compared to vanilla models, Repair-R1 improves repair success rate by 2.68\% to 48.29\%, test generation success rate by 16.38\% to 53.28\%, and test coverage by 0.78\% to 53.96\%. We publish the code and weights at https://github.com/Tomsawyerhu/APR-RL and https://huggingface.co/tomhu/Qwen3-4B-RL-5000-step.
中文: 提出的Repair-R1方法通过在模型训练阶段引入测试用例并优先进行测试生成,显著提升了自动程序修复的成功率和覆盖率,在多个基准测试中表现优异。
English: The proposed Repair-R1 method enhances automated program repair by incorporating test cases during model training and prioritizing test generation before repair, which significantly improves success rates and coverage across multiple benchmarks.
Authors:Yang Luo, Haoyang Luan, Haoyun Pan, Yongquan Jia, Xiaofeng Gao, Guihai Chen
Abstract:
Accurate quality prediction in multi-process manufacturing is critical for industrial efficiency but hindered by three core challenges: time-lagged process interactions, overlapping operations with mixed periodicity, and inter-process dependencies in shared frequency bands. To address these, we propose PAF-Net, a frequency decoupled time series prediction framework with three key innovations: (1) A phase-correlation alignment method guided by frequency domain energy to synchronize time-lagged quality series, resolving temporal misalignment. (2) A frequency independent patch attention mechanism paired with Discrete Cosine Transform (DCT) decomposition to capture heterogeneous operational features within individual series. (3) A frequency decoupled cross attention module that suppresses noise from irrelevant frequencies, focusing exclusively on meaningful dependencies within shared bands. Experiments on 4 real-world datasets demonstrate PAF-Net's superiority. It outperforms 10 well-acknowledged baselines by 7.06% lower MSE and 3.88% lower MAE. Our code is available at https://github.com/StevenLuan904/PAF-Net-Official.
中文:PAF-Net提出了一种频率解耦的时间序列预测框架,通过相位相关对齐、频率无关的补丁注意力机制和解耦交叉注意力模块,有效解决了多工序制造中的时序滞后和频域干扰问题,在四个真实数据集上以MSE降低7.06%和MAE降低3.88%的表现显著优于现有基准方法。
English: PAF-Net introduces a frequency decoupled time series prediction framework that overcomes multi-process manufacturing challenges through phase-correlation alignment, frequency independent patch attention, and cross attention modules, achieving superior performance with 7.06% lower MSE and 3.88% lower MAE than existing baselines.
Authors:Yilei Jiang, Yaozhi Zheng, Yuxuan Wan, Jiaming Han, Qunzhong Wang, Michael R. Lyu, Xiangyu Yue
Abstract:
Automating the transformation of user interface (UI) designs into front-end code holds significant promise for accelerating software development and democratizing design workflows. While recent large language models (LLMs) have demonstrated progress in text-to-code generation, many existing approaches rely solely on natural language prompts, limiting their effectiveness in capturing spatial layout and visual design intent. In contrast, UI development in practice is inherently multimodal, often starting from visual sketches or mockups. To address this gap, we introduce a modular multi-agent framework that performs UI-to-code generation in three interpretable stages: grounding, planning, and generation. The grounding agent uses a vision-language model to detect and label UI components, the planning agent constructs a hierarchical layout using front-end engineering priors, and the generation agent produces HTML/CSS code via adaptive prompt-based synthesis. This design improves robustness, interpretability, and fidelity over end-to-end black-box methods. Furthermore, we extend the framework into a scalable data engine that automatically produces large-scale image-code pairs. Using these synthetic examples, we fine-tune and reinforce an open-source VLM, yielding notable gains in UI understanding and code quality. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in layout accuracy, structural coherence, and code correctness. Our code is made publicly available at https://github.com/leigest519/ScreenCoder.
中文: 本文提出了一种模块化的多智能体框架,通过定位、规划和生成三个可解释阶段将用户界面设计转化为前端代码,显著提升了鲁棒性,并在布局准确性和代码质量方面达到了领先水平。
English: This paper introduces a modular multi-agent framework that transforms UI designs into front-end code through three interpretable stages—grounding, planning, and generation—improving robustness and achieving state-of-the-art performance in layout accuracy and code quality.
Authors:Yuhe Wang, Min Wang, Zhihang Xu
Abstract:
Porous electrodes are widely used in electrochemical systems, where accurately determining electric potentials, particularly overpotentials, is essential for understanding electrode behavior. At the macroscopic scale, porous electrodes are typically modeled using a dual-continuum approach, treating the porous solid phase and the liquid electrolyte as spatially superimposed domains. Determining potential distributions requires solving two Poisson equations that are nonlinearly coupled through Butler-Volmer kinetics under galvanostatic and potentiostatic operating modes. Under galvanostatic operation, these equations form an underconstrained singular system due to all-Neumann boundary conditions, posing numerical challenges. This paper systematically presents numerical methods for solving nonlinearly coupled Poisson equations in dual-continuum porous electrodes, with a particular focus on galvanostatic solutions. We mathematically establish solution uniqueness in terms of the potential difference between the electrode and electrolyte (or overpotential), as well as the individual potentials up to a shared constant shift. To resolve the nonuniqueness of the solution, we introduce three numerical approaches: (1) Lagrange Constrained Method (LCM), (2) Dirichlet Substitution Method (DSM), and (3) Global Constraining Method (GCM), where GCM enables solving the overpotential without imposing an explicit system reference potential. Additionally, we develop both decoupled and fully coupled nonlinear solution strategies and evaluate their computational performance in both homogeneous and heterogeneous conductivity cases. The presented numerical methods are general for addressing similar underconstrained nonlinear systems. A Python implementation is provided at https://github.com/harrywang1129/porous_electrode_solver.
中文: 本文系统提出了求解双连续多孔电极中非线性耦合泊松方程的数值方法,重点针对恒电流操作模式,通过三种数值策略解决解的唯一性问题,并开发了解耦与全耦合计算方案及开源代码实现。
English: This paper introduces numerical methods for solving nonlinearly coupled Poisson equations in dual-continuum porous electrodes, focusing on galvanostatic operation and addressing solution uniqueness through three approaches while providing computational strategies and open-source implementation.
Authors:Hossein Mirzaei, Zeinab Taghavi, Sepehr Rezaee, Masoud Hadi, Moein Madadi, Mackenzie W. Mathis
Abstract:
Deep neural networks have demonstrated remarkable success across numerous tasks, yet they remain vulnerable to Trojan (backdoor) attacks, raising serious concerns about their safety in real-world mission-critical applications. A common countermeasure is trigger inversion -- reconstructing malicious "shortcut" patterns (triggers) inserted by an adversary during training. Current trigger-inversion methods typically search the full pixel space under specific assumptions but offer no assurances that the estimated trigger is more than an adversarial perturbation that flips the model output. Here, we propose a data-free, zero-shot trigger-inversion strategy that restricts the search space while avoiding strong assumptions on trigger appearance. Specifically, we incorporate a diffusion-based generator guided by the target classifier; through iterative generation, we produce candidate triggers that align with the internal representations the model relies on for malicious behavior. Empirical evaluations, both quantitative and qualitative, show that our approach reconstructs triggers that effectively distinguish clean versus Trojaned models. DISTIL surpasses alternative methods by high margins, achieving up to 7.1% higher accuracy on the BackdoorBench dataset and a 9.4% improvement on trojaned object detection model scanning, offering a promising new direction for reliable backdoor defense without reliance on extensive data or strong prior assumptions about triggers. The code is available at https://github.com/AdaptiveMotorControlLab/DISTIL.
中文: 本文提出DISTIL方法,这是一种无需数据、零样本的触发器反演策略,通过扩散生成器在目标分类器引导下重构后门触发器,无需大量数据或强假设即能有效区分正常与中毒模型,在多项测试中以显著优势超越现有方法。
English: This paper introduces DISTIL, a data-free, zero-shot trigger-inversion method that uses a diffusion-based generator guided by the target classifier to reconstruct Trojan triggers effectively, outperforming existing approaches with significant accuracy improvements in backdoor detection without relying on extensive data or strong trigger assumptions.
Authors:Dongli He, Hu Wang, Mohammad Yaqub
Abstract:
Accurate fetal biometric measurements, such as abdominal circumference, play a vital role in prenatal care. However, obtaining high-quality ultrasound images for these measurements heavily depends on the expertise of sonographers, posing a significant challenge in low-income countries due to the scarcity of trained personnel. To address this issue, we leverage FetalCLIP, a vision-language model pretrained on a curated dataset of over 210,000 fetal ultrasound image-caption pairs, to perform automated fetal ultrasound image quality assessment (IQA) on blind-sweep ultrasound data. We introduce FetalCLIP$_{CLS}$, an IQA model adapted from FetalCLIP using Low-Rank Adaptation (LoRA), and evaluate it on the ACOUSLIC-AI dataset against six CNN and Transformer baselines. FetalCLIP$_{CLS}$ achieves the highest F1 score of 0.757. Moreover, we show that an adapted segmentation model, when repurposed for classification, further improves performance, achieving an F1 score of 0.771. Our work demonstrates how parameter-efficient fine-tuning of fetal ultrasound foundation models can enable task-specific adaptations, advancing prenatal care in resource-limited settings. The experimental code is available at: https://github.com/donglihe-hub/FetalCLIP-IQA.
中文: 本研究采用FetalCLIP视觉语言模型,通过低秩适应技术实现胎儿超声图像质量自动评估,以0.771的F1分数展现优异性能,为资源有限地区的产前护理提供了有效解决方案。
English: The study introduces FetalCLIP, a vision-language model adapted using Low-Rank Adaptation for automated fetal ultrasound image quality assessment, achieving superior performance with an F1 score of 0.771 and demonstrating potential to enhance prenatal care in resource-limited areas.
Authors:Hang Su, Yunlong Feng, Daniel Gehrig, Panfeng Jiang, Ling Gao, Xavier Lagorce, Laurent Kneip
Abstract:
Structure and continuous motion estimation from point correspondences is a fundamental problem in computer vision that has been powered by well-known algorithms such as the familiar 5-point or 8-point algorithm. However, despite their acclaim, these algorithms are limited to processing point correspondences originating from a pair of views each one representing an instantaneous capture of the scene. Yet, in the case of rolling shutter cameras, or more recently, event cameras, this synchronization breaks down. In this work, we present a unified approach for structure and linear motion estimation from 2D point correspondences with arbitrary timestamps, from an arbitrary set of views. By formulating the problem in terms of first-order dynamics and leveraging a constant velocity motion model, we derive a novel, linear point incidence relation allowing for the efficient recovery of both linear velocity and 3D points with predictable degeneracies and solution multiplicities. Owing to its general formulation, it can handle correspondences from a wide range of sensing modalities such as global shutter, rolling shutter, and event cameras, and can even combine correspondences from different collocated sensors. We validate the effectiveness of our solver on both simulated and real-world data, where we show consistent improvement across all modalities when compared to recent approaches. We believe our work opens the door to efficient structure and motion estimation from asynchronous data. Code can be found at https://github.com/suhang99/AsyncTrack-Motion-Solver.
中文: 本文提出了一种统一方法,可从任意时间戳的二维点对应中估计结构和线性运动,有效适用于全局快门、卷帘快门和事件相机等多种传感器。
English: This paper introduces a unified method for estimating structure and linear motion from 2D point correspondences with arbitrary timestamps, enabling efficient processing across various camera types including global shutter, rolling shutter, and event cameras.
Authors:Jie He, Victor Gutiérrez-Basulto, Jeff Z. Pan
Abstract:
Reinforcement learning-based retrieval-augmented generation (RAG) methods enhance the reasoning abilities of large language models (LLMs). However, most rely only on final-answer rewards, overlooking intermediate reasoning quality. This paper analyzes existing RAG reasoning models and identifies three main failure patterns: (1) information insufficiency, meaning the model fails to retrieve adequate support; (2) faulty reasoning, where logical or content-level flaws appear despite sufficient information; and (3) answer-reasoning inconsistency, where a valid reasoning chain leads to a mismatched final answer. We propose TIRESRAG-R1, a novel framework using a think-retrieve-reflect process and a multi-dimensional reward system to improve reasoning and stability. TIRESRAG-R1 introduces: (1) a sufficiency reward to encourage thorough retrieval; (2) a reasoning quality reward to assess the rationality and accuracy of the reasoning chain; and (3) a reflection reward to detect and revise errors. It also employs a difficulty-aware reweighting strategy and training sample filtering to boost performance on complex tasks. Experiments on four multi-hop QA datasets show that TIRESRAG-R1 outperforms prior RAG methods and generalizes well to single-hop tasks. The code and data are available at: https://github.com/probe2/TIRESRAG-R1.
中文: 本文提出TIRESRAG-R1框架,通过多维奖励机制和“思考-检索-反思”流程改进检索增强生成方法,有效解决了信息不足、错误推理和答案不一致三大问题,在多项复杂问答任务中表现优异。
English: This paper introduces TIRESRAG-R1, a reinforcement learning framework that enhances retrieval-augmented generation by addressing common failure patterns through a multi-dimensional reward system and think-retrieve-reflect process, demonstrating superior performance on multi-hop QA tasks.
Authors:Thuy Tran, Ruochen Chen, Shaifali Parashar
Abstract:
Shape-from-Template (SfT) refers to the class of methods that reconstruct the 3D shape of a deforming object from images/videos using a 3D template. Traditional SfT methods require point correspondences between images and the texture of the 3D template in order to reconstruct 3D shapes from images/videos in real time. Their performance severely degrades when encountered with severe occlusions in the images because of the unavailability of correspondences. In contrast, modern SfT methods use a correspondence-free approach by incorporating deep neural networks to reconstruct 3D objects, thus requiring huge amounts of data for supervision. Recent advances use a fully unsupervised or self-supervised approach by combining differentiable physics and graphics to deform 3D template to match input images. In this paper, we propose an unsupervised SfT which uses only image observations: color features, gradients and silhouettes along with a mesh inextensibility constraint to reconstruct at a $400\times$ faster pace than (best-performing) unsupervised SfT. Moreover, when it comes to generating finer details and severe occlusions, our method outperforms the existing methodologies by a large margin. Code is available at https://github.com/dvttran/nsft.
中文: 我们提出的无监督形状模板法仅利用图像特征和网格约束,以比最优方法快400倍的速度重建三维物体,并在处理精细细节和严重遮挡方面远超现有技术。
English: Our unsupervised Shape-from-Template method reconstructs 3D objects 400 times faster than leading approaches by using only image features and mesh constraints, while significantly outperforming existing methods in handling fine details and severe occlusions.
Authors:Jia Li, Yang Wang, Wenhao Qian, Jialong Hu, Zhenzhen Hu, Richang Hong, Meng Wang
Abstract:
Interview performance assessment is essential for determining candidates' suitability for professional positions. To ensure holistic and fair evaluations, we propose a novel and comprehensive framework that explores ``365'' aspects of interview performance by integrating \textit{three} modalities (video, audio, and text), \textit{six} responses per candidate, and \textit{five} key evaluation dimensions. The framework employs modality-specific feature extractors to encode heterogeneous data streams and subsequently fused via a Shared Compression Multilayer Perceptron. This module compresses multimodal embeddings into a unified latent space, facilitating efficient feature interaction. To enhance prediction robustness, we incorporate a two-level ensemble learning strategy: (1) independent regression heads predict scores for each response, and (2) predictions are aggregated across responses using a mean-pooling mechanism to produce final scores for the five target dimensions. By listening to the unspoken, our approach captures both explicit and implicit cues from multimodal data, enabling comprehensive and unbiased assessments. Achieving a multi-dimensional average MSE of 0.1824, our framework secured first place in the AVI Challenge 2025, demonstrating its effectiveness and robustness in advancing automated and multimodal interview performance assessment. The full implementation is available at https://github.com/MSA-LMC/365Aspects.
中文: 本研究提出了一种全面的“365”多模态面试评估框架,通过特征融合和集成学习整合视频、音频和文本数据,对六个回答和五个评估维度进行分析,实现了稳健且无偏见的评估,并在AVI 2025挑战赛中荣获第一名验证了其有效性。
English: This study introduces a comprehensive "365" framework for multimodal interview assessment, integrating video, audio, and text data across six responses and five evaluation dimensions through feature fusion and ensemble learning to achieve robust, unbiased evaluations, as demonstrated by its first-place performance in the AVI Challenge 2025.
Authors:Shenghao Zhu, Yifei Chen, Weihong Chen, Yuanhan Wang, Chang Liu, Shuo Jiang, Feiwei Qin, Changmiao Wang
Abstract:
Accurate and reliable brain tumor segmentation, particularly when dealing with missing modalities, remains a critical challenge in medical image analysis. Previous studies have not fully resolved the challenges of tumor boundary segmentation insensitivity and feature transfer in the absence of key imaging modalities. In this study, we introduce MST-KDNet, aimed at addressing these critical issues. Our model features Multi-Scale Transformer Knowledge Distillation to effectively capture attention weights at various resolutions, Dual-Mode Logit Distillation to improve the transfer of knowledge, and a Global Style Matching Module that integrates feature matching with adversarial learning. Comprehensive experiments conducted on the BraTS and FeTS 2024 datasets demonstrate that MST-KDNet surpasses current leading methods in both Dice and HD95 scores, particularly in conditions with substantial modality loss. Our approach shows exceptional robustness and generalization potential, making it a promising candidate for real-world clinical applications. Our source code is available at https://github.com/Quanato607/MST-KDNet.
Chinese: MST-KDNet通过多尺度Transformer知识蒸馏和全局风格匹配技术,有效提升了脑肿瘤分割在模态缺失情况下的准确性和鲁棒性,在主流数据集上超越了现有最优方法。
English: MST-KDNet introduces multi-scale transformer knowledge distillation and global style matching to enhance brain tumor segmentation accuracy and robustness, particularly under missing modalities, outperforming state-of-the-art methods on major datasets.
Authors:Daniil Gurgurov, Katharina Trinley, Ivan Vykopal, Josef van Genabith, Simon Ostermann, Roberto Zamparelli
Abstract:
Large language models (LLMs) are increasingly used in everyday tools and applications, raising concerns about their potential influence on political views. While prior research has shown that LLMs often exhibit measurable political biases--frequently skewing toward liberal or progressive positions--key gaps remain. Most existing studies evaluate only a narrow set of models and languages, leaving open questions about the generalizability of political biases across architectures, scales, and multilingual settings. Moreover, few works examine whether these biases can be actively controlled.
In this work, we address these gaps through a large-scale study of political orientation in modern open-source instruction-tuned LLMs. We evaluate seven models, including LLaMA-3.1, Qwen-3, and Aya-Expanse, across 14 languages using the Political Compass Test with 11 semantically equivalent paraphrases per statement to ensure robust measurement. Our results reveal that larger models consistently shift toward libertarian-left positions, with significant variations across languages and model families. To test the manipulability of political stances, we utilize a simple center-of-mass activation intervention technique and show that it reliably steers model responses toward alternative ideological positions across multiple languages. Our code is publicly available at https://github.com/d-gurgurov/Political-Ideologies-LLMs.
中文摘要:研究表明大型开源语言模型在多语言环境中普遍呈现自由左翼政治倾向,并证实可通过特定干预技术有效操控其政治立场。
English Summary: This study reveals that larger open-source language models consistently exhibit libertarian-left political biases across multiple languages, and demonstrates that these biases can be systematically manipulated through targeted interventions.
Authors:Daniil Gurgurov, Katharina Trinley, Yusser Al Ghussin, Tanja Baeumel, Josef van Genabith, Simon Ostermann
Abstract:
Large language models (LLMs) exhibit strong multilingual abilities, yet the neural mechanisms behind language-specific processing remain unclear. We analyze language-specific neurons in Llama-3.1-8B, Mistral-Nemo-12B, and Aya-Expanse-8B & 32B across 21 typologically diverse languages, identifying neurons that control language behavior. Using the Language Activation Probability Entropy (LAPE) method, we show that these neurons cluster in deeper layers, with non-Latin scripts showing greater specialization. Related languages share overlapping neurons, reflecting internal representations of linguistic proximity.
Through language arithmetics, i.e. systematic activation addition and multiplication, we steer models to deactivate unwanted languages and activate desired ones, outperforming simpler replacement approaches. These interventions effectively guide behavior across five multilingual tasks: language forcing, translation, QA, comprehension, and NLI. Manipulation is more successful for high-resource languages, while typological similarity improves effectiveness. We also demonstrate that cross-lingual neuron steering enhances downstream performance and reveal internal "fallback" mechanisms for language selection when neurons are progressively deactivated. Our code is made publicly available at https://github.com/d-gurgurov/Language-Neurons-Manipulation.
中文: 本研究识别出大语言模型中的语言特定神经元,并通过语言算术操作证明能有效控制多语言任务行为,其效果受语言资源丰富度和类型相似性影响。
English: This study identifies language-specific neurons in large language models and demonstrates that manipulating them through language arithmetics effectively controls multilingual behavior across various tasks, with performance influenced by language resources and typological similarity.
Authors:Takuma Amada, Kazuya Kakizaki, Taiki Miyagawa, Akinori F. Ebihara, Kaede Shiohara, Toshihiko Yamasaki
Abstract:
In this paper, we present a deepfake detection algorithm specifically designed for electronic Know Your Customer (eKYC) systems. To ensure the reliability of eKYC systems against deepfake attacks, it is essential to develop a robust deepfake detector capable of identifying both face swapping and face reenactment, while also being robust to image degradation. We address these challenges through three key contributions: (1)~Our approach evaluates the video's authenticity by detecting temporal inconsistencies in identity vectors extracted by face recognition models, leading to comprehensive detection of both face swapping and face reenactment. (2)~In addition to processing video input, the algorithm utilizes a registered image (assumed to be genuine) to calculate identity discrepancies between the input video and the registered image, significantly improving detection accuracy. (3)~We find that employing a face feature extractor trained on a larger dataset enhances both detection performance and robustness against image degradation. Our experimental results show that our proposed method accurately detects both face swapping and face reenactment comprehensively and is robust against various forms of unseen image degradation. Our source code is publicly available https://github.com/TaikiMiyagawa/DeepfakeDetection4eKYC.
中文: 本文提出一种面向电子客户识别的深度伪造检测算法,通过分析人脸识别模型提取身份向量的时序不一致性,并利用注册图像计算身份差异,能够全面检测换脸和表情操纵,且对图像质量退化具有强鲁棒性。
English: This paper introduces a deepfake detection algorithm for eKYC systems that identifies face swapping and reenactment by analyzing temporal inconsistencies in identity vectors and leveraging registered images for enhanced accuracy, while demonstrating robustness against image degradation.
Authors:Inaya Rahmanisa, Lyzander Marciano Andrylie, Mahardika Krisna Ihsani, Alfan Farizki Wicaksono, Haryo Akbarianto Wibowo, Alham Fikri Aji
Abstract:
Language-specific neurons in LLMs that strongly correlate with individual languages have been shown to influence model behavior by deactivating them. However, their role in amplification remains underexplored. This work investigates the effect of amplifying language-specific neurons through interventions across 18 languages, including low-resource ones, using three models primarily trained in different languages. We compare amplification factors by their effectiveness in steering to the target language using a proposed Language Steering Shift (LSS) evaluation score, then evaluate it on downstream tasks: commonsense reasoning (XCOPA, XWinograd), knowledge (Include), and translation (FLORES). The optimal amplification factors effectively steer output toward nearly all tested languages. Intervention using this factor on downstream tasks improves self-language performance in some cases but generally degrades cross-language results. These findings highlight the effect of language-specific neurons in multilingual behavior, where amplification can be beneficial especially for low-resource languages, but provides limited advantage for cross-lingual transfer.
中文摘要:通过增强多语言大模型中的语言特定神经元,能有效引导输出转向目标语言,虽对低资源语言性能有所提升,但普遍削弱了跨语言推理、知识和翻译任务的迁移效果。
English Summary: Amplifying language-specific neurons in multilingual LLMs effectively steers outputs toward target languages, enhancing performance for low-resource languages but generally impairing cross-lingual transfer across reasoning, knowledge, and translation tasks.
Authors:Galadrielle Humblot-Renaux, Gianni Franchi, Sergio Escalera, Thomas B. Moeslund
Abstract:
Out-of-distribution (OOD) detection is an important building block in trustworthy image recognition systems as unknown classes may arise at test-time. OOD detection methods typically revolve around a single classifier, leading to a split in the research field between the classical supervised setting (e.g. ResNet18 classifier trained on CIFAR100) vs. the zero-shot setting (class names fed as prompts to CLIP). In both cases, an overarching challenge is that the OOD detection performance is implicitly constrained by the classifier's capabilities on in-distribution (ID) data. In this work, we show that given a little open-mindedness from both ends, remarkable OOD detection can be achieved by instead creating a heterogeneous ensemble - COOkeD combines the predictions of a closed-world classifier trained end-to-end on a specific dataset, a zero-shot CLIP classifier, and a linear probe classifier trained on CLIP image features. While bulky at first sight, this approach is modular, post-hoc and leverages the availability of pre-trained VLMs, thus introduces little overhead compared to training a single standard classifier. We evaluate COOkeD on popular CIFAR100 and ImageNet benchmarks, but also consider more challenging, realistic settings ranging from training-time label noise, to test-time covariate shift, to zero-shot shift which has been previously overlooked. Despite its simplicity, COOkeD achieves state-of-the-art performance and greater robustness compared to both classical and CLIP-based OOD detection methods. Code is available at https://github.com/glhr/COOkeD
中文: COOkeD提出了一种异构集成方法,融合了闭域分类器、零样本CLIP分类器和线性探针分类器,在多种挑战性场景下实现了最先进的分布外检测性能并展现出更强的鲁棒性。
English: COOkeD introduces a heterogeneous ensemble method combining closed-world, zero-shot CLIP, and linear probe classifiers to achieve state-of-the-art OOD detection performance with enhanced robustness across diverse challenging scenarios.
Authors:Shijing Chen, Xinrui Zhou, Yuhao Wang, Yuhao Huang, Ao Chang, Dong Ni, Ruobing Huang
Abstract:
Accurate identification of breast lesion subtypes can facilitate personalized treatment and interventions. Ultrasound (US), as a safe and accessible imaging modality, is extensively employed in breast abnormality screening and diagnosis. However, the incidence of different subtypes exhibits a skewed long-tailed distribution, posing significant challenges for automated recognition. Generative augmentation provides a promising solution to rectify data distribution. Inspired by this, we propose a dual-phase framework for long-tailed classification that mitigates distributional bias through high-fidelity data synthesis while avoiding overuse that corrupts holistic performance. The framework incorporates a reinforcement learning-driven adaptive sampler, dynamically calibrating synthetic-real data ratios by training a strategic multi-agent to compensate for scarcities of real data while ensuring stable discriminative capability. Furthermore, our class-controllable synthetic network integrates a sketch-grounded perception branch that harnesses anatomical priors to maintain distinctive class features while enabling annotation-free inference. Extensive experiments on an in-house long-tailed and a public imbalanced breast US datasets demonstrate that our method achieves promising performance compared to state-of-the-art approaches. More synthetic images can be found at https://github.com/Stinalalala/Breast-LT-GenAug.
Chinese: 本研究提出了一种双阶段框架,通过生成式增强和强化学习解决乳腺超声病灶分类中的长尾分布问题,利用自适应数据合成和类别可控特征保持技术,实现了优越的分类性能。
English: This study introduces a dual-phase framework that uses generative augmentation and reinforcement learning to address the long-tailed distribution challenge in breast ultrasound lesion classification, achieving superior performance through adaptive data synthesis and class-controllable feature preservation.
Authors:Weicheng Gao
Abstract:
This work is completed on a whim after discussions with my junior colleague. The motion direction angle affects the micro-Doppler spectrum width, thus determining the human motion direction can provide important prior information for downstream tasks such as gait recognition. However, Doppler-Time map (DTM)-based methods still have room for improvement in achieving feature augmentation and motion determination simultaneously. In response, a low-cost but accurate radar-based human motion direction determination (HMDD) method is explored in this paper. In detail, the radar-based human gait DTMs are first generated, and then the feature augmentation is achieved using feature linking model. Subsequently, the HMDD is implemented through a lightweight and fast Vision Transformer-Convolutional Neural Network hybrid model structure. The effectiveness of the proposed method is verified through open-source dataset. The open-source code of this work is released at: https://github.com/JoeyBGOfficial/Low-Cost-Accurate-Radar-Based-Human-Motion-Direction-Determination.
中文: 本文提出了一种低成本的雷达方法,通过结合视觉Transformer和CNN的混合模型增强微多普勒特征,以精确识别人体运动方向,并在开源数据集上验证了其有效性。
English: This paper introduces a low-cost radar-based method that uses a hybrid Vision Transformer-CNN model to enhance micro-Doppler features and accurately determine human motion direction, validated on an open-source dataset.
Authors:Joshua Dimasaka, Christian GeiÃ, Emily So
Abstract:
To understand our global progress for sustainable development and disaster risk reduction in many developing economies, two recent major initiatives - the Uniform African Exposure Dataset of the Global Earthquake Model (GEM) Foundation and the Modelling Exposure through Earth Observation Routines (METEOR) Project - implemented classical spatial disaggregation techniques to generate large-scale mapping of urban morphology using the information from various satellite imagery and its derivatives, geospatial datasets of the built environment, and subnational census statistics. However, the local discrepancy with well-validated census statistics and the propagated model uncertainties remain a challenge in such coarse-to-fine-grained mapping problems, specifically constrained by weak and conditional label supervision. Therefore, we present Deep Conditional Census-Constrained Clustering (DeepC4), a novel deep learning-based spatial disaggregation approach that incorporates local census statistics as cluster-level constraints while considering multiple conditional label relationships in a joint multitask learning of the patterns of satellite imagery. To demonstrate, compared to GEM and METEOR, we enhanced the quality of Rwandan maps of urban morphology, specifically building exposure and physical vulnerability, at the third-level administrative unit from the 2022 census. As the world approaches the conclusion of our global frameworks in 2030, our work has offered a new deep learning-based mapping technique towards a spatial auditing of our existing coarse-grained derived information at large scales.
中文摘要:DeepC4模型采用融合人口普查数据约束的深度学习方法,相比GEM和METEOR等传统技术,显著提升了卢旺达城市形态(特别是建筑暴露性与物理脆弱性)的精细制图质量。
English Summary: The DeepC4 model introduces a deep learning approach that integrates census data as constraints to improve spatial mapping accuracy, outperforming traditional methods like GEM and METEOR in generating detailed urban morphology maps for Rwanda.
Authors:Xincheng Yao, Yijun Yang, Kangwei Guo, Ruiqiang Xiao, Haipeng Zhou, Haisu Tao, Jian Yang, Lei Zhu
Abstract:
The segmentation of the hepatic vasculature in surgical videos holds substantial clinical significance in the context of hepatectomy procedures. However, owing to the dearth of an appropriate dataset and the inherently complex task characteristics, few researches have been reported in this domain. To address this issue, we first introduce a high quality frame-by-frame annotated hepatic vasculature dataset containing 35 long hepatectomy videos and 11442 high-resolution frames. On this basis, we propose a novel high-resolution video vasculature segmentation network, dubbed as HRVVS. We innovatively embed a pretrained visual autoregressive modeling (VAR) model into different layers of the hierarchical encoder as prior information to reduce the information degradation generated during the downsampling process. In addition, we designed a dynamic memory decoder on a multi-view segmentation network to minimize the transmission of redundant information while preserving more details between frames. Extensive experiments on surgical video datasets demonstrate that our proposed HRVVS significantly outperforms the state-of-the-art methods. The source code and dataset will be publicly available at \{https://github.com/scott-yjyang/HRVVS}.
中文: 本研究提出了一种高分辨率肝血管分割网络(HRVVS)并发布了新标注数据集,通过嵌入预训练VAR模型和动态记忆解码器,在手术视频分析中显著优于现有方法。
English: This study introduces a high-resolution hepatic vasculature segmentation network (HRVVS) and a new annotated dataset, which significantly outperforms existing methods by integrating a pretrained VAR model and a dynamic memory decoder to enhance surgical video analysis.
Authors:Ziyi Wang, Peiming Li, Hong Liu, Zhichao Deng, Can Wang, Jun Liu, Junsong Yuan, Mengyuan Liu
Abstract:
Natural Human-Robot Interaction (N-HRI) requires robots to recognize human actions at varying distances and states, regardless of whether the robot itself is in motion or stationary. This setup is more flexible and practical than conventional human action recognition tasks. However, existing benchmarks designed for traditional action recognition fail to address the unique complexities in N-HRI due to limited data, modalities, task categories, and diversity of subjects and environments. To address these challenges, we introduce ACTIVE (Action from Robotic View), a large-scale dataset tailored specifically for perception-centric robotic views prevalent in mobile service robots. ACTIVE comprises 30 composite action categories, 80 participants, and 46,868 annotated video instances, covering both RGB and point cloud modalities. Participants performed various human actions in diverse environments at distances ranging from 3m to 50m, while the camera platform was also mobile, simulating real-world scenarios of robot perception with varying camera heights due to uneven ground. This comprehensive and challenging benchmark aims to advance action and attribute recognition research in N-HRI. Furthermore, we propose ACTIVE-PC, a method that accurately perceives human actions at long distances using Multilevel Neighborhood Sampling, Layered Recognizers, Elastic Ellipse Query, and precise decoupling of kinematic interference from human actions. Experimental results demonstrate the effectiveness of ACTIVE-PC. Our code is available at: https://github.com/wangzy01/ACTIVE-Action-from-Robotic-View.
中文摘要:ACTIVE数据集针对自然人机交互中现有基准的不足,通过从移动平台采集包含RGB和点云模态的全面视频数据,覆盖不同距离和环境,旨在推动该领域的研究发展。
English Summary: The ACTIVE dataset is introduced to address the limitations of existing benchmarks in Natural Human-Robot Interaction by providing comprehensive video data with RGB and point cloud modalities, collected from mobile platforms across varying distances and environments.
Authors:Lei Sheng, Shuai-Shuai Xu
Abstract:
Large language models (LLMs) have demonstrated strong performance in translating natural language questions into SQL queries (Text-to-SQL). In contrast, small language models (SLMs) ranging from 0.5B to 1.5B parameters currently underperform on Text-to-SQL tasks due to their limited logical reasoning capabilities. However, SLMs offer inherent advantages in inference speed and suitability for edge deployment. To explore their potential in Text-to-SQL applications, we leverage recent advancements in post-training techniques. Specifically, we used the open-source SynSQL-2.5M dataset to construct two derived datasets: SynSQL-Think-916K for SQL generation and SynSQL-Merge-Think-310K for SQL merge revision. We then applied supervised fine-tuning and reinforcement learning-based post-training to the SLM, followed by inference using a corrective self-consistency approach. Experimental results validate the effectiveness and generalizability of our method, SLM-SQL. On the BIRD development set, the five evaluated models achieved an average improvement of 31.4 points. Notably, the 0.5B model reached 56.87\% execution accuracy (EX), while the 1.5B model achieved 67.08\% EX. We will release our dataset, model, and code to github: https://github.com/CycloneBoy/slm_sql.
中文: 大语言模型在文本转SQL任务中表现优异,而小模型虽落后但具备部署优势,本研究通过后训练技术显著提升了它们的性能。
English: Large language models excel in Text-to-SQL tasks, while smaller models lag but offer deployment advantages, prompting a study that successfully enhanced their performance through post-training techniques.
Authors:Hui Liu, Chen Jia, Fan Shi, Xu Cheng, Mengfei Shi, Xia Xie, Shengyong Chen
Abstract:
Achieving pixel-level segmentation with low computational cost using multimodal data remains a key challenge in crack segmentation tasks. Existing methods lack the capability for adaptive perception and efficient interactive fusion of cross-modal features. To address these challenges, we propose a Lightweight Adaptive Cue-Aware Vision Mamba network (LIDAR), which efficiently perceives and integrates morphological and textural cues from different modalities under multimodal crack scenarios, generating clear pixel-level crack segmentation maps. Specifically, LIDAR is composed of a Lightweight Adaptive Cue-Aware Visual State Space module (LacaVSS) and a Lightweight Dual Domain Dynamic Collaborative Fusion module (LD3CF). LacaVSS adaptively models crack cues through the proposed mask-guided Efficient Dynamic Guided Scanning Strategy (EDG-SS), while LD3CF leverages an Adaptive Frequency Domain Perceptron (AFDP) and a dual-pooling fusion strategy to effectively capture spatial and frequency-domain cues across modalities. Moreover, we design a Lightweight Dynamically Modulated Multi-Kernel convolution (LDMK) to perceive complex morphological structures with minimal computational overhead, replacing most convolutional operations in LIDAR. Experiments on three datasets demonstrate that our method outperforms other state-of-the-art (SOTA) methods. On the light-field depth dataset, our method achieves 0.8204 in F1 and 0.8465 in mIoU with only 5.35M parameters. Code and datasets are available at https://github.com/Karl1109/LIDAR-Mamba.
中文: 提出的轻量自适应线索感知视觉Mamba网络(LIDAR)通过自适应感知与融合模块有效整合多模态裂缝特征,以较低计算成本实现了优于现有方法的像素级分割效果。
English: The proposed Lightweight Adaptive Cue-Aware Vision Mamba network (LIDAR) effectively integrates multimodal crack features through adaptive perception and fusion modules, achieving superior pixel-level segmentation with minimal computational cost compared to existing methods.
Authors:Zheng Xiangyu, He Songcheng, Li Wanyun, Li Xiaoqiang, Zhang Wei
Abstract:
Unsupervised Video Object Segmentation (UVOS) aims to predict pixel-level masks for the most salient objects in videos without any prior annotations. While memory mechanisms have been proven critical in various video segmentation paradigms, their application in UVOS yield only marginal performance gains despite sophisticated design. Our analysis reveals a simple but fundamental flaw in existing methods: over-reliance on memorizing high-level semantic features. UVOS inherently suffers from the deficiency of lacking fine-grained information due to the absence of pixel-level prior knowledge. Consequently, memory design relying solely on high-level features, which predominantly capture abstract semantic cues, is insufficient to generate precise predictions. To resolve this fundamental issue, we propose a novel hierarchical memory architecture to incorporate both shallow- and high-level features for memory, which leverages the complementary benefits of pixel and semantic information. Furthermore, to balance the simultaneous utilization of the pixel and semantic memory features, we propose a heterogeneous interaction mechanism to perform pixel-semantic mutual interactions, which explicitly considers their inherent feature discrepancies. Through the design of Pixel-guided Local Alignment Module (PLAM) and Semantic-guided Global Integration Module (SGIM), we achieve delicate integration of the fine-grained details in shallow-level memory and the semantic representations in high-level memory. Our Hierarchical Memory with Heterogeneous Interaction Network (HMHI-Net) consistently achieves state-of-the-art performance across all UVOS and video saliency detection benchmarks. Moreover, HMHI-Net consistently exhibits high performance across different backbones, further demonstrating its superiority and robustness. Project page: https://github.com/ZhengxyFlow/HMHI-Net .
中文: 本文指出现有无监督视频对象分割方法过度依赖高级语义记忆的缺陷,提出HMHI-Net层次化记忆架构,通过异质交互机制融合浅层与深层特征,在多个基准测试中均取得最优性能。
English: This paper identifies the limitation of existing UVOS methods in over-relying on high-level semantic memory and proposes HMHI-Net, a hierarchical memory architecture with heterogeneous interaction that integrates both shallow and deep features to achieve state-of-the-art performance across benchmarks.
Authors:Jaeha Kim, Junghun Oh, Kyoung Mu Lee
Abstract:
Task-driven image restoration (TDIR) has recently emerged to address performance drops in high-level vision tasks caused by low-quality (LQ) inputs. Previous TDIR methods struggle to handle practical scenarios in which images are degraded by multiple complex factors, leaving minimal clues for restoration. This motivates us to leverage the diffusion prior, one of the most powerful natural image priors. However, while the diffusion prior can help generate visually plausible results, using it to restore task-relevant details remains challenging, even when combined with recent TDIR methods. To address this, we propose EDTR, which effectively harnesses the power of diffusion prior to restore task-relevant details. Specifically, we propose directly leveraging useful clues from LQ images in the diffusion process by generating from pixel-error-based pre-restored LQ images with mild noise added. Moreover, we employ a small number of denoising steps to prevent the generation of redundant details that dilute crucial task-related information. We demonstrate that our method effectively utilizes diffusion prior for TDIR, significantly enhancing task performance and visual quality across diverse tasks with multiple complex degradations.
中文: 提出的EDTR方法有效利用扩散先验,通过基于像素误差的预修复和可控去噪步骤,在多重复杂退化图像中恢复任务相关细节,显著提升了任务性能和视觉质量。
English: The proposed EDTR method effectively leverages diffusion prior to restore task-relevant details in images degraded by multiple complex factors, enhancing both task performance and visual quality through pixel-error-based restoration and controlled denoising steps.
Authors:Jiuming Liu, Zheng Huang, Mengmeng Liu, Tianchen Deng, Francesco Nex, Hao Cheng, Hesheng Wang
Abstract:
LiDAR scene generation is critical for mitigating real-world LiDAR data collection costs and enhancing the robustness of downstream perception tasks in autonomous driving. However, existing methods commonly struggle to capture geometric realism and global topological consistency. Recent LiDAR Diffusion Models (LiDMs) predominantly embed LiDAR points into the latent space for improved generation efficiency, which limits their interpretable ability to model detailed geometric structures and preserve global topological consistency. To address these challenges, we propose TopoLiDM, a novel framework that integrates graph neural networks (GNNs) with diffusion models under topological regularization for high-fidelity LiDAR generation. Our approach first trains a topological-preserving VAE to extract latent graph representations by graph construction and multiple graph convolutional layers. Then we freeze the VAE and generate novel latent topological graphs through the latent diffusion models. We also introduce 0-dimensional persistent homology (PH) constraints, ensuring the generated LiDAR scenes adhere to real-world global topological structures. Extensive experiments on the KITTI-360 dataset demonstrate TopoLiDM's superiority over state-of-the-art methods, achieving improvements of 22.6% lower Frechet Range Image Distance (FRID) and 9.2% lower Minimum Matching Distance (MMD). Notably, our model also enables fast generation speed with an average inference time of 1.68 samples/s, showcasing its scalability for real-world applications. We will release the related codes at https://github.com/IRMVLab/TopoLiDM.
中文: TopoLiDM是一种新颖的框架,通过在图神经网络与扩散模型中引入拓扑正则化,能够生成高保真度的激光雷达场景,在几何真实性和全局拓扑一致性方面显著优于现有方法。
English: TopoLiDM is a novel framework that integrates graph neural networks with diffusion models under topological regularization to generate high-fidelity LiDAR scenes, achieving superior geometric realism and global topological consistency compared to existing methods.
Authors:Youngho Kim, Hoonhee Cho, Kuk-Jin Yoon
Abstract:
Human pose estimation is critical for applications such as rehabilitation, sports analytics, and AR/VR systems. However, rapid motion and low-light conditions often introduce motion blur, significantly degrading pose estimation due to the domain gap between sharp and blurred images. Most datasets assume stable conditions, making models trained on sharp images struggle in blurred environments. To address this, we introduce a novel domain adaptation approach that leverages event cameras, which capture high temporal resolution motion data and are inherently robust to motion blur. Using event-based augmentation, we generate motion-aware blurred images, effectively bridging the domain gap between sharp and blurred domains without requiring paired annotations. Additionally, we develop a student-teacher framework that iteratively refines pseudo-labels, leveraging mutual uncertainty masking to eliminate incorrect labels and enable more effective learning. Experimental results demonstrate that our approach outperforms conventional domain-adaptive human pose estimation methods, achieving robust pose estimation under motion blur without requiring annotations in the target domain. Our findings highlight the potential of event cameras as a scalable and effective solution for domain adaptation in real-world motion blur environments. Our project codes are available at https://github.com/kmax2001/EvSharp2Blur.
中文: 本研究提出一种利用事件相机生成运动感知模糊图像的领域自适应方法,并结合师生框架与互不确定性掩码,在无需目标域标注的情况下实现了运动模糊下的鲁棒人体姿态估计。
English: This study introduces a domain adaptation method using event cameras to generate motion-aware blurred images and a student-teacher framework with mutual uncertainty masking, achieving robust human pose estimation under motion blur without target domain annotations.
Authors:Yixuan Nan, Xixun Lin, Yanmin Shang, Zhuofan Li, Can Zhao, Yanan Cao
Abstract:
Network alignment has attracted widespread attention in various fields. However, most existing works mainly focus on the problem of label sparsity, while overlooking the issue of noise in network alignment, which can substantially undermine model performance. Such noise mainly includes structural noise from noisy edges and labeling noise caused by human-induced and process-driven errors. To address these problems, we propose RANA, a Robust Active learning framework for noisy Network Alignment. RANA effectively tackles both structure noise and label noise while addressing the sparsity of anchor link annotations, which can improve the robustness of network alignment models. Specifically, RANA introduces the proposed Noise-aware Selection Module and the Label Denoising Module to address structural noise and labeling noise, respectively. In the first module, we design a noise-aware maximization objective to select node pairs, incorporating a cleanliness score to address structural noise. In the second module, we propose a novel multi-source fusion denoising strategy that leverages model and twin node pairs labeling to provide more accurate labels for node pairs. Empirical results on three real-world datasets demonstrate that RANA outperforms state-of-the-art active learning-based methods in alignment accuracy. Our code is available at https://github.com/YXNan0110/RANA.
Chinese: 提出的RANA框架通过噪声感知选择和标签去噪模块,有效解决了网络对齐中的结构噪声和标注噪声问题,在真实数据集上的对齐精度优于现有方法。
English: The proposed RANA framework enhances network alignment by addressing structural and labeling noise through its Noise-aware Selection and Label Denoising modules, outperforming existing methods in accuracy on real-world datasets.
Authors:Phi Van Nguyen, Ngoc Huynh Trinh, Duy Minh Lam Nguyen, Phu Loc Nguyen, Quoc Long Tran
Abstract:
Quantifying aleatoric uncertainty in medical image segmentation is critical since it is a reflection of the natural variability observed among expert annotators. A conventional approach is to model the segmentation distribution using the generative model, but current methods limit the expression ability of generative models. While current diffusion-based approaches have demonstrated impressive performance in approximating the data distribution, their inherent stochastic sampling process and inability to model exact densities limit their effectiveness in accurately capturing uncertainty. In contrast, our proposed method leverages conditional flow matching, a simulation-free flow-based generative model that learns an exact density, to produce highly accurate segmentation results. By guiding the flow model on the input image and sampling multiple data points, our approach synthesizes segmentation samples whose pixel-wise variance reliably reflects the underlying data distribution. This sampling strategy captures uncertainties in regions with ambiguous boundaries, offering robust quantification that mirrors inter-annotator differences. Experimental results demonstrate that our method not only achieves competitive segmentation accuracy but also generates uncertainty maps that provide deeper insights into the reliability of the segmentation outcomes. The code for this paper is freely available at https://github.com/huynhspm/Data-Uncertainty
中文摘要:本研究提出一种基于条件流匹配的方法,通过生成反映标注者间差异的多个分割样本来精确量化医学图像分割中的偶然不确定性,既实现了有竞争力的分割精度,又生成了具有深刻见解的不确定性图谱。
English Summary: This study introduces a method using conditional flow matching to accurately quantify aleatoric uncertainty in medical image segmentation by generating multiple segmentation samples that reflect inter-annotator variability, achieving both competitive accuracy and insightful uncertainty maps.
Authors:Phi Van Nguyen, Ngoc Huynh Trinh, Duy Minh Lam Nguyen, Phu Loc Nguyen, Quoc Long Tran
Abstract:
Quantifying aleatoric uncertainty in medical image segmentation is critical since it is a reflection of the natural variability observed among expert annotators. A conventional approach is to model the segmentation distribution using the generative model, but current methods limit the expression ability of generative models. While current diffusion-based approaches have demonstrated impressive performance in approximating the data distribution, their inherent stochastic sampling process and inability to model exact densities limit their effectiveness in accurately capturing uncertainty. In contrast, our proposed method leverages conditional flow matching, a simulation-free flow-based generative model that learns an exact density, to produce highly accurate segmentation results. By guiding the flow model on the input image and sampling multiple data points, our approach synthesizes segmentation samples whose pixel-wise variance reliably reflects the underlying data distribution. This sampling strategy captures uncertainties in regions with ambiguous boundaries, offering robust quantification that mirrors inter-annotator differences. Experimental results demonstrate that our method not only achieves competitive segmentation accuracy but also generates uncertainty maps that provide deeper insights into the reliability of the segmentation outcomes. The code for this paper is freely available at https://github.com/huynhspm/Data-Uncertainty
中文摘要:本研究提出一种基于条件流匹配的方法,通过生成反映标注者间差异的多个分割样本来精确量化医学图像分割中的偶然不确定性,既实现了有竞争力的分割精度,又生成了具有深刻见解的不确定性图谱。
English Summary: This study introduces a method using conditional flow matching to accurately quantify aleatoric uncertainty in medical image segmentation by generating multiple segmentation samples that reflect inter-annotator variability, achieving both competitive accuracy and insightful uncertainty maps.
Authors:Sijie Wang, Siqi Li, Yawei Zhang, Shangshu Yu, Shenghai Yuan, Rui She, Quanjiang Guo, JinXuan Zheng, Ong Kang Howe, Leonrich Chandra, Shrivarshann Srijeyan, Aditya Sivadas, Toshan Aggarwal, Heyuan Liu, Hongming Zhang, Chujie Chen, Junyu Jiang, Lihua Xie, Wee Peng Tay
Abstract:
Multi-modal perception is essential for unmanned aerial vehicle (UAV) operations, as it enables a comprehensive understanding of the UAVs' surrounding environment. However, most existing multi-modal UAV datasets are primarily biased toward localization and 3D reconstruction tasks, or only support map-level semantic segmentation due to the lack of frame-wise annotations for both camera images and LiDAR point clouds. This limitation prevents them from being used for high-level scene understanding tasks. To address this gap and advance multi-modal UAV perception, we introduce UAVScenes, a large-scale dataset designed to benchmark various tasks across both 2D and 3D modalities. Our benchmark dataset is built upon the well-calibrated multi-modal UAV dataset MARS-LVIG, originally developed only for simultaneous localization and mapping (SLAM). We enhance this dataset by providing manually labeled semantic annotations for both frame-wise images and LiDAR point clouds, along with accurate 6-degree-of-freedom (6-DoF) poses. These additions enable a wide range of UAV perception tasks, including segmentation, depth estimation, 6-DoF localization, place recognition, and novel view synthesis (NVS). Our dataset is available at https://github.com/sijieaaa/UAVScenes
Chinese Summary: UAVScenes是一个新的大规模多模态数据集,它在MARS-LVIG数据集基础上增加了精细的语义标注和6自由度位姿信息,为无人机感知任务如分割和三维场景理解提供了支持。
English Summary: UAVScenes is a new large-scale multi-modal dataset that enhances the MARS-LVIG dataset with detailed semantic annotations and 6-DoF poses, enabling advanced UAV perception tasks like segmentation and 3D scene understanding.
Authors:Hyeonseok Moon, Heuiseok Lim
Abstract:
The Needle-in-a-Haystack (NIAH) benchmark is widely used to evaluate Large Language Models' (LLMs) ability to understand long contexts (LC). It evaluates the capability to identify query-relevant context within extensive query-irrelevant passages. Although this method serves as a widely accepted standard for evaluating long-context understanding, our findings suggest it may overestimate the true LC capability of LLMs. We demonstrate that even state-of-the-art models such as GPT-4o struggle to intactly incorporate given contexts made up of solely query-relevant ten sentences. In response, we introduce a novel benchmark, \textbf{NeedleChain}, where the context consists entirely of query-relevant information, requiring the LLM to fully grasp the input to answer correctly. Our benchmark allows for flexible context length and reasoning order, offering a more comprehensive analysis of LLM performance. Additionally, we propose an extremely simple yet compelling strategy to improve LC understanding capability of LLM: ROPE Contraction. Our experiments with various advanced LLMs reveal a notable disparity between their ability to process large contexts and their capacity to fully understand them. Source code and datasets are available at https://github.com/hyeonseokk/NeedleChain
中文: "大海捞针"基准可能高估了大语言模型的长上下文理解能力,因为即使先进模型也难以处理纯相关内容,为此我们提出了NeedleChain基准和ROPE收缩策略以更准确地评估和提升性能。
English: The Needle-in-a-Haystack benchmark may overestimate LLMs' long-context understanding, as even advanced models struggle with purely relevant contexts, prompting the introduction of NeedleChain and ROPE Contraction for more accurate evaluation and improvement.
Authors:Anubhav Kataria, Surbhi Madan, Shreya Ghosh, Tom Gedeon, Abhinav Dhall
Abstract:
Understanding individual, group and event level emotions along with contextual information is crucial for analyzing a multi-person social situation. To achieve this, we frame emotion comprehension as the task of predicting fine-grained individual emotion to coarse grained group and event level emotion. We introduce GEMS that leverages a multimodal swin-transformer and S3Attention based architecture, which processes an input scene, group members, and context information to generate joint predictions. Existing multi-person emotion related benchmarks mainly focus on atomic interactions primarily based on emotion perception over time and group level. To this end, we extend and propose VGAF-GEMS to provide more fine grained and holistic analysis on top of existing group level annotation of VGAF dataset. GEMS aims to predict basic discrete and continuous emotions (including valence and arousal) as well as individual, group and event level perceived emotions. Our benchmarking effort links individual, group and situational emotional responses holistically. The quantitative and qualitative comparisons with adapted state-of-the-art models demonstrate the effectiveness of GEMS framework on VGAF-GEMS benchmarking. We believe that it will pave the way of further research. The code and data is available at: https://github.com/katariaak579/GEMS
中文: GEMS框架通过多模态转换器和注意力机制,实现了从个体到群体及事件层面的细粒度情感预测,在扩展的VGAF-GEMS基准测试中展现出对社交场景情感分析的全面优势。
English: The GEMS framework utilizes multimodal transformers and attention mechanisms to predict fine-grained individual, group, and event-level emotions, demonstrating superior performance on the extended VGAF-GEMS benchmark for holistic social emotion analysis.
Authors:Jia Li, Yichao He, Jiacheng Xu, Tianhao Luo, Zhenzhen Hu, Richang Hong, Meng Wang
Abstract:
Accurate and reliable personality assessment plays a vital role in many fields, such as emotional intelligence, mental health diagnostics, and personalized education. Unlike fleeting emotions, personality traits are stable, often subconsciously leaked through language, facial expressions, and body behaviors, with asynchronous patterns across modalities. It was hard to model personality semantics with traditional superficial features and seemed impossible to achieve effective cross-modal understanding. To address these challenges, we propose a novel personality assessment framework called \textit{\textbf{Traits Run Deep}}. It employs \textit{\textbf{psychology-informed prompts}} to elicit high-level personality-relevant semantic representations. Besides, it devises a \textit{\textbf{Text-Centric Trait Fusion Network}} that anchors rich text semantics to align and integrate asynchronous signals from other modalities. To be specific, such fusion module includes a Chunk-Wise Projector to decrease dimensionality, a Cross-Modal Connector and a Text Feature Enhancer for effective modality fusion and an ensemble regression head to improve generalization in data-scarce situations. To our knowledge, we are the first to apply personality-specific prompts to guide large language models (LLMs) in extracting personality-aware semantics for improved representation quality. Furthermore, extracting and fusing audio-visual apparent behavior features further improves the accuracy. Experimental results on the AVI validation set have demonstrated the effectiveness of the proposed components, i.e., approximately a 45\% reduction in mean squared error (MSE). Final evaluations on the test set of the AVI Challenge 2025 confirm our method's superiority, ranking first in the Personality Assessment track. The source code will be made available at https://github.com/MSA-LMC/TraitsRunDeep.
中文摘要:提出的"Traits Run Deep"框架通过心理学引导提示和以文本为核心的多模态融合网络,实现了跨模态人格评估,在AVI验证集上均方误差降低约45%,并在2025年AVI挑战赛人格评估赛道荣获第一名。
English Summary: The proposed "Traits Run Deep" framework employs psychology-informed prompts and a text-centric fusion network to achieve cross-modal personality assessment, demonstrating superior performance with a 45% MSE reduction and winning first place in the AVI Challenge 2025.
Authors:Mykola Maslych, Mohammadreza Katebi, Christopher Lee, Yahya Hmaiti, Amirpouya Ghasemaghaei, Christian Pumarada, Janneese Palmer, Esteban Segarra Martinez, Marco Emporio, Warren Snipes, Ryan P. McMahan, Joseph J. LaViola
Abstract:
We investigated the challenges of mitigating response delays in free-form conversations with virtual agents powered by Large Language Models (LLMs) within Virtual Reality (VR). For this, we used conversational fillers, such as gestures and verbal cues, to bridge delays between user input and system responses and evaluate their effectiveness across various latency levels and interaction scenarios. We found that latency above 4 seconds degrades quality of experience, while natural conversational fillers improve perceived response time, especially in high-delay conditions. Our findings provide insights for practitioners and researchers to optimize user engagement whenever conversational systems' responses are delayed by network limitations or slow hardware. We also contribute an open-source pipeline that streamlines deploying conversational agents in virtual environments.
中文: 本研究探讨了在VR环境中使用如手势和言语提示等对话填充物来缓解大语言模型响应延迟的问题,发现超过4秒的延迟会损害用户体验,而自然的填充物能提升感知响应性,并为优化延迟交互提供了见解和开源工具链。
English: This study explores using conversational fillers like gestures and verbal cues to mitigate response delays in VR-based LLM conversations, finding that delays over 4 seconds harm user experience while natural fillers improve perceived responsiveness, with insights and an open-source pipeline provided for optimizing delayed interactions.
Authors:Pei Deng, Wenqian Zhou, Hanlin Wu
Abstract:
Accurate interpretation of land-cover changes in multi-temporal satellite imagery is critical for real-world scenarios. However, existing methods typically provide only one-shot change masks or static captions, limiting their ability to support interactive, query-driven analysis. In this work, we introduce remote sensing image change analysis (RSICA) as a new paradigm that combines the strengths of change detection and visual question answering to enable multi-turn, instruction-guided exploration of changes in bi-temporal remote sensing images. To support this task, we construct ChangeChat-105k, a large-scale instruction-following dataset, generated through a hybrid rule-based and GPT-assisted process, covering six interaction types: change captioning, classification, quantification, localization, open-ended question answering, and multi-turn dialogues. Building on this dataset, we propose DeltaVLM, an end-to-end architecture tailored for interactive RSICA. DeltaVLM features three innovations: (1) a fine-tuned bi-temporal vision encoder to capture temporal differences; (2) a visual difference perception module with a cross-semantic relation measuring (CSRM) mechanism to interpret changes; and (3) an instruction-guided Q-former to effectively extract query-relevant difference information from visual changes, aligning them with textual instructions. We train DeltaVLM on ChangeChat-105k using a frozen large language model, adapting only the vision and alignment modules to optimize efficiency. Extensive experiments and ablation studies demonstrate that DeltaVLM achieves state-of-the-art performance on both single-turn captioning and multi-turn interactive change analysis, outperforming existing multimodal large language models and remote sensing vision-language models. Code, dataset and pre-trained weights are available at https://github.com/hanlinwu/DeltaVLM.
中文: 本文提出RSICA新范式,融合变化检测与视觉问答,通过DeltaVLM模型和ChangeChat-105k数据集实现对双时相遥感图像的交互式变化分析,在多轮变化探索中达到最优性能。
English: This paper introduces RSICA, a new paradigm combining change detection and visual question answering to enable interactive analysis of bi-temporal remote sensing images, supported by the DeltaVLM model and ChangeChat-105k dataset, achieving state-of-the-art performance in multi-turn change exploration.
Authors:Zhicheng Song, Jinglan Xu, Chunxin Zheng, Yulin Li, Zhihai Bi, Jun Ma
Abstract:
Wheel-legged robots integrate the agility of legs for navigating rough terrains while harnessing the efficiency of wheels for smooth surfaces. However, most existing designs do not fully capitalize on the benefits of both legged and wheeled structures, which limits overall system flexibility and efficiency. We present FLORES (reconfigured wheel-legged robot for enhanced steering and adaptability), a novel wheel-legged robot design featuring a distinctive front-leg configuration that sets it beyond standard design approaches. Specifically, FLORES replaces the conventional hip-roll degree of freedom (DoF) of the front leg with hip-yaw DoFs, and this allows for efficient movement on flat surfaces while ensuring adaptability when navigating complex terrains. This innovative design facilitates seamless transitions between different locomotion modes (i.e., legged locomotion and wheeled locomotion) and optimizes the performance across varied environments. To fully exploit FLORES's mechanical capabilities, we develop a tailored reinforcement learning (RL) controller that adapts the Hybrid Internal Model (HIM) with a customized reward structure optimized for our unique mechanical configuration. This framework enables the generation of adaptive, multi-modal locomotion strategies that facilitate smooth transitions between wheeled and legged movements. Furthermore, our distinctive joint design enables the robot to exhibit novel and highly efficient locomotion gaits that capitalize on the synergistic advantages of both locomotion modes. Through comprehensive experiments, we demonstrate FLORES's enhanced steering capabilities, improved navigation efficiency, and versatile locomotion across various terrains. The open-source project can be found at https://github.com/ZhichengSong6/FLORES-A-Reconfigured-Wheel-Legged-Robot-for-Enhanced-Steering-and-Adaptability.git.
中文: FLORES提出了一种新颖的轮腿机器人设计,其独特的前腿配置将髋部滚动自由度替换为髋部偏航自由度,实现在平坦表面高效移动和在复杂地形灵活适应,同时通过定制的强化学习控制器实现轮式与腿式运动模式间的平滑切换,从而提升在不同环境中的转向与导航性能。
English: FLORES introduces a novel wheel-legged robot design with a unique front-leg configuration that replaces the hip-roll degree of freedom with hip-yaw DoFs, enabling efficient movement on flat surfaces and adaptability on complex terrains, while a tailored reinforcement learning controller facilitates smooth transitions between wheeled and legged locomotion modes for enhanced steering and navigation across varied environments.
Authors:Romulo B. da Silva, Diego Passos, Cássio M. Oishi, J. Nathan Kutz
Abstract:
We present CS-SHRED, a novel deep learning architecture that integrates Compressed Sensing (CS) into a Shallow Recurrent Decoder (SHRED) to reconstruct spatiotemporal dynamics from incomplete, compressed, or corrupted data. Our approach introduces two key innovations. First, by incorporating CS techniques into the SHRED architecture, our method leverages a batch-based forward framework with $\ell_1$ regularization to robustly recover signals even in scenarios with sparse sensor placements, noisy measurements, and incomplete sensor acquisitions. Second, an adaptive loss function dynamically combines Mean Squared Error (MSE) and Mean Absolute Error (MAE) terms with a piecewise Signal-to-Noise Ratio (SNR) regularization, which suppresses noise and outliers in low-SNR regions while preserving fine-scale features in high-SNR regions.
We validate CS-SHRED on challenging problems including viscoelastic fluid flows, maximum specific humidity fields, sea surface temperature distributions, and rotating turbulent flows. Compared to the traditional SHRED approach, CS-SHRED achieves significantly higher reconstruction fidelity -- as demonstrated by improved SSIM and PSNR values, lower normalized errors, and enhanced LPIPS scores-thereby providing superior preservation of small-scale structures and increased robustness against noise and outliers.
Our results underscore the advantages of the jointly trained CS and SHRED design architecture which includes an LSTM sequence model for characterizing the temporal evolution with a shallow decoder network (SDN) for modeling the high-dimensional state space. The SNR-guided adaptive loss function for the spatiotemporal data recovery establishes CS-SHRED as a promising tool for a wide range of applications in environmental, climatic, and scientific data analyses.
中文: CS-SHRED是一种创新的深度学习模型,将压缩感知与浅层循环解码器相结合,通过自适应损失函数从残缺或含噪数据中精确重建时空动态,在环境与科学数据分析中展现出卓越的鲁棒性和保真度。
English: CS-SHRED is a novel deep learning model that combines Compressed Sensing with a Shallow Recurrent Decoder to accurately reconstruct spatiotemporal data from incomplete or noisy measurements, using an adaptive loss function for enhanced robustness and fidelity across various applications.
Authors:Phuc Truong Loc Nguyen, Thanh Hung Do
Abstract:
AI-assisted gait analysis holds promise for improving Parkinson's Disease (PD) care, but current clinical dashboards lack transparency and offer no meaningful way for clinicians to interrogate or contest AI decisions. We present Con-GaIT (Contestable Gait Interpretation & Tracking), a clinician-centered system that advances Contestable AI through a tightly integrated interface designed for interpretability, oversight, and procedural recourse. Grounded in HCI principles, ConGaIT enables structured disagreement via a novel Contest & Justify interaction pattern, supported by visual explanations, role-based feedback, and traceable justification logs. Evaluated using the Contestability Assessment Score (CAS), the framework achieves a score of 0.970, demonstrating that contestability can be operationalized through human-centered design in compliance with emerging regulatory standards. A demonstration of the framework is available at https://github.com/hungdothanh/Con-GaIT.
中文: Con-GaIT系统以临床医生为核心,通过可解释性设计和争议机制提升帕金森病步态分析的透明度,其人本设计实现了高度可争议性评估分数。
English: Con-GaIT introduces a clinician-centered AI system that enhances gait analysis for Parkinson's Disease by integrating interpretability features and a contestability framework, achieving high contestability scores through human-centered design.
Authors:Shaoan Xie, Lingjing Kong, Yujia Zheng, Yu Yao, Zeyu Tang, Eric P. Xing, Guangyi Chen, Kun Zhang
Abstract:
Contrastive Language-Image Pre-training (CLIP)~\citep{radford2021learning} has emerged as a pivotal model in computer vision and multimodal learning, achieving state-of-the-art performance at aligning visual and textual representations through contrastive learning. However, CLIP struggles with potential information misalignment in many image-text datasets and suffers from entangled representation. On the one hand, short captions for a single image in datasets like MSCOCO may describe disjoint regions in the image, leaving the model uncertain about which visual features to retain or disregard. On the other hand, directly aligning long captions with images can lead to the retention of entangled details, preventing the model from learning disentangled, atomic concepts -- ultimately limiting its generalization on certain downstream tasks involving short prompts.
In this paper, we establish theoretical conditions that enable flexible alignment between textual and visual representations across varying levels of granularity. Specifically, our framework ensures that a model can not only \emph{preserve} cross-modal semantic information in its entirety but also \emph{disentangle} visual representations to capture fine-grained textual concepts. Building on this foundation, we introduce \ours, a novel approach that identifies and aligns the most relevant visual and textual representations in a modular manner. Superior performance across various tasks demonstrates its capability to handle information misalignment and supports our identification theory. The code is available at https://github.com/Mid-Push/SmartCLIP.
中文: CLIP模型在图文数据中存在信息错位和表征纠缠的问题,而本文提出的SmartCLIP框架通过模块化对齐实现了跨模态语义的灵活匹配与解耦表示,显著提升了泛化能力。
English: CLIP faces challenges with information misalignment and entangled representations in image-text datasets, but our proposed SmartCLIP framework enables flexible cross-modal alignment and disentangled visual concepts to improve generalization.
Authors:Stéphane d'Ascoli, Jérémy Rapin, Yohann Benchetrit, Hubert Banville, Jean-Rémi King
Abstract:
Historically, neuroscience has progressed by fragmenting into specialized domains, each focusing on isolated modalities, tasks, or brain regions. While fruitful, this approach hinders the development of a unified model of cognition. Here, we introduce TRIBE, the first deep neural network trained to predict brain responses to stimuli across multiple modalities, cortical areas and individuals. By combining the pretrained representations of text, audio and video foundational models and handling their time-evolving nature with a transformer, our model can precisely model the spatial and temporal fMRI responses to videos, achieving the first place in the Algonauts 2025 brain encoding competition with a significant margin over competitors. Ablations show that while unimodal models can reliably predict their corresponding cortical networks (e.g. visual or auditory networks), they are systematically outperformed by our multimodal model in high-level associative cortices. Currently applied to perception and comprehension, our approach paves the way towards building an integrative model of representations in the human brain. Our code is available at https://github.com/facebookresearch/algonauts-2025.
中文: TRIBE作为首个多模态深度神经网络,通过整合文本、音频和视觉数据,能精确预测跨皮层区域的脑部响应,在高级联合皮层表现显著优于单模态模型,为构建统一认知模型开辟了新路径。
English: TRIBE is a pioneering multimodal deep neural network that integrates text, audio, and visual data to accurately predict brain responses across cortical regions, outperforming unimodal approaches and advancing toward a unified model of cognition.
Authors:Clark Mingxuan Ju, Liam Collins, Leonardo Neves, Bhuvesh Kumar, Louis Yufeng Wang, Tong Zhao, Neil Shah
Abstract:
Generative recommendation (GR) has gained increasing attention for its promising performance compared to traditional models. A key factor contributing to the success of GR is the semantic ID (SID), which converts continuous semantic representations (e.g., from large language models) into discrete ID sequences. This enables GR models with SIDs to both incorporate semantic information and learn collaborative filtering signals, while retaining the benefits of discrete decoding. However, varied modeling techniques, hyper-parameters, and experimental setups in existing literature make direct comparisons between GR proposals challenging. Furthermore, the absence of an open-source, unified framework hinders systematic benchmarking and extension, slowing model iteration. To address this challenge, our work introduces and open-sources a framework for Generative Recommendation with semantic ID, namely GRID, specifically designed for modularity to facilitate easy component swapping and accelerate idea iteration. Using GRID, we systematically experiment with and ablate different components of GR models with SIDs on public benchmarks. Our comprehensive experiments with GRID reveal that many overlooked architectural components in GR models with SIDs substantially impact performance. This offers both novel insights and validates the utility of an open-source platform for robust benchmarking and GR research advancement. GRID is open-sourced at https://github.com/snap-research/GRID.
Chinese: 生成式推荐通过语义ID融合语义信息和协同过滤信号提升性能,但缺乏统一框架阻碍了进展,因此我们开源了GRID平台,旨在实现系统化基准测试和加速研究迭代。
English: Generative recommendation with semantic IDs enhances performance by integrating semantic and collaborative filtering signals, yet the lack of a unified framework impedes progress, leading to the introduction of the open-source GRID platform for systematic benchmarking and accelerated research.
Authors:Zheng Zhang, Peilin Zhao, Deheng Ye, Hao Wang
Abstract:
Jailbreak attacks aim to exploit large language models (LLMs) by inducing them to generate harmful content, thereby revealing their vulnerabilities. Understanding and addressing these attacks is crucial for advancing the field of LLM safety. Previous jailbreak approaches have mainly focused on direct manipulations of harmful intent, with limited attention to the impact of persona prompts. In this study, we systematically explore the efficacy of persona prompts in compromising LLM defenses. We propose a genetic algorithm-based method that automatically crafts persona prompts to bypass LLM's safety mechanisms. Our experiments reveal that: (1) our evolved persona prompts reduce refusal rates by 50-70% across multiple LLMs, and (2) these prompts demonstrate synergistic effects when combined with existing attack methods, increasing success rates by 10-20%. Our code and data are available at https://github.com/CjangCjengh/Generic_Persona.
Chinese Summary: 本研究提出一种基于遗传算法的方法,通过自动生成角色提示有效绕过大型语言模型的安全防护,将拒绝率降低50-70%,并将现有攻击成功率提升10-20%。
English Summary: This study introduces a genetic algorithm-based method to automatically create persona prompts that effectively bypass large language model safety mechanisms, reducing refusal rates by 50-70% and enhancing existing attack success rates by 10-20%.
Authors:Umair Nawaz, Muhammad Zaigham Zaheer, Fahad Shahbaz Khan, Hisham Cholakkal, Salman Khan, Rao Muhammad Anwer
Abstract:
Crops, fisheries and livestock form the backbone of global food production, essential to feed the ever-growing global population. However, these sectors face considerable challenges, including climate variability, resource limitations, and the need for sustainable management. Addressing these issues requires efficient, accurate, and scalable technological solutions, highlighting the importance of artificial intelligence (AI). This survey presents a systematic and thorough review of more than 200 research works covering conventional machine learning approaches, advanced deep learning techniques (e.g., vision transformers), and recent vision-language foundation models (e.g., CLIP) in the agriculture domain, focusing on diverse tasks such as crop disease detection, livestock health management, and aquatic species monitoring. We further cover major implementation challenges such as data variability and experimental aspects: datasets, performance evaluation metrics, and geographical focus. We finish the survey by discussing potential open research directions emphasizing the need for multimodal data integration, efficient edge-device deployment, and domain-adaptable AI models for diverse farming environments. Rapid growth of evolving developments in this field can be actively tracked on our project page: https://github.com/umair1221/AI-in-Agriculture
中文: 本综述系统梳理了农业领域200多项人工智能应用,涵盖从传统机器学习到先进视觉语言模型的技术,重点探讨了作物病害检测、畜牧健康管理等任务的实施挑战与未来研究方向。
English: This survey systematically reviews over 200 AI applications in agriculture—from traditional machine learning to advanced vision-language models—addressing challenges like crop disease detection and livestock monitoring while highlighting implementation hurdles and future research directions.
Authors:Sicheng Zhang, Binzhu Xie, Zhonghao Yan, Yuli Zhang, Donghao Zhou, Xiaofei Chen, Shi Qiu, Jiaqi Liu, Guoyang Xie, Zhichao Lu
Abstract:
Model performance in text-to-image (T2I) and image-to-image (I2I) generation often depends on multiple aspects, including quality, alignment, diversity, and robustness. However, models' complex trade-offs among these dimensions have rarely been explored due to (1) the lack of datasets that allow fine-grained quantification of these trade-offs, and (2) the use of a single metric for multiple dimensions. To bridge this gap, we introduce TRIG-Bench (Trade-offs in Image Generation), which spans 10 dimensions (Realism, Originality, Aesthetics, Content, Relation, Style, Knowledge, Ambiguity, Toxicity, and Bias), contains 40,200 samples, and covers 132 pairwise dimensional subsets. Furthermore, we develop TRIGScore, a VLM-as-judge metric that automatically adapts to various dimensions. Based on TRIG-Bench and TRIGScore, we evaluate 14 models across T2I and I2I tasks. In addition, we propose the Relation Recognition System to generate the Dimension Trade-off Map (DTM) that visualizes the trade-offs among model-specific capabilities. Our experiments demonstrate that DTM consistently provides a comprehensive understanding of the trade-offs between dimensions for each type of generative model. Notably, we show that the model's dimension-specific weaknesses can be mitigated through fine-tuning on DTM to enhance overall performance. Code is available at: https://github.com/fesvhtr/TRIG
中文:TRIG-Bench和TRIGScore的提出旨在评估文生图与图生图在十个维度上的权衡关系,通过维度权衡图实现模型能力的全面分析,并可通过微调有效提升生成性能。
English: TRIG-Bench and TRIGScore are introduced to evaluate trade-offs across ten dimensions in text-to-image and image-to-image generation, enabling comprehensive model analysis and performance enhancement through fine-tuning.
Authors:Joy Arulraj
Abstract:
System design is often taught through domain-specific solutions specific to particular domains, such as databases, operating systems, or computer architecture, each with its own methods and vocabulary. While this diversity is a strength, it can obscure cross-cutting principles that recur across domains. This paper proposes a preliminary "periodic table" of system design principles distilled from several domains in computer systems. The goal is a shared, concise vocabulary that helps students, researchers, and practitioners reason about structure and trade-offs, compare designs across domains, and communicate choices more clearly. For supporting materials and updates, please refer to the repository at: https://github.com/jarulraj/periodic-table.
中文: 本文提出一个初步的系统设计“元素周期表”,旨在创建通用术语,帮助理解跨领域结构权衡和比较不同计算机系统设计。
English: This paper introduces a preliminary "periodic table" of system design principles to establish a shared vocabulary that aids in understanding structural trade-offs and comparing designs across various computer system domains.
Authors:Honghua Dong, Jiacheng Yang, Xun Deng, Yuhe Jiang, Gennady Pekhimenko, Fan Long, Xujie Si
Abstract:
Type inference for dynamic languages like Python is a persistent challenge in software engineering. While large language models (LLMs) have shown promise in code understanding, their type inference capabilities remain underexplored. We introduce TypyBench, a benchmark designed to evaluate LLMs' type inference across entire Python repositories. TypyBench features two novel metrics: TypeSim, which captures nuanced semantic relationships between predicted and ground truth types, and TypeCheck, which assesses type consistency across codebases. Our evaluation of various LLMs on a curated dataset of 50 high-quality Python repositories reveals that, although LLMs achieve decent TypeSim scores, they struggle with complex nested types and exhibit significant type consistency errors. These findings suggest that future research should shift focus from improving type similarity to addressing repository-level consistency. TypyBench provides a foundation for this new direction, offering insights into model performance across different type complexities and usage contexts. Our code and data are available at https://github.com/typybench/typybench.
中文摘要:TypyBench是一个评估大语言模型在Python仓库中类型推断能力的新基准,发现模型虽在类型相似度上表现良好,但在处理复杂嵌套类型和保持代码库一致性方面存在显著不足。
English Summary: TypyBench is a new benchmark for evaluating large language models' type inference in Python repositories, revealing their strengths in type similarity but weaknesses in handling complex nested types and maintaining consistency across codebases.
Authors:Minghao Guo, Qingcheng Zeng, Xujiang Zhao, Yanchi Liu, Wenchao Yu, Mengnan Du, Haifeng Chen, Wei Cheng
Abstract:
Large Language Models (LLMs) excel at many reasoning tasks but struggle with knowledge-intensive queries due to their inability to dynamically access up-to-date or domain-specific information. Retrieval-Augmented Generation (RAG) has emerged as a promising solution, enabling LLMs to ground their responses in external sources. However, existing RAG methods lack fine-grained control over both the query and source sides, often resulting in noisy retrieval and shallow reasoning. In this work, we introduce DeepSieve, an agentic RAG framework that incorporates information sieving via LLM-as-a-knowledge-router. DeepSieve decomposes complex queries into structured sub-questions and recursively routes each to the most suitable knowledge source, filtering irrelevant information through a multi-stage distillation process. Our design emphasizes modularity, transparency, and adaptability, leveraging recent advances in agentic system design. Experiments on multi-hop QA tasks across heterogeneous sources demonstrate improved reasoning depth, retrieval precision, and interpretability over conventional RAG approaches. Our codes are available at https://github.com/MinghoKwok/DeepSieve.
中文: 大语言模型在处理知识密集型查询时存在不足,而DeepSieve框架通过分解查询和多阶段信息过滤,提升了检索增强生成的推理深度与检索精度。
English: Large Language Models struggle with knowledge-intensive queries, but the proposed DeepSieve framework enhances Retrieval-Augmented Generation by decomposing queries and filtering information through multi-stage distillation, improving reasoning depth and retrieval precision.
Authors:Shuquan Lian, Yuhang Wu, Jia Ma, Yifan Ding, Zihan Song, Bingqi Chen, Xiawu Zheng, Hui Li
Abstract:
The emergence of Multimodal Large Language Models (MLLMs) has driven significant advances in Graphical User Interface (GUI) agent capabilities. Nevertheless, existing GUI agent training and inference techniques still suffer from a dilemma for reasoning designs, ineffective reward, and visual noise. To address these issues, we introduce UI-AGILE for enhancing GUI agents at both training and inference. For training, we propose a suite of improvements to the Supervised Fine-Tuning (SFT) process: 1) a continuous reward function to incentivize high-precision grounding; 2) a ``Simple Thinking'' reward to balance planning with speed and grounding accuracy; and 3) a cropping-based resampling strategy to mitigate the sparse reward problem and improve learning on complex tasks. For inference, we present decomposed grounding with selection to dramatically improve grounding accuracy on high-resolution displays by breaking the image into smaller, manageable parts. Experiments show that UI-AGILE achieves the state-of-the-art grounding performance on two benchmarks ScreenSpot-Pro and ScreenSpot-v2 while it also exhibits strong general agent capabilities. For instance, using both our training and inference enhancement methods brings 23\% grounding accuracy improvement over the best baseline on ScreenSpot-Pro. We provide the code in https://github.com/KDEGroup/UI-AGILE.
中文:UI-AGILE通过连续奖励和裁剪策略优化训练,并采用分解式定位改进推理,在基准测试中实现了图形用户界面代理的最先进性能。
English: UI-AGILE introduces training enhancements like continuous rewards and cropping strategies, along with inference improvements through decomposed grounding, achieving state-of-the-art GUI agent performance on benchmarks.
Authors:Ziyun Dai, Xiaoqiang Li, Shaohua Zhang, Yuanchen Wu, Jide Li
Abstract:
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in visual understanding and multimodal reasoning. However, LVLMs frequently exhibit hallucination phenomena, manifesting as the generated textual responses that demonstrate inconsistencies with the provided visual content. Existing hallucination mitigation methods are predominantly text-centric, the challenges of visual-semantic alignment significantly limit their effectiveness, especially when confronted with fine-grained visual understanding scenarios. To this end, this paper presents ViHallu, a Vision-Centric Hallucination mitigation framework that enhances visual-semantic alignment through Visual Variation Image Generation and Visual Instruction Construction. ViHallu introduces visual variation images with controllable visual alterations while maintaining the overall image structure. These images, combined with carefully constructed visual instructions, enable LVLMs to better understand fine-grained visual content through fine-tuning, allowing models to more precisely capture the correspondence between visual content and text, thereby enhancing visual-semantic alignment. Extensive experiments on multiple benchmarks show that ViHallu effectively enhances models' fine-grained visual understanding while significantly reducing hallucination tendencies. Furthermore, we release ViHallu-Instruction, a visual instruction dataset specifically designed for hallucination mitigation and visual-semantic alignment. Code is available at https://github.com/oliviadzy/ViHallu.
Chinese: ViHallu是一种以视觉为中心的幻觉缓解框架,通过生成视觉变化图像和构建视觉指令,增强大型视觉语言模型的细粒度视觉理解能力,有效减少幻觉现象并提升视觉语义对齐。
English: ViHallu is a vision-centric framework that mitigates hallucinations in Large Vision-Language Models by generating visual variation images and constructing visual instructions, thereby enhancing fine-grained visual understanding and alignment between visual content and text.
Authors:Jihao Gu, Kun Li, Fei Wang, Yanyan Wei, Zhiliang Wu, Hehe Fan, Meng Wang
Abstract:
Micro-Actions (MAs) are an important form of non-verbal communication in social interactions, with potential applications in human emotional analysis. However, existing methods in Micro-Action Recognition often overlook the inherent subtle changes in MAs, which limits the accuracy of distinguishing MAs with subtle changes. To address this issue, we present a novel Motion-guided Modulation Network (MMN) that implicitly captures and modulates subtle motion cues to enhance spatial-temporal representation learning. Specifically, we introduce a Motion-guided Skeletal Modulation module (MSM) to inject motion cues at the skeletal level, acting as a control signal to guide spatial representation modeling. In parallel, we design a Motion-guided Temporal Modulation module (MTM) to incorporate motion information at the frame level, facilitating the modeling of holistic motion patterns in micro-actions. Finally, we propose a motion consistency learning strategy to aggregate the motion cues from multi-scale features for micro-action classification. Experimental results on the Micro-Action 52 and iMiGUE datasets demonstrate that MMN achieves state-of-the-art performance in skeleton-based micro-action recognition, underscoring the importance of explicitly modeling subtle motion cues. The code will be available at https://github.com/momiji-bit/MMN.
中文摘要:提出的运动引导调制网络(MMN)通过骨骼和时序调制模块有效捕捉细微运动线索,显著提升了微动作识别性能,在基准数据集上取得了最优结果。
English Summary: The proposed Motion-guided Modulation Network (MMN) effectively captures subtle motion cues through skeletal and temporal modulation modules to improve micro-action recognition, achieving state-of-the-art results on benchmark datasets.
Authors:Théo Ladune, Thomas Leguay, Pierrick Philippe, Gordon Clare, Félix Henry
Abstract:
Motion compensation is a key component of video codecs. Conventional codecs (HEVC and VVC) have carefully refined this coding step, with an important focus on sub-pixel motion compensation. On the other hand, learned codecs achieve sub-pixel motion compensation through simple bilinear filtering. This paper offers to improve learned codec motion compensation by drawing inspiration from conventional codecs. It is shown that the usage of more advanced interpolation filters, block-based motion information and finite motion accuracy lead to better compression performance and lower decoding complexity. Experimental results are provided on the Cool-chic video codec, where we demonstrate a rate decrease of more than 10% and a lowering of motion-related decoding complexity from 391 MAC per pixel to 214 MAC per pixel. All contributions are made open-source at https://github.com/Orange-OpenSource/Cool-Chic
中文摘要:本文通过借鉴传统编解码器的先进插值滤波器、基于块的运动信息和有限运动精度,改进了学习型视频编解码器的运动补偿,在Cool-chic编解码器中实现码率降低超10%且运动相关解码复杂度减半。
English Summary: This paper enhances learned video codecs by integrating advanced interpolation filters, block-based motion information, and finite motion accuracy from conventional codecs, achieving over 10% rate reduction and nearly halving motion-related decoding complexity in the Cool-chic codec.
Authors:Tianhong Gao, Yannian Fu, Weiqun Wu, Haixiao Yue, Shanshan Liu, Gang Zhang
Abstract:
Large Language Models (LLMs), enhanced through agent tuning, have demonstrated remarkable capabilities in Chain-of-Thought (CoT) and tool utilization, significantly surpassing the performance of standalone models. However, the multimodal domain still lacks a large-scale, high-quality agent tuning dataset to unlock the full potential of multimodal large language models. To bridge this gap, we introduce MMAT-1M, the first million-scale multimodal agent tuning dataset designed to support CoT, reflection, and dynamic tool usage. Our dataset is constructed through a novel four-stage data engine: 1) We first curate publicly available multimodal datasets containing question-answer pairs; 2) Then, leveraging GPT-4o, we generate rationales for the original question-answer pairs and dynamically integrate API calls and Retrieval Augmented Generation (RAG) information through a multi-turn paradigm; 3) Furthermore, we refine the rationales through reflection to ensure logical consistency and accuracy, creating a multi-turn dialogue dataset with both Rationale and Reflection (RR); 4) Finally, to enhance efficiency, we optionally compress multi-turn dialogues into a One-turn Rationale and Reflection (ORR) format. By fine-tuning open-source multimodal models on the MMAT-1M, we observe significant performance gains. For instance, the InternVL2.5-8B-RR model achieves an average improvement of 2.7% across eight public benchmarks and 8.8% on the RAG benchmark Dyn-VQA, demonstrating the dataset's effectiveness in enhancing multimodal reasoning and tool-based capabilities. The dataset is publicly available at https://github.com/VIS-MPU-Agent/MMAT-1M.
中文: 研究者推出了首个百万规模的多模态智能体调优数据集MMAT-1M,通过四阶段数据引擎构建思维链与工具调用能力,显著提升了多模态模型在推理任务中的表现。
English: Researchers developed MMAT-1M, a million-scale multimodal agent tuning dataset featuring Chain-of-Thought reasoning and tool usage, which significantly boosts the performance of multimodal models on benchmarks.
Authors:Nicola Fanelli, Gennaro Vessio, Giovanna Castellano
Abstract:
Analyzing digitized artworks presents unique challenges, requiring not only visual interpretation but also a deep understanding of rich artistic, contextual, and historical knowledge. We introduce ArtSeek, a multimodal framework for art analysis that combines multimodal large language models with retrieval-augmented generation. Unlike prior work, our pipeline relies only on image input, enabling applicability to artworks without links to Wikidata or Wikipedia-common in most digitized collections. ArtSeek integrates three key components: an intelligent multimodal retrieval module based on late interaction retrieval, a contrastive multitask classification network for predicting artist, genre, style, media, and tags, and an agentic reasoning strategy enabled through in-context examples for complex visual question answering and artwork explanation via Qwen2.5-VL. Central to this approach is WikiFragments, a Wikipedia-scale dataset of image-text fragments curated to support knowledge-grounded multimodal reasoning. Our framework achieves state-of-the-art results on multiple benchmarks, including a +8.4% F1 improvement in style classification over GraphCLIP and a +7.1 BLEU@1 gain in captioning on ArtPedia. Qualitative analyses show that ArtSeek can interpret visual motifs, infer historical context, and retrieve relevant knowledge, even for obscure works. Though focused on visual arts, our approach generalizes to other domains requiring external knowledge, supporting scalable multimodal AI research. Both the dataset and the source code will be made publicly available at https://github.com/cilabuniba/artseek.
中文: ArtSeek是一种多模态框架,仅通过图像输入分析艺术品,结合检索增强生成技术,并利用维基百科规模的数据集提升视觉与背景理解能力,取得了领先水平的成果。
English: ArtSeek is a multimodal framework that analyzes artworks using only image inputs, integrating retrieval-augmented generation and achieving state-of-the-art results by leveraging a Wikipedia-scale dataset for enhanced visual and contextual understanding.
Authors:Shengjia Chen, Ruchika Verma, Kevin Clare, Jannes Jegminat, Eugenia Alleva, Kuan-lin Huang, Brandon Veremis, Thomas Fuchs, Gabriele Campanella
Abstract:
Artificial Intelligence (AI) has demonstrated success in computational pathology (CPath) for disease detection, biomarker classification, and prognosis prediction. However, its potential to learn unintended demographic biases, particularly those related to social determinants of health, remains understudied. This study investigates whether deep learning models can predict self-reported race from digitized dermatopathology slides and identifies potential morphological shortcuts. Using a multisite dataset with a racially diverse population, we apply an attention-based mechanism to uncover race-associated morphological features. After evaluating three dataset curation strategies to control for confounding factors, the final experiment showed that White and Black demographic groups retained high prediction performance (AUC: 0.799, 0.762), while overall performance dropped to 0.663. Attention analysis revealed the epidermis as a key predictive feature, with significant performance declines when these regions were removed. These findings highlight the need for careful data curation and bias mitigation to ensure equitable AI deployment in pathology. Code available at: https://github.com/sinai-computational-pathology/CPath_SAIF.
中文: 本研究揭示深度学习模型能通过表皮形态特征从病理切片中预测患者种族,凸显了人工智能在医疗领域需通过精细数据管理来规避偏见风险的必要性。
English: This study demonstrates that deep learning models can predict patient race from pathology slides with notable accuracy, revealing unintended bias through epidermal features and underscoring the need for improved data curation to ensure equitable AI in healthcare.
Authors:Viacheslav Pirogov, Maksim Artemev
Abstract:
Deepfakes powered by advanced machine learning models present a significant and evolving threat to identity verification and the authenticity of digital media. Although numerous detectors have been developed to address this problem, their effectiveness has yet to be tested when applied to real-world data. In this work we evaluate modern deepfake detectors, introducing a novel testing procedure designed to mimic real-world scenarios for deepfake detection. Using state-of-the-art deepfake generation methods, we create a comprehensive dataset containing more than 500,000 high-quality deepfake images. Our analysis shows that detecting deepfakes still remains a challenging task. The evaluation shows that in fewer than half of the deepfake detectors tested achieved an AUC score greater than 60%, with the lowest being 50%. We demonstrate that basic image manipulations, such as JPEG compression or image enhancement, can significantly reduce model performance. All code and data are publicly available at https://github.com/SumSubstance/Deepfake-Detectors-in-the-Wild.
中文: 现代深度伪造检测器在现实场景中表现不佳,仅不到半数检测器的AUC超过60%,且简单的图像处理会大幅降低其检测性能。
English: Modern deepfake detectors struggle in real-world scenarios, with fewer than half achieving over 60% AUC and basic image manipulations significantly degrading their performance.
Authors:Stefanos Gkikas, Ioannis Kyprakis, Manolis Tsiknakis
Abstract:
Pain is a complex and pervasive condition that affects a significant portion of the population. Accurate and consistent assessment is essential for individuals suffering from pain, as well as for developing effective management strategies in a healthcare system. Automatic pain assessment systems enable continuous monitoring, support clinical decision-making, and help minimize patient distress while mitigating the risk of functional deterioration. Leveraging physiological signals offers objective and precise insights into a person's state, and their integration in a multimodal framework can further enhance system performance. This study has been submitted to the Second Multimodal Sensing Grand Challenge for Next-Gen Pain Assessment (AI4PAIN). The proposed approach introduces Tiny-BioMoE, a lightweight pretrained embedding model for biosignal analysis. Trained on 4.4 million biosignal image representations and consisting of only 7.3 million parameters, it serves as an effective tool for extracting high-quality embeddings for downstream tasks. Extensive experiments involving electrodermal activity, blood volume pulse, respiratory signals, peripheral oxygen saturation, and their combinations highlight the model's effectiveness across diverse modalities in automatic pain recognition tasks. The model's architecture (code) and weights are available at https://github.com/GkikasStefanos/Tiny-BioMoE.
中文摘要:本研究提出Tiny-BioMoE轻量级预训练模型,通过多模态生物信号实验验证了其在自动疼痛识别任务中的有效性。
English Summary: This study introduces Tiny-BioMoE, a lightweight pretrained model for biosignal analysis that demonstrates strong performance in automatic pain recognition across multiple physiological signals through extensive experiments.
Authors:Raffaele Pojer, Andrea Passerini, Kim G. Larsen, Manfred Jaeger
Abstract:
Graph neural networks (GNNs) excel at predictive tasks on graph-structured data but often lack the ability to incorporate symbolic domain knowledge and perform general reasoning. Relational Bayesian Networks (RBNs), in contrast, enable fully generative probabilistic modeling over graph-like structures and support rich symbolic knowledge and probabilistic inference. This paper presents a neuro-symbolic framework that seamlessly integrates GNNs into RBNs, combining the learning strength of GNNs with the flexible reasoning capabilities of RBNs.
We develop two implementations of this integration: one compiles GNNs directly into the native RBN language, while the other maintains the GNN as an external component. Both approaches preserve the semantics and computational properties of GNNs while fully aligning with the RBN modeling paradigm. We also propose a maximum a-posteriori (MAP) inference method for these neuro-symbolic models.
To demonstrate the framework's versatility, we apply it to two distinct problems. First, we transform a GNN for node classification into a collective classification model that explicitly models homo- and heterophilic label patterns, substantially improving accuracy. Second, we introduce a multi-objective network optimization problem in environmental planning, where MAP inference supports complex decision-making. Both applications include new publicly available benchmark datasets.
This work introduces a powerful and coherent neuro-symbolic approach to graph data, bridging learning and reasoning in ways that enable novel applications and improved performance across diverse tasks.
Chinese: 本文提出了一种神经符号框架,将图神经网络(GNN)集成到关系贝叶斯网络(RBN)中,结合GNN的学习能力与RBN的概率推理优势,在多种图结构任务中实现了创新应用和性能提升。
English: This paper introduces a neuro-symbolic framework that integrates graph neural networks (GNNs) into relational Bayesian networks (RBNs), combining GNNs' learning capabilities with RBNs' probabilistic reasoning to enable novel applications and improved performance across diverse graph-based tasks.
Authors:Julia Wolleb, Florentin Bieder, Paul Friedrich, Hemant D. Tagare, Xenophon Papademetris
Abstract:
Ultrasound is widely used in clinical care, yet standard deep learning methods often struggle with full video analysis due to non-standardized acquisition and operator bias. We offer a new perspective on ultrasound video analysis through implicit neural representations (INRs). We build on Functa, an INR framework in which each image is represented by a modulation vector that conditions a shared neural network. However, its extension to the temporal domain of medical videos remains unexplored. To address this gap, we propose VidFuncta, a novel framework that leverages Functa to encode variable-length ultrasound videos into compact, time-resolved representations. VidFuncta disentangles each video into a static video-specific vector and a sequence of time-dependent modulation vectors, capturing both temporal dynamics and dataset-level redundancies. Our method outperforms 2D and 3D baselines on video reconstruction and enables downstream tasks to directly operate on the learned 1D modulation vectors. We validate VidFuncta on three public ultrasound video datasets -- cardiac, lung, and breast -- and evaluate its downstream performance on ejection fraction prediction, B-line detection, and breast lesion classification. These results highlight the potential of VidFuncta as a generalizable and efficient representation framework for ultrasound videos. Our code is publicly available under https://github.com/JuliaWolleb/VidFuncta_public.
中文: 本文提出VidFuncta框架,通过隐式神经表示将超声视频编码为紧凑的时间分辨向量,在视频重建任务中超越传统方法,并在心脏射血分数预测等多个下游任务中展现出优越性能。
English: This paper introduces VidFuncta, a novel framework using implicit neural representations to encode ultrasound videos into compact time-resolved vectors, outperforming traditional methods in reconstruction and enabling efficient downstream task performance across multiple clinical applications.
Authors:Jiahao He, Daerji Suolang, Keren Fu, Qijun Zhao
Abstract:
Applying salient object detection (SOD) to RGB-D videos is an emerging task called RGB-D VSOD and has recently gained increasing interest, due to considerable performance gains of incorporating motion and depth and that RGB-D videos can be easily captured now in daily life. Existing RGB-D VSOD models have different attempts to derive motion cues, in which extracting motion information explicitly from optical flow appears to be a more effective and promising alternative. Despite this, there remains a key issue that how to effectively utilize optical flow and depth to assist the RGB modality in SOD. Previous methods always treat optical flow and depth equally with respect to model designs, without explicitly considering their unequal contributions in individual scenarios, limiting the potential of motion and depth. To address this issue and unleash the power of motion and depth, we propose a novel selective cross-modal fusion framework (SMFNet) for RGB-D VSOD, incorporating a pixel-level selective fusion strategy (PSF) that achieves optimal fusion of optical flow and depth based on their actual contributions. Besides, we propose a multi-dimensional selective attention module (MSAM) to integrate the fused features derived from PSF with the remaining RGB modality at multiple dimensions, effectively enhancing feature representation to generate refined features. We conduct comprehensive evaluation of SMFNet against 19 state-of-the-art models on both RDVS and DVisal datasets, making the evaluation the most comprehensive RGB-D VSOD benchmark up to date, and it also demonstrates the superiority of SMFNet over other models. Meanwhile, evaluation on five video benchmark datasets incorporating synthetic depth validates the efficacy of SMFNet as well. Our code and benchmark results are made publicly available at https://github.com/Jia-hao999/SMFNet.
Chinese Summary: 该研究提出SMFNet这一选择性跨模态融合框架,通过像素级选择性融合策略动态整合光流与深度信息,在RGB-D视频显著目标检测任务中实现了超越现有模型的优越性能。
English Summary: The study introduces SMFNet, a selective cross-modal fusion framework for RGB-D video salient object detection that dynamically integrates optical flow and depth based on their actual contributions, achieving superior performance over existing models on comprehensive benchmarks.
Authors:Xingjian Zhang, Siwei Wen, Wenjun Wu, Lei Huang
Abstract:
Large Language Models (LLMs) have made remarkable progress in enhancing step-by-step reasoning through reinforcement learning. However, the Group Relative Policy Optimization (GRPO) algorithm, which relies on sparse reward rules, often encounters the issue of identical rewards within groups, leading to the advantage collapse problem. Existing works typically address this challenge from two perspectives: enforcing model reflection to enhance response diversity, and introducing internal feedback to augment the training signal (advantage). In this work, we begin by analyzing the limitations of model reflection and investigating the policy entropy of responses at the fine-grained sample level. Based on our experimental findings, we propose the EDGE-GRPO algorithm, which adopts \textbf{E}ntropy-\textbf{D}riven Advantage and \textbf{G}uided \textbf{E}rror Correction to effectively mitigate the problem of advantage collapse. Extensive experiments on several main reasoning benchmarks demonstrate the effectiveness and superiority of our approach. It is available at https://github.com/ZhangXJ199/EDGE-GRPO.
中文: EDGE-GRPO算法通过引入熵驱动的优势计算和引导式错误修正,有效解决了大语言模型推理中的优势崩溃问题,在多个基准测试中展现出卓越性能。
English: The EDGE-GRPO algorithm effectively mitigates the advantage collapse problem in LLM reasoning by incorporating entropy-driven advantage calculation and guided error correction, demonstrating superior performance across multiple benchmarks.
Authors:Yifan Wei, Xiaoyan Yu, Yixuan Weng, Tengfei Pan, Angsheng Li, Li Du
Abstract:
Large Language Models (LLMs), when enhanced through reasoning-oriented post-training, evolve into powerful Large Reasoning Models (LRMs). Tool-Integrated Reasoning (TIR) further extends their capabilities by incorporating external tools, but existing methods often rely on rigid, predefined tool-use patterns that risk degrading core language competence. Inspired by the human ability to adaptively select tools, we introduce AutoTIR, a reinforcement learning framework that enables LLMs to autonomously decide whether and which tool to invoke during the reasoning process, rather than following static tool-use strategies. AutoTIR leverages a hybrid reward mechanism that jointly optimizes for task-specific answer correctness, structured output adherence, and penalization of incorrect tool usage, thereby encouraging both precise reasoning and efficient tool integration. Extensive evaluations across diverse knowledge-intensive, mathematical, and general language modeling tasks demonstrate that AutoTIR achieves superior overall performance, significantly outperforming baselines and exhibits superior generalization in tool-use behavior. These results highlight the promise of reinforcement learning in building truly generalizable and scalable TIR capabilities in LLMs. The code and data are available at https://github.com/weiyifan1023/AutoTIR.
中文摘要:AutoTIR是一个强化学习框架,通过混合奖励机制联合优化任务答案正确性和工具使用效率,使大语言模型能够在推理过程中自主选择调用外部工具,在多项任务中展现出卓越性能。
English Summary: AutoTIR is a reinforcement learning framework that enables Large Language Models to autonomously select and utilize external tools during reasoning, achieving superior performance across various tasks through a hybrid reward mechanism optimizing correctness and efficiency.
Authors:Junzhe Li, Yutao Cui, Tao Huang, Yinping Ma, Chun Fan, Miles Yang, Zhao Zhong
Abstract:
Although GRPO substantially enhances flow matching models in human preference alignment of image generation, methods such as FlowGRPO still exhibit inefficiency due to the necessity of sampling and optimizing over all denoising steps specified by the Markov Decision Process (MDP). In this paper, we propose $\textbf{MixGRPO}$, a novel framework that leverages the flexibility of mixed sampling strategies through the integration of stochastic differential equations (SDE) and ordinary differential equations (ODE). This streamlines the optimization process within the MDP to improve efficiency and boost performance. Specifically, MixGRPO introduces a sliding window mechanism, using SDE sampling and GRPO-guided optimization only within the window, while applying ODE sampling outside. This design confines sampling randomness to the time-steps within the window, thereby reducing the optimization overhead, and allowing for more focused gradient updates to accelerate convergence. Additionally, as time-steps beyond the sliding window are not involved in optimization, higher-order solvers are supported for sampling. So we present a faster variant, termed $\textbf{MixGRPO-Flash}$, which further improves training efficiency while achieving comparable performance. MixGRPO exhibits substantial gains across multiple dimensions of human preference alignment, outperforming DanceGRPO in both effectiveness and efficiency, with nearly 50% lower training time. Notably, MixGRPO-Flash further reduces training time by 71%. Codes and models are available at $\href{https://github.com/Tencent-Hunyuan/MixGRPO}{MixGRPO}$.
中文: MixGRPO采用滑动窗口混合采样策略,结合SDE和ODE仅优化部分去噪步骤,在图像生成的人类偏好对齐中显著提升了训练效率和性能。
English: MixGRPO introduces a mixed sampling strategy with a sliding window that combines SDE and ODE to optimize only specific denoising steps, significantly improving training efficiency and performance in human preference alignment for image generation.
Authors:Junzhe Li, Yutao Cui, Tao Huang, Yinping Ma, Chun Fan, Miles Yang, Zhao Zhong
Abstract:
Although GRPO substantially enhances flow matching models in human preference alignment of image generation, methods such as FlowGRPO still exhibit inefficiency due to the necessity of sampling and optimizing over all denoising steps specified by the Markov Decision Process (MDP). In this paper, we propose $\textbf{MixGRPO}$, a novel framework that leverages the flexibility of mixed sampling strategies through the integration of stochastic differential equations (SDE) and ordinary differential equations (ODE). This streamlines the optimization process within the MDP to improve efficiency and boost performance. Specifically, MixGRPO introduces a sliding window mechanism, using SDE sampling and GRPO-guided optimization only within the window, while applying ODE sampling outside. This design confines sampling randomness to the time-steps within the window, thereby reducing the optimization overhead, and allowing for more focused gradient updates to accelerate convergence. Additionally, as time-steps beyond the sliding window are not involved in optimization, higher-order solvers are supported for sampling. So we present a faster variant, termed $\textbf{MixGRPO-Flash}$, which further improves training efficiency while achieving comparable performance. MixGRPO exhibits substantial gains across multiple dimensions of human preference alignment, outperforming DanceGRPO in both effectiveness and efficiency, with nearly 50% lower training time. Notably, MixGRPO-Flash further reduces training time by 71%. Codes and models are available at $\href{https://github.com/Tencent-Hunyuan/MixGRPO}{MixGRPO}$.
中文: MixGRPO采用滑动窗口混合采样策略,结合SDE和ODE仅优化部分去噪步骤,在图像生成的人类偏好对齐中显著提升了训练效率和性能。
English: MixGRPO introduces a mixed sampling strategy with a sliding window that combines SDE and ODE to optimize only specific denoising steps, significantly improving training efficiency and performance in human preference alignment for image generation.
Authors:Xie Zhang, Yina Wang, Chenshu Wu
Abstract:
The empirical success of deep learning has spurred its application to the radio-frequency (RF) domain, leading to significant advances in Deep Wireless Sensing (DWS). However, most existing DWS models function as black boxes with limited interpretability, which hampers their generalizability and raises concerns in security-sensitive physical applications. In this work, inspired by the remarkable advances of white-box transformers, we present RF-CRATE, the first mathematically interpretable deep network architecture for RF sensing, grounded in the principles of complex sparse rate reduction. To accommodate the unique RF signals, we conduct non-trivial theoretical derivations that extend the original real-valued white-box transformer to the complex domain. By leveraging the CR-Calculus framework, we successfully construct a fully complex-valued white-box transformer with theoretically derived self-attention and residual multi-layer perceptron modules. Furthermore, to improve the model's ability to extract discriminative features from limited wireless data, we introduce Subspace Regularization, a novel regularization strategy that enhances feature diversity, resulting in an average performance improvement of 19.98% across multiple sensing tasks. We extensively evaluate RF-CRATE against seven baselines with multiple public and self-collected datasets involving different RF signals. The results show that RF-CRATE achieves performance on par with thoroughly engineered black-box models, while offering full mathematical interpretability. More importantly, by extending CRATE to the complex domain, RF-CRATE yields substantial improvements, achieving an average classification gain of 5.08% and reducing regression error by 10.34% across diverse sensing tasks compared to CRATE. RF-CRATE is fully open-sourced at: https://github.com/rfcrate/RF_CRATE.
中文: 本文提出了首个数学可解释的射频传感深度网络RF-CRATE,通过将白盒Transformer扩展至复数域,在保持与黑盒模型相当性能的同时实现了完全可解释性,并借助子空间正则化显著提升了特征提取能力。
English: This paper introduces RF-CRATE, the first mathematically interpretable deep network for RF sensing that extends white-box transformers to the complex domain, achieving performance comparable to black-box models while offering full interpretability and improved feature extraction through subspace regularization.
Authors:Zhaolong Wang, Tongfeng Sun, Mingzheng Du, Yachao Huang
Abstract:
Vision-language pre-trained models (VLMs) such as CLIP have demonstrated remarkable zero-shot generalization, and prompt learning has emerged as an efficient alternative to full fine-tuning. However, existing methods often struggle with generalization to novel classes, a phenomenon attributed to overfitting on seen classes and forgetting general knowledge. Furthermore, recent approaches that improve generalization often introduce complex architectures or heavy computational overhead. In this paper, we propose a Multiple Semantic-Guided Context Optimization (MSGCoOp) framework to enhance few-shot generalization while maintaining computational efficiency. Our approach leverages an ensemble of parallel learnable context vectors to capture diverse semantic aspects. To enrich these prompts, we introduce a semantic guidance mechanism that aligns them with comprehensive class descriptions automatically generated by a Large Language Model (LLM). Furthermore, a diversity regularization loss encourages the prompts to learn complementary and orthogonal features, preventing them from collapsing into redundant representations. Extensive experiments on 11 benchmark datasets show that MSGCoOp significantly improves performance on base-to-novel generalization, achieving an average harmonic mean improvement of 1.10\% over the strong KgCoOp baseline. Our method also demonstrates enhanced robustness in cross-domain generalization tasks. Our code is avaliable at: \href{https://github.com/Rain-Bus/MSGCoOp}{https://github.com/Rain-Bus/MSGCoOp}.
Chinese: MSGCoOp框架通过采用大语言模型生成的类别描述引导的并行可学习上下文向量和多样性正则化,提升了视觉语言模型的少样本泛化能力,在基础到新类别及跨领域任务中表现优异,同时保持了计算效率。
English: The MSGCoOp framework enhances few-shot generalization in vision-language models by using parallel learnable context vectors guided by LLM-generated class descriptions and diversity regularization, achieving improved performance on base-to-novel and cross-domain tasks while maintaining computational efficiency.
Authors:Zhishu Liu, Kaishen Yuan, Bo Zhao, Yong Xu, Zitong Yu
Abstract:
The detection of micro-expression Action Units (AUs) is a formidable challenge in affective computing, pivotal for decoding subtle, involuntary human emotions. While Large Language Models (LLMs) demonstrate profound reasoning abilities, their application to the fine-grained, low-intensity domain of micro-expression AU detection remains unexplored. This paper pioneers this direction by introducing \textbf{AU-LLM}, a novel framework that for the first time uses LLM to detect AUs in micro-expression datasets with subtle intensities and the scarcity of data. We specifically address the critical vision-language semantic gap, the \textbf{Enhanced Fusion Projector (EFP)}. The EFP employs a Multi-Layer Perceptron (MLP) to intelligently fuse mid-level (local texture) and high-level (global semantics) visual features from a specialized 3D-CNN backbone into a single, information-dense token. This compact representation effectively empowers the LLM to perform nuanced reasoning over subtle facial muscle movements.Through extensive evaluations on the benchmark CASME II and SAMM datasets, including stringent Leave-One-Subject-Out (LOSO) and cross-domain protocols, AU-LLM establishes a new state-of-the-art, validating the significant potential and robustness of LLM-based reasoning for micro-expression analysis. The codes are available at https://github.com/ZS-liu-JLU/AU-LLMs.
中文摘要:本文首次提出AU-LLM框架,通过增强融合投影器整合视觉特征,成功将大语言模型应用于微表情动作单元检测,在基准数据集上实现了最先进的性能,验证了大语言模型在细微面部肌肉运动分析中的强大潜力。
English Summary: This paper introduces AU-LLM, a pioneering framework that leverages Large Language Models for micro-expression Action Unit detection, achieving state-of-the-art performance through enhanced visual feature fusion and demonstrating robust reasoning capabilities on subtle facial movements.
Authors:Lian Yan, Haotian Wang, Chen Tang, Haifeng Liu, Tianyang Sun, Liangliang Liu, Yi Guan, Jingchi Jiang
Abstract:
In the agricultural domain, the deployment of large language models (LLMs) is hindered by the lack of training data and evaluation benchmarks. To mitigate this issue, we propose AgriEval, the first comprehensive Chinese agricultural benchmark with three main characteristics: (1) Comprehensive Capability Evaluation. AgriEval covers six major agriculture categories and 29 subcategories within agriculture, addressing four core cognitive scenarios: memorization, understanding, inference, and generation. (2) High-Quality Data. The dataset is curated from university-level examinations and assignments, providing a natural and robust benchmark for assessing the capacity of LLMs to apply knowledge and make expert-like decisions. (3) Diverse Formats and Extensive Scale. AgriEval comprises 14,697 multiple-choice questions and 2,167 open-ended question-and-answer questions, establishing it as the most extensive agricultural benchmark available to date. We also present comprehensive experimental results over 51 open-source and commercial LLMs. The experimental results reveal that most existing LLMs struggle to achieve 60% accuracy, underscoring the developmental potential in agricultural LLMs. Additionally, we conduct extensive experiments to investigate factors influencing model performance and propose strategies for enhancement. AgriEval is available at https://github.com/YanPioneer/AgriEval/.
中文摘要:AgriEval作为首个综合性中文农业基准,通过涵盖多类认知场景的高质量数据揭示了现有大语言模型在农业领域表现不足,准确率普遍低于60%,展现了该方向的巨大发展潜力。
English Summary: The AgriEval benchmark addresses the scarcity of agricultural data for LLMs by offering a comprehensive Chinese dataset with diverse question formats, revealing through testing that most models fall short of 60% accuracy and highlighting significant room for improvement in this field.
Authors:Aybora Koksal, A. Aydin Alatan
Abstract:
Recent advances in large language and vision-language models have enabled strong reasoning capabilities, yet they remain impractical for specialized domains like remote sensing, where annotated data is scarce and expensive. We present the first few-shot reinforcement learning with verifiable reward (RLVR) framework for satellite imagery that eliminates the need for caption supervision--relying solely on lightweight, rule-based binary or IoU-based rewards. Adapting the "1-shot RLVR" paradigm from language models to vision-language models, we employ policy-gradient optimization with as few as one curated example to align model outputs for satellite reasoning tasks. Comprehensive experiments across multiple remote sensing benchmarks--including classification, visual question answering, and grounding--show that even a single example yields substantial improvements over the base model. Scaling to 128 examples matches or exceeds models trained on thousands of annotated samples. While the extreme one-shot setting can induce mild, task-specific overfitting, our approach consistently demonstrates robust generalization and efficiency across diverse tasks. Further, we find that prompt design and loss weighting significantly influence training stability and final accuracy. Our method enables cost-effective and data-efficient development of domain-specialist vision-language reasoning models, offering a pragmatic recipe for data-scarce fields: start from a compact VLM, curate a handful of reward-checkable cases, and train via RLVR.
Chinese Summary: 该研究提出了一种用于卫星图像的带可验证奖励的少样本强化学习框架(RLVR),在遥感任务中仅需极少标注数据即可实现强大性能,仅用128个样本就能媲美基于数千样本训练的模型。
English Summary: The study introduces a few-shot reinforcement learning framework with verifiable rewards (RLVR) for satellite imagery, which achieves strong performance in remote sensing tasks using minimal annotated data, even matching models trained on thousands of samples with just 128 examples.
Authors:Shaojun E, Yuchen Yang, Jiaheng Wu, Yan Zhang, Tiejun Zhao, Ziyan Chen
Abstract:
In the latest advancements in multimodal learning, effectively addressing the spatial and semantic losses of visual data after encoding remains a critical challenge. This is because the performance of large multimodal models is positively correlated with the coupling between visual encoders and large language models. Existing approaches often face issues such as vector gaps or semantic disparities, resulting in information loss during the propagation process. To address these issues, we propose MAGE (Multimodal Alignment and Generation Enhancement), a novel framework that bridges the semantic spaces of vision and text through an innovative alignment mechanism. By introducing the Intelligent Alignment Network (IAN), MAGE achieves dimensional and semantic alignment. To reduce the gap between synonymous heterogeneous data, we employ a training strategy that combines cross-entropy and mean squared error, significantly enhancing the alignment effect. Moreover, to enhance MAGE's "Any-to-Any" capability, we developed a fine-tuning dataset for multimodal tool-calling instructions to expand the model's output capability boundaries. Finally, our proposed multimodal large model architecture, MAGE, achieved significantly better performance compared to similar works across various evaluation benchmarks, including MME, MMBench, and SEED. Complete code and appendix are available at: https://github.com/GTCOM-NLP/MAGE.
中文摘要:提出的MAGE框架通过智能对齐网络和混合训练策略,解决了多模态学习中视觉编码后的空间与语义损失问题,在多项评测基准中表现优异。
English Summary: The proposed MAGE framework addresses spatial and semantic losses in multimodal learning by introducing an Intelligent Alignment Network and a combined training strategy, achieving superior performance on multiple benchmarks.
Authors:Qianxiong Xu, Lanyun Zhu, Chenxi Liu, Guosheng Lin, Cheng Long, Ziyue Li, Rui Zhao
Abstract:
Visual Object Tracking (VOT) is widely used in applications like autonomous driving to continuously track targets in videos. Existing methods can be roughly categorized into template matching and autoregressive methods, where the former usually neglects the temporal dependencies across frames and the latter tends to get biased towards the object categories during training, showing weak generalizability to unseen classes. To address these issues, some methods propose to adapt the video foundation model SAM2 for VOT, where the tracking results of each frame would be encoded as memory for conditioning the rest of frames in an autoregressive manner. Nevertheless, existing methods fail to overcome the challenges of object occlusions and distractions, and do not have any measures to intercept the propagation of tracking errors. To tackle them, we present a SAMITE model, built upon SAM2 with additional modules, including: (1) Prototypical Memory Bank: We propose to quantify the feature-wise and position-wise correctness of each frame's tracking results, and select the best frames to condition subsequent frames. As the features of occluded and distracting objects are feature-wise and position-wise inaccurate, their scores would naturally be lower and thus can be filtered to intercept error propagation; (2) Positional Prompt Generator: To further reduce the impacts of distractors, we propose to generate positional mask prompts to provide explicit positional clues for the target, leading to more accurate tracking. Extensive experiments have been conducted on six benchmarks, showing the superiority of SAMITE. The code is available at https://github.com/Sam1224/SAMITE.
Chinese: SAMITE模型在SAM2视频基础模型上增加了原型记忆库来筛选准确帧和位置提示生成器以减少干扰,从而在多个基准测试中实现了卓越的视觉目标跟踪性能。
English: The SAMITE model enhances the SAM2 video foundation model for visual object tracking by introducing a Prototypical Memory Bank to filter out inaccurate frames and a Positional Prompt Generator to reduce distractions, achieving superior performance on multiple benchmarks.
Authors:Raj Vardhan Tomar, Preslav Nakov, Yuxia Wang
Abstract:
As large reasoning models (LRMs) grow more capable, chain-of-thought (CoT) reasoning introduces new safety challenges. Existing SFT-based safety alignment studies dominantly focused on filtering prompts with safe, high-quality responses, while overlooking hard prompts that always elicit harmful outputs. To fill this gap, we introduce UnsafeChain, a safety alignment dataset constructed from hard prompts with diverse sources, where unsafe completions are identified and explicitly corrected into safe responses. By exposing models to unsafe behaviors and guiding their correction, UnsafeChain enhances safety while preserving general reasoning ability. We fine-tune three LRMs on UnsafeChain and compare them against recent SafeChain and STAR-1 across six out-of-distribution and five in-distribution benchmarks. UnsafeChain consistently outperforms prior datasets, with even a 1K subset matching or surpassing baseline performance, demonstrating the effectiveness and generalizability of correction-based supervision. We release our dataset and code at https://github.com/mbzuai-nlp/UnsafeChain
中文: UnsafeChain通过构建包含困难提示的安全对齐数据集,训练模型识别并修正有害输出,在多项基准测试中既提升了安全性又保持了推理能力。
English: UnsafeChain addresses safety gaps in large reasoning models by using hard prompts to train models to correct unsafe responses, enhancing safety without compromising reasoning ability across multiple benchmarks.
Authors:Raiyan R. Khan, Philippe Chlenski, Itsik Pe'er
Abstract:
Current approaches to genomic sequence modeling often struggle to align the inductive biases of machine learning models with the evolutionarily-informed structure of biological systems. To this end, we formulate a novel application of hyperbolic CNNs that exploits this structure, enabling more expressive DNA sequence representations. Our strategy circumvents the need for explicit phylogenetic mapping while discerning key properties of sequences pertaining to core functional and regulatory behavior. Across 37 out of 42 genome interpretation benchmark datasets, our hyperbolic models outperform their Euclidean equivalents. Notably, our approach even surpasses state-of-the-art performance on seven GUE benchmark datasets, consistently outperforming many DNA language models while using orders of magnitude fewer parameters and avoiding pretraining. Our results include a novel set of benchmark datasets--the Transposable Elements Benchmark--which explores a major but understudied component of the genome with deep evolutionary significance. We further motivate our work by exploring how our hyperbolic models recognize genomic signal under various data-generating conditions and by constructing an empirical method for interpreting the hyperbolicity of dataset embeddings. Throughout these assessments, we find persistent evidence highlighting the potential of our hyperbolic framework as a robust paradigm for genome representation learning. Our code and benchmark datasets are available at https://github.com/rrkhan/HGE.
Chinese: 本研究提出了一种利用双曲CNN进行基因组序列建模的新方法,该方法在多个基准测试中优于欧几里得模型和最先进的DNA语言模型,且参数更少、无需预训练,展示了其作为稳健基因组表示学习范式的潜力。
English: This study introduces hyperbolic CNNs as a novel approach to genomic sequence modeling, which outperforms Euclidean models and state-of-the-art DNA language models on multiple benchmarks while using fewer parameters and no pretraining, demonstrating its potential for robust genome representation learning.
Authors:Leonard Hinckeldey, Elliot Fosong, Elle Miller, Rimvydas Rubavicius, Trevor McInroe, Patricia Wollstadt, Christiane B. Wiebel-Herboth, Subramanian Ramamoorthy, Stefano V. Albrecht
Abstract:
The development of reinforcement learning (RL) algorithms has been largely driven by ambitious challenge tasks and benchmarks. Games have dominated RL benchmarks because they present relevant challenges, are inexpensive to run and easy to understand. While games such as Go and Atari have led to many breakthroughs, they often do not directly translate to real-world embodied applications. In recognising the need to diversify RL benchmarks and addressing complexities that arise in embodied interaction scenarios, we introduce Assistax: an open-source benchmark designed to address challenges arising in assistive robotics tasks. Assistax uses JAX's hardware acceleration for significant speed-ups for learning in physics-based simulations. In terms of open-loop wall-clock time, Assistax runs up to $370\times$ faster when vectorising training runs compared to CPU-based alternatives. Assistax conceptualises the interaction between an assistive robot and an active human patient using multi-agent RL to train a population of diverse partner agents against which an embodied robotic agent's zero-shot coordination capabilities can be tested. Extensive evaluation and hyperparameter tuning for popular continuous control RL and MARL algorithms provide reliable baselines and establish Assistax as a practical benchmark for advancing RL research for assistive robotics. The code is available at: https://github.com/assistive-autonomy/assistax.
中文: Assistax是一个基于JAX硬件加速的开源基准测试平台,通过多智能体强化学习模拟辅助机器人与人类互动,其训练速度比CPU方案快370倍,旨在推动辅助机器人领域的强化学习研究。
English: Assistax is a new open-source benchmark using JAX-accelerated physics simulations to advance reinforcement learning for assistive robotics, featuring multi-agent training and 370× faster performance than CPU alternatives.
Authors:Yaozong Zheng, Bineng Zhong, Qihua Liang, Ning Li, Shuxiang Song
Abstract:
The success of visual tracking has been largely driven by datasets with manual box annotations. However, these box annotations require tremendous human effort, limiting the scale and diversity of existing tracking datasets. In this work, we present a novel Self-Supervised Tracking framework named \textbf{\tracker}, designed to eliminate the need of box annotations. Specifically, a decoupled spatio-temporal consistency training framework is proposed to learn rich target information across timestamps through global spatial localization and local temporal association. This allows for the simulation of appearance and motion variations of instances in real-world scenarios. Furthermore, an instance contrastive loss is designed to learn instance-level correspondences from a multi-view perspective, offering robust instance supervision without additional labels. This new design paradigm enables {\tracker} to effectively learn generic tracking representations in a self-supervised manner, while reducing reliance on extensive box annotations. Extensive experiments on nine benchmark datasets demonstrate that {\tracker} surpasses \textit{SOTA} self-supervised tracking methods, achieving an improvement of more than 25.3\%, 20.4\%, and 14.8\% in AUC (AO) score on the GOT10K, LaSOT, TrackingNet datasets, respectively. Code: https://github.com/GXNU-ZhongLab/SSTrack.
中文摘要:自监督跟踪框架SSTrack通过解耦的时空一致性训练方法和实例对比损失,无需人工框标注即可学习通用跟踪表示,在多个基准数据集上显著超越了现有最优方法。
English Summary: The Self-Supervised Tracking framework (SSTrack) eliminates the need for manual box annotations by employing a decoupled spatio-temporal consistency training method and instance contrastive loss, achieving state-of-the-art performance across multiple tracking benchmarks.
Authors:Jiong Yin, Liang Li, Jiehua Zhang, Yuhan Gao, Chenggang Yan, Xichun Sheng
Abstract:
Audio-visual multi-task incremental learning aims to continuously learn from multiple audio-visual tasks without the need for joint training on all tasks. The challenge of the problem is how to preserve the old task knowledge while facilitating the learning of new task with previous experiences. To address these challenges, we introduce a three-stage Progressive Homeostatic and Plastic audio-visual prompt (PHP) method. In the shallow phase, we design the task-shared modality aggregating adapter to foster cross-task and cross-modal audio-visual representation learning to enhance shared understanding between tasks. In the middle phase, we propose the task-specific modality-shared dynamic generating adapter, which constructs prompts that are tailored to individual tasks while remaining general across modalities, which balances the models ability to retain knowledge against forgetting with its potential for versatile multi-task transferability. In the deep phase, we introduce the task-specific modality-independent prompts to further refine the understand ability by targeting individual information for each task and modality. By incorporating these three phases, PHP retains task-specific prompts while adapting shared parameters for new tasks to effectively balance knowledge sharing and specificity. Our method achieves SOTA performance in different orders of four tasks (AVE, AVVP, AVS and AVQA). Our code can be available at https://github.com/ENJOY-Yin-jiong/PHP.
中文: 提出的渐进式稳态与可塑性(PHP)方法通过三个阶段的提示机制,在视听多任务增量学习中平衡知识保留与迁移,实现了最先进的性能。
English: The proposed Progressive Homeostatic and Plastic (PHP) method enables effective audio-visual multi-task incremental learning by balancing knowledge retention and transfer across tasks through three specialized prompt phases, achieving state-of-the-art performance.
Authors:Hao Ye, Mengshi Qi, Zhaohong Liu, Liang Liu, Huadong Ma
Abstract:
In this work, we study how vision-language models (VLMs) can be utilized to enhance the safety for the autonomous driving system, including perception, situational understanding, and path planning. However, existing research has largely overlooked the evaluation of these models in traffic safety-critical driving scenarios. To bridge this gap, we create the benchmark (SafeDrive228K) and propose a new baseline based on VLM with knowledge graph-based retrieval-augmented generation (SafeDriveRAG) for visual question answering (VQA). Specifically, we introduce SafeDrive228K, the first large-scale multimodal question-answering benchmark comprising 228K examples across 18 sub-tasks. This benchmark encompasses a diverse range of traffic safety queries, from traffic accidents and corner cases to common safety knowledge, enabling a thorough assessment of the comprehension and reasoning abilities of the models. Furthermore, we propose a plug-and-play multimodal knowledge graph-based retrieval-augmented generation approach that employs a novel multi-scale subgraph retrieval algorithm for efficient information retrieval. By incorporating traffic safety guidelines collected from the Internet, this framework further enhances the model's capacity to handle safety-critical situations. Finally, we conduct comprehensive evaluations on five mainstream VLMs to assess their reliability in safety-sensitive driving tasks. Experimental results demonstrate that integrating RAG significantly improves performance, achieving a +4.73% gain in Traffic Accidents tasks, +8.79% in Corner Cases tasks and +14.57% in Traffic Safety Commonsense across five mainstream VLMs, underscoring the potential of our proposed benchmark and methodology for advancing research in traffic safety. Our source code and data are available at https://github.com/Lumos0507/SafeDriveRAG.
中文: 本研究提出了首个用于评估视觉语言模型在交通安全关键场景下性能的大规模多模态基准SafeDrive228K,并开发了基于知识图谱的检索增强生成方法SafeDriveRAG,该方法显著提升了模型在多种安全任务中的表现。
English: This study introduces SafeDrive228K, the first large-scale multimodal benchmark for evaluating vision-language models in traffic safety-critical scenarios, and proposes SafeDriveRAG, a knowledge graph-based retrieval-augmented generation method that significantly enhances model performance across various safety tasks.
Authors:Yanxu Zhu, Shitong Duan, Xiangxu Zhang, Jitao Sang, Peng Zhang, Tun Lu, Xiao Zhou, Jing Yao, Xiaoyuan Yi, Xing Xie
Abstract:
Recently Multimodal Large Language Models (MLLMs) have achieved considerable advancements in vision-language tasks, yet produce potentially harmful or untrustworthy content. Despite substantial work investigating the trustworthiness of language models, MMLMs' capability to act honestly, especially when faced with visually unanswerable questions, remains largely underexplored. This work presents the first systematic assessment of honesty behaviors across various MLLMs. We ground honesty in models' response behaviors to unanswerable visual questions, define four representative types of such questions, and construct MoHoBench, a large-scale MMLM honest benchmark, consisting of 12k+ visual question samples, whose quality is guaranteed by multi-stage filtering and human verification. Using MoHoBench, we benchmarked the honesty of 28 popular MMLMs and conducted a comprehensive analysis. Our findings show that: (1) most models fail to appropriately refuse to answer when necessary, and (2) MMLMs' honesty is not solely a language modeling issue, but is deeply influenced by visual information, necessitating the development of dedicated methods for multimodal honesty alignment. Therefore, we implemented initial alignment methods using supervised and preference learning to improve honesty behavior, providing a foundation for future work on trustworthy MLLMs. Our data and code can be found at https://github.com/DSTTSD/MoHoBench.
中文: 本研究首次提出MoHoBench基准,系统评估多模态大语言模型在面对视觉不可答问题时的诚实性,发现多数模型无法恰当拒答且视觉信息显著影响诚实表现,进而提出了改进的对齐方法。
English: This study introduces MoHoBench, the first benchmark to systematically evaluate the honesty of Multimodal Large Language Models (MLLMs) when faced with unanswerable visual questions, revealing that most models fail to refuse appropriately and that visual information significantly impacts honesty, leading to proposed alignment methods for improvement.
Authors:Wenxuan Bao, Ruxi Deng, Ruizhong Qiu, Tianxin Wei, Hanghang Tong, Jingrui He
Abstract:
Test-time adaptation with pre-trained vision-language models has gained increasing attention for addressing distribution shifts during testing. Among these approaches, memory-based algorithms stand out due to their training-free nature and ability to leverage historical test data. However, existing test-time adaptation methods are typically designed for a single domain with abundant data. In decentralized settings such as federated learning, applying these methods individually to each client suffers from limited test data, while directly sharing a single global memory via the server prevents proper personalization to each client's unique distribution. To address this, we propose Latte, a novel framework where each client maintains a local memory to store embeddings from its own historical test data and an external memory to store class prototypes from other relevant clients. During communication, each client retrieves prototypes from similar clients under the server's coordination to expand its memory. For local adaptation, Latte utilizes both embedding similarity and uncertainty to enhance model performance. Our theoretical analysis shows that Latte effectively leverages in-distribution clients while remaining robust to out-of-distribution clients. Extensive experiments on domain adaptation and corruption benchmarks validate that Latte achieves superior performance in decentralized settings, while introducing only negligible communication and computation costs. Our code is available at https://github.com/baowenxuan/Latte .
中文: 提出的Latte框架通过让客户端维护本地和外部记忆库实现分散式测试时自适应,在保持低通信和计算成本的同时显著提升模型性能。
English: The proposed Latte framework enables decentralized test-time adaptation by allowing clients to maintain local and external memories for personalized model updates, achieving superior performance with minimal communication and computation costs.
Authors:Zhichuan Wang, Yang Zhou, Zhe Liu, Rui Yu, Song Bai, Yulong Wang, Xinwei He, Xiang Bai
Abstract:
Open-set 3D object retrieval (3DOR) is an emerging task aiming to retrieve 3D objects of unseen categories beyond the training set. Existing methods typically utilize all modalities (i.e., voxels, point clouds, multi-view images) and train specific backbones before fusion. However, they still struggle to produce generalized representations due to insufficient 3D training data. Being contrastively pre-trained on web-scale image-text pairs, CLIP inherently produces generalized representations for a wide range of downstream tasks. Building upon it, we present a simple yet effective framework named Describe, Adapt and Combine (DAC) by taking only multi-view images for open-set 3DOR. DAC innovatively synergizes a CLIP model with a multi-modal large language model (MLLM) to learn generalized 3D representations, where the MLLM is used for dual purposes. First, it describes the seen category information to align with CLIP's training objective for adaptation during training. Second, it provides external hints about unknown objects complementary to visual cues during inference. To improve the synergy, we introduce an Additive-Bias Low-Rank adaptation (AB-LoRA), which alleviates overfitting and further enhances the generalization to unseen categories. With only multi-view images, DAC significantly surpasses prior arts by an average of +10.01\% mAP on four open-set 3DOR datasets. Moreover, its generalization is also validated on image-based and cross-dataset setups. Code is available at https://github.com/wangzhichuan123/DAC.
中文摘要:DAC框架通过结合CLIP与多模态大语言模型,仅使用多视角图像学习泛化性强的三维表征,在开放集三维物体检索任务中大幅超越现有方法,在四个数据集上平均提升10.01% mAP。
English Summary: The DAC framework leverages CLIP and a multi-modal large language model to create generalized 3D representations using only multi-view images, achieving state-of-the-art performance in open-set 3D object retrieval by significantly outperforming previous methods across multiple datasets.
Authors:Marco Mambelli, Shrijan Swaminathan
Abstract:
Choosing the right resource can speed up job completion, better utilize the available hardware, and visibly reduce costs, especially when renting computers in the cloud. This was demonstrated in earlier studies on HEPCloud. However, the benchmarking of the resources proved to be a laborious and time-consuming process. This paper presents GlideinBenchmark, a new Web application leveraging the pilot infrastructure of GlideinWMS to benchmark resources, and it shows how to use the data collected and published by GlideinBenchmark to automate the optimal selection of resources. An experiment can select the benchmark or the set of benchmarks that most closely evaluate the performance of its workflows. GlideinBenchmark, with the help of the GlideinWMS Factory, controls the benchmark execution. Finally, a scheduler like HEPCloud's Decision Engine can use the results to optimize resource provisioning.
中文: GlideinBenchmark是一个利用GlideinWMS基础设施自动进行资源基准测试的Web应用程序,通过优化资源配置实现作业加速和云计算成本节约。
English: GlideinBenchmark is a web application that uses GlideinWMS infrastructure to automate resource benchmarking, enabling optimized resource selection for faster job completion and cost reduction in cloud computing.
Authors:Marco Mambelli, Bruno Moreira Coimbra, Namratha Urs, Ilya Baburashvili
Abstract:
GlideinWMS is a workload manager provisioning resources for many experiments, including CMS and DUNE. The software is distributed both as native packages and specialized production containers. Following an approach used in other communities like web development, we built our workspaces, system-like containers to ease development and testing. Developers can change the source tree or check out a different branch and quickly reconfigure the services to see the effect of their changes. In this paper, we will talk about what differentiates workspaces from other containers. We will describe our base system, composed of three containers: a one-node cluster including a compute element and a batch system, a GlideinWMS Factory controlling pilot jobs, and a scheduler and Frontend to submit jobs and provision resources. Additional containers can be used for optional components. This system can easily run on a laptop, and we will share our evaluation of different container runtimes, with an eye for ease of use and performance. Finally, we will talk about our experience as developers and with students. The GlideinWMS workspaces are easily integrated with IDEs like VS Code, simplifying debugging and allowing development and testing of the system even when offline. They simplified the training and onboarding of new team members and summer interns. And they were useful in workshops where students could have first-hand experience with the mechanisms and components that, in production, run millions of jobs.
中文: GlideinWMS工作空间是专门的系统容器,通过快速重新配置和与IDE集成简化了开发、测试及培训,同时借助模块化容器架构支持可扩展的作业管理。
English: GlideinWMS workspaces are specialized system containers that simplify development, testing, and training by enabling quick reconfiguration and integration with IDEs, while supporting scalable job management through a modular container architecture.
Authors:Wen Huang, Yanmei Gu, Zhiming Wang, Huijia Zhu, Yanmin Qian
Abstract:
As speech generation technology advances, the risk of misuse through deepfake audio has become a pressing concern, which underscores the critical need for robust detection systems. However, many existing speech deepfake datasets are limited in scale and diversity, making it challenging to train models that can generalize well to unseen deepfakes. To address these gaps, we introduce SpeechFake, a large-scale dataset designed specifically for speech deepfake detection. SpeechFake includes over 3 million deepfake samples, totaling more than 3,000 hours of audio, generated using 40 different speech synthesis tools. The dataset encompasses a wide range of generation techniques, including text-to-speech, voice conversion, and neural vocoder, incorporating the latest cutting-edge methods. It also provides multilingual support, spanning 46 languages. In this paper, we offer a detailed overview of the dataset's creation, composition, and statistics. We also present baseline results by training detection models on SpeechFake, demonstrating strong performance on both its own test sets and various unseen test sets. Additionally, we conduct experiments to rigorously explore how generation methods, language diversity, and speaker variation affect detection performance. We believe SpeechFake will be a valuable resource for advancing speech deepfake detection and developing more robust models for evolving generation techniques.
中文:SpeechFake数据集通过包含40种合成工具生成的300万多样本、覆盖46种语言,解决了现有语音深度伪造数据规模不足的问题,为开发应对新型伪造技术的检测模型提供了重要资源。
English: The SpeechFake dataset addresses the limitations of existing speech deepfake detection resources by offering over 3 million diverse samples across 46 languages and 40 synthesis methods, enabling robust model training and evaluation against evolving threats.
Authors:Han Wu, Chong Wang, Zhiming Cui
Abstract:
Semi-supervised learning has proven highly effective in tackling the challenge of limited labeled training data in medical image segmentation. In general, current approaches, which rely on intra-image pixel-wise consistency training via pseudo-labeling, overlook the consistency at more comprehensive semantic levels (e.g., object region) and suffer from severe discrepancy of extracted features resulting from an imbalanced number of labeled and unlabeled data. To overcome these limitations, we present a new \underline{Du}al \underline{C}ross-\underline{i}mage \underline{S}emantic \underline{C}onsistency (DuCiSC) learning framework, for semi-supervised medical image segmentation. Concretely, beyond enforcing pixel-wise semantic consistency, DuCiSC proposes dual paradigms to encourage region-level semantic consistency across: 1) labeled and unlabeled images; and 2) labeled and fused images, by explicitly aligning their prototypes. Relying on the dual paradigms, DuCiSC can effectively establish consistent cross-image semantics via prototype representations, thereby addressing the feature discrepancy issue. Moreover, we devise a novel self-aware confidence estimation strategy to accurately select reliable pseudo labels, allowing for exploiting the training dynamics of unlabeled data. Our DuCiSC method is extensively validated on four datasets, including two popular binary benchmarks in segmenting the left atrium and pancreas, a multi-class Automatic Cardiac Diagnosis Challenge dataset, and a challenging scenario of segmenting the inferior alveolar nerve that features complicated anatomical structures, showing superior segmentation results over previous state-of-the-art approaches. Our code is publicly available at \href{https://github.com/ShanghaiTech-IMPACT/DuCiSC}{https://github.com/ShanghaiTech-IMPACT/DuCiSC}.
中文:DuCiSC框架通过引入跨图像区域级语义一致性和自感知置信度策略,解决了半监督医学图像分割中的特征差异问题,并在多个数据集上实现了优越的分割效果。
English: The DuCiSC framework enhances semi-supervised medical image segmentation by introducing dual cross-image semantic consistency at the region level and a self-aware confidence strategy to address feature discrepancies and improve pseudo-label reliability, achieving superior results across multiple datasets.
Authors:Haiquan Wang, Yi Chen, Shang Zeng, Yun Bian, Zhe Cui
Abstract:
Current evaluations of LLMs in the government domain primarily focus on safety considerations in specific scenarios, while the assessment of the models' own core capabilities, particularly domain relevance, remains insufficient. To address this gap, we propose GovRelBench, a benchmark specifically designed for evaluating the core capabilities of LLMs in the government domain. GovRelBench consists of government domain prompts and a dedicated evaluation tool, GovRelBERT. During the training process of GovRelBERT, we introduce the SoftGovScore method: this method trains a model based on the ModernBERT architecture by converting hard labels to soft scores, enabling it to accurately compute the text's government domain relevance score. This work aims to enhance the capability evaluation framework for large models in the government domain, providing an effective tool for relevant research and practice. Our code and dataset are available at https://github.com/pan-xi/GovRelBench.
中文: 当前政府领域大语言模型评估多关注特定场景安全性,而忽略模型核心能力如领域相关性,为此我们提出GovRelBench基准,通过政府领域提示词和基于SoftGovScore方法的GovRelBERT评估工具,精准计算文本与政府领域的相关性得分。
English: Current evaluations of LLMs in the government domain lack focus on core capabilities like domain relevance, leading to the development of GovRelBench, a benchmark with specialized prompts and the GovRelBERT evaluation tool that uses the SoftGovScore method to assess government-related text accuracy.
Authors:Amber Huang, Ian Scott Knight, Slava Naprienko
Abstract:
LIT-PCBA is widely used to benchmark virtual screening models, but our audit reveals that it is fundamentally compromised. We find extensive data leakage and molecular redundancy across its splits, including 2D-identical ligands within and across partitions, pervasive analog overlap, and low-diversity query sets. In ALDH1 alone, for instance, 323 active training -- validation analog pairs occur at ECFP4 Tanimoto similarity $\geq 0.6$; across all targets, 2,491 2D-identical inactives appear in both training and validation, with very few corresponding actives. These overlaps allow models to succeed through scaffold memorization rather than generalization, inflating enrichment factors and AUROC scores. These flaws are not incidental -- they are so severe that a trivial memorization-based baseline with no learnable parameters can exploit them to match or exceed the reported performance of state-of-the-art deep learning and 3D-similarity models. As a result, nearly all published results on LIT-PCBA are undermined. Even models evaluated in "zero-shot" mode are affected by analog leakage into the query set, weakening claims of generalization. In its current form, the benchmark does not measure a model's ability to recover novel chemotypes and should not be taken as evidence of methodological progress.
All code, data, and baseline implementations are available at: https://github.com/sievestack/LIT-PCBA-audit
中文摘要:LIT-PCBA基准测试因存在严重的数据泄露和分子冗余问题,导致模型通过记忆而非泛化能力获得虚高表现,这使得绝大多数已发表的研究结论失去有效性。
English Summary: The LIT-PCBA benchmark is critically flawed due to extensive data leakage and molecular redundancy, allowing models to achieve inflated performance through memorization rather than genuine generalization, thereby invalidating most published results.
Authors:Shijie Zhou, Ruiyi Zhang, Huaisheng Zhu, Branislav Kveton, Yufan Zhou, Jiuxiang Gu, Jian Chen, Changyou Chen
Abstract:
We introduce LLaVA-Reward, an efficient reward model designed to automatically evaluate text-to-image (T2I) generations across multiple perspectives, leveraging pretrained multimodal large language models (MLLMs). Existing MLLM-based approaches require instruction-following data for supervised fine-tuning and evaluate generation quality on analyzing text response, which is time-consuming and difficult to train. To address this problem, we propose LLaVA-Reward, which directly utilizes the hidden states of MLLMs given text-image pairs. To enhance the bidirectional interaction between visual and textual representations in decoder-only MLLMs, we further propose adding a Skip-connection Cross Attention (SkipCA) module. This design enhances text-image correlation reasoning by connecting early-layer visual features with later-layer hidden representations. In addition, LLaVA-Reward supports different types of preference data for efficient fine-tuning, including paired preference data and unpaired data. We train LLaVA-Reward on four evaluation perspectives: text-image alignment, fidelity/artifact, safety, and overall ranking. Empirical results demonstrate that LLaVA-Reward outperforms conventional and MLLM-based methods in generating human-aligned scores for automatic evaluations and inference-time scaling in text-to-image generations.
中文: LLaVA-Reward是一种高效的奖励模型,利用预训练多模态大语言模型的隐藏状态和跳跃连接交叉注意力模块,从多角度自动评估文生图生成质量,在自动评估和推理扩展方面优于现有方法。
English: LLaVA-Reward is an efficient reward model that uses pretrained multimodal large language models to automatically evaluate text-to-image generations from multiple perspectives by leveraging hidden states and a Skip-connection Cross Attention module for improved visual-textual reasoning.
Authors:Jicheng Yuan, Manh Nguyen Duc, Qian Liu, Manfred Hauswirth, Danh Le Phuoc
Abstract:
Vision-based bird's-eye-view (BEV) 3D object detection has advanced significantly in autonomous driving by offering cost-effectiveness and rich contextual information. However, existing methods often construct BEV representations by collapsing extracted object features, neglecting intrinsic environmental contexts, such as roads and pavements. This hinders detectors from comprehensively perceiving the characteristics of the physical world. To alleviate this, we introduce a multi-task learning framework, Collaborative Perceiver (CoP), that leverages spatial occupancy as auxiliary information to mine consistent structural and conceptual similarities shared between 3D object detection and occupancy prediction tasks, bridging gaps in spatial representations and feature refinement. To this end, we first propose a pipeline to generate dense occupancy ground truths incorporating local density information (LDO) for reconstructing detailed environmental information. Next, we employ a voxel-height-guided sampling (VHS) strategy to distill fine-grained local features according to distinct object properties. Furthermore, we develop a global-local collaborative feature fusion (CFF) module that seamlessly integrates complementary knowledge between both tasks, thus composing more robust BEV representations. Extensive experiments on the nuScenes benchmark demonstrate that CoP outperforms existing vision-based frameworks, achieving 49.5\% mAP and 59.2\% NDS on the test set. Code and supplementary materials are available at this link https://github.com/jichengyuan/Collaborative-Perceiver.
Chinese Summary: 本研究提出了一种名为协同感知器(CoP)的多任务学习框架,通过整合空间占据信息来改进基于视觉的3D物体检测,在nuScenes基准测试中实现了最先进的性能。
English Summary: The study introduces a multi-task learning framework called Collaborative Perceiver (CoP) that enhances vision-based 3D object detection by integrating spatial occupancy information, achieving state-of-the-art performance on the nuScenes benchmark.
Authors:Amirmohammad Shamaei, Alexander Stebner, Salome, Bosshart, Johanna Ospel, Gouri Ginde, Mariana Bento, Roberto Souza
Abstract:
Magnetic resonance imaging (MRI) is a crucial medical imaging modality. However, long acquisition times remain a significant challenge, leading to increased costs, and reduced patient comfort. Recent studies have shown the potential of using deep learning models that incorporate information from prior subject-specific MRI scans to improve reconstruction quality of present scans. Integrating this prior information requires registration of the previous scan to the current image reconstruction, which can be time-consuming. We propose a novel deep-learning-based MRI reconstruction framework which consists of an initial reconstruction network, a deep registration model, and a transformer-based enhancement network. We validated our method on a longitudinal dataset of T1-weighted MRI scans with 2,808 images from 18 subjects at four acceleration factors (R5, R10, R15, R20). Quantitative metrics confirmed our approach's superiority over existing methods (p < 0.05, Wilcoxon signed-rank test). Furthermore, we analyzed the impact of our MRI reconstruction method on the downstream task of brain segmentation and observed improved accuracy and volumetric agreement with reference segmentations. Our approach also achieved a substantial reduction in total reconstruction time compared to methods that use traditional registration algorithms, making it more suitable for real-time clinical applications. The code associated with this work is publicly available at https://github.com/amirshamaei/longitudinal-mri-deep-recon.
中文: 本研究提出了一种新颖的深度学习MRI重建框架,通过配准模型和Transformer网络整合先验扫描,相比现有方法显著提升了图像质量、缩短了重建时间,并改善了脑部分割任务的准确性。
English: This study introduces a novel deep-learning MRI reconstruction framework that integrates prior scans via a registration model and transformer network, significantly improving image quality, accelerating reconstruction time, and enhancing downstream brain segmentation accuracy compared to existing methods.
Authors:Satyananda Kashyap, Sola Shirai, Nandana Mihindukulasooriya, Horst Samulowitz
Abstract:
Extracting structured information from text, such as key-value pairs that could augment tabular data, is quite useful in many enterprise use cases. Although large language models (LLMs) have enabled numerous automated pipelines for converting natural language into structured formats, there is still a lack of benchmarks for evaluating their extraction quality, especially in specific domains or focused documents specific to a given organization. Building such benchmarks by manual annotations is labour-intensive and limits the size and scalability of the benchmarks. In this work, we present StructText, an end-to-end framework for automatically generating high-fidelity benchmarks for key-value extraction from text using existing tabular data. It uses available tabular data as structured ground truth, and follows a two-stage ``plan-then-execute'' pipeline to synthetically generate corresponding natural-language text. To ensure alignment between text and structured source, we introduce a multi-dimensional evaluation strategy that combines (a) LLM-based judgments on factuality, hallucination, and coherence and (b) objective extraction metrics measuring numeric and temporal accuracy. We evaluated the proposed method on 71,539 examples across 49 datasets. Results reveal that while LLMs achieve strong factual accuracy and avoid hallucination, they struggle with narrative coherence in producing extractable text. Notably, models presume numerical and temporal information with high fidelity yet this information becomes embedded in narratives that resist automated extraction. We release a framework, including datasets, evaluation tools, and baseline extraction systems, to support continued research.
中文: 本文提出StructText框架,通过自动生成高质量基准来评估文本中的键值对提取,解决了评估方法缺乏可扩展性的问题,并揭示了大语言模型虽保持事实准确性但在生成文本的叙事连贯性方面存在不足。
English: This paper introduces StructText, an automated framework for generating high-fidelity benchmarks to evaluate key-value extraction from text, addressing the lack of scalable evaluation methods while revealing that LLMs maintain factual accuracy but struggle with narrative coherence in generated text.
Authors:Feixiang Zhou, Zhuangzhi Gao, He Zhao, Jianyang Xie, Yanda Meng, Yitian Zhao, Gregory Y. H. Lip, Yalin Zheng
Abstract:
Accurate segmentation of tubular structures, such as vascular networks, plays a critical role in various medical domains. A remaining significant challenge in this task is structural fragmentation, which can adversely impact downstream applications. Existing methods primarily focus on designing various loss functions to constrain global topological structures. However, they often overlook local discontinuity regions, leading to suboptimal segmentation results. To overcome this limitation, we propose a novel Global-to-Local Connectivity Preservation (GLCP) framework that can simultaneously perceive global and local structural characteristics of tubular networks. Specifically, we propose an Interactive Multi-head Segmentation (IMS) module to jointly learn global segmentation, skeleton maps, and local discontinuity maps, respectively. This enables our model to explicitly target local discontinuity regions while maintaining global topological integrity. In addition, we design a lightweight Dual-Attention-based Refinement (DAR) module to further improve segmentation quality by refining the resulting segmentation maps. Extensive experiments on both 2D and 3D datasets demonstrate that our GLCP achieves superior accuracy and continuity in tubular structure segmentation compared to several state-of-the-art approaches. The source codes will be available at https://github.com/FeixiangZhou/GLCP.
中文: 本研究提出的全局-局部连通性保持(GLCP)框架通过交互式多头分割模块同步感知管状网络的全局拓扑特征与局部断裂区域,结合双注意力优化机制,在二维和三维数据集上实现了优于现有方法的血管分割精度与连续性。
English: The proposed Global-to-Local Connectivity Preservation (GLCP) framework addresses structural fragmentation in tubular structure segmentation by jointly learning global topological features and local discontinuity regions, achieving superior accuracy and continuity through interactive multi-head segmentation and dual-attention refinement.
Authors:Théo Sourget, David Restrepo, Céline Hudelot, Enzo Ferrante, Stergios Christodoulidis, Maria Vakalopoulou
Abstract:
Motivated by the strong performance of CLIP-based models in natural image-text domains, recent efforts have adapted these architectures to medical tasks, particularly in radiology, where large paired datasets of images and reports, such as chest X-rays, are available. While these models have shown encouraging results in terms of accuracy and discriminative performance, their fairness and robustness in the different clinical tasks remain largely underexplored. In this study, we extensively evaluate six widely used CLIP-based models on chest X-ray classification using three publicly available datasets: MIMIC-CXR, NIH-CXR14, and NEATX. We assess the models fairness across six conditions and patient subgroups based on age, sex, and race. Additionally, we assess the robustness to shortcut learning by evaluating performance on pneumothorax cases with and without chest drains. Our results indicate performance gaps between patients of different ages, but more equitable results for the other attributes. Moreover, all models exhibit lower performance on images without chest drains, suggesting reliance on spurious correlations. We further complement the performance analysis with a study of the embeddings generated by the models. While the sensitive attributes could be classified from the embeddings, we do not see such patterns using PCA, showing the limitations of these visualisation techniques when assessing models. Our code is available at https://github.com/TheoSourget/clip_cxr_fairness
中文: 近期研究评估了六种基于CLIP的模型在胸部X光分类中的公平性和鲁棒性,发现不同年龄组患者存在性能差异,且模型依赖胸管等伪相关性,同时揭示了嵌入技术在检测敏感属性方面的局限性。
English: Recent research evaluates the fairness and robustness of six CLIP-based models on chest X-ray classification, revealing performance disparities across age groups and reliance on spurious correlations like chest drains, while also examining embedding limitations in detecting sensitive attributes.
Authors:Amartya Banerjee, Xingyu Xu, Caroline Moosmüller, Harlin Lee
Abstract:
In an inverse problem, the goal is to recover an unknown parameter (e.g., an image) that has typically undergone some lossy or noisy transformation during measurement. Recently, deep generative models, particularly diffusion models, have emerged as powerful priors for protein structure generation. However, integrating noisy experimental data from multiple sources to guide these models remains a significant challenge. Existing methods often require precise knowledge of experimental noise levels and manually tuned weights for each data modality. In this work, we introduce Adam-PnP, a Plug-and-Play framework that guides a pre-trained protein diffusion model using gradients from multiple, heterogeneous experimental sources. Our framework features an adaptive noise estimation scheme and a dynamic modality weighting mechanism integrated into the diffusion process, which reduce the need for manual hyperparameter tuning. Experiments on complex reconstruction tasks demonstrate significantly improved accuracy using Adam-PnP.
Chinese Summary: Adam-PnP是一种即插即用框架,通过自适应噪声估计和动态多源实验数据加权机制引导蛋白质扩散模型,显著减少了人工参数调整需求并提升了结构重建精度。
English Summary: Adam-PnP is a Plug-and-Play framework that enhances protein structure reconstruction by guiding diffusion models with adaptive noise estimation and dynamic weighting of multiple experimental data sources, reducing manual tuning while improving accuracy.
Authors:Van Chung Nguyen, Pratik Walunj, Chuong Le, An Duy Nguyen, Hung Manh La
Abstract:
Nonlinear Model Predictive Control (NMPC) is a powerful approach for controlling highly dynamic robotic systems, as it accounts for system dynamics and optimizes control inputs at each step. However, its high computational complexity makes implementation on resource-constrained microcontrollers impractical. While recent studies have demonstrated the feasibility of Model Predictive Control (MPC) with linearized dynamics on microcontrollers, applying full NMPC remains a significant challenge. This work presents an efficient solution for generating and deploying NMPC on microcontrollers (NMPCM) to control quadrotor UAVs. The proposed method optimizes computational efficiency while maintaining high control accuracy. Simulations in Gazebo/ROS and real-world experiments validate the effectiveness of the approach, demonstrating its capability to achieve high-frequency NMPC execution in real-time systems. The code is available at: https://github.com/aralab-unr/NMPCM.
中文: 本研究提出了一种在微控制器上高效实现非线性模型预测控制(NMPCM)以控制四旋翼无人机的方法,在保持高精度的同时优化了计算效率,仿真和实际实验均验证了其实时性能。
English: This work introduces an efficient method for implementing Nonlinear Model Predictive Control on microcontrollers (NMPCM) to control quadrotor UAVs, optimizing computational efficiency while maintaining high accuracy, with both simulations and real-world experiments validating its real-time performance.
Authors:Christopher Indris, Raiyan Rahman, Goetz Bramesfeld, Guanghui Wang
Abstract:
Aerial wildlife tracking is critical for conservation efforts and relies on detecting small objects on the ground below the aircraft. It presents technical challenges: crewed aircraft are expensive, risky and disruptive; autonomous drones have limited computational capacity for onboard AI systems. Since the objects of interest may appear only a few pixels wide, small object detection is an inherently challenging computer vision subfield compounded by computational efficiency needs. This paper applies a patching augmentation to datasets to study model performance under various settings. A comparative study of three common yet architecturally diverse object detectors is conducted using the data, varying the patching method's hyperparameters against detection accuracy. Each model achieved at least 93\% mAP@IoU=0.5 on at least one patching configuration. Statistical analyses provide an in-depth commentary on the effects of various factors. Analysis also shows that faster, simpler models are about as effective as models that require more computational power for this task and perform well given limited patch scales, encouraging UAV deployment. Datasets and models will be made available via https://github.com/chrisindris/Moose.
Chinese: 本研究证明,对数据集应用分块增强方法可实现高效的空中野生动物追踪小目标检测,且简化模型与计算密集型模型性能相当,有利于在自主无人机上部署。
English: This study demonstrates that applying patching augmentation to datasets enables efficient small object detection for aerial wildlife tracking, with simpler models performing comparably to computationally intensive ones, facilitating deployment on autonomous drones.
Authors:Yingxuan Yang, Mulei Ma, Yuxuan Huang, Huacan Chai, Chenyu Gong, Haoran Geng, Yuanjian Zhou, Ying Wen, Meng Fang, Muhao Chen, Shangding Gu, Ming Jin, Costas Spanos, Yang Yang, Pieter Abbeel, Dawn Song, Weinan Zhang, Jun Wang
Abstract:
The emergence of AI agents powered by large language models (LLMs) marks a pivotal shift toward the Agentic Web, a new phase of the internet defined by autonomous, goal-driven interactions. In this paradigm, agents interact directly with one another to plan, coordinate, and execute complex tasks on behalf of users. This transition from human-driven to machine-to-machine interaction allows intent to be delegated, relieving users from routine digital operations and enabling a more interactive, automated web experience. In this paper, we present a structured framework for understanding and building the Agentic Web. We trace its evolution from the PC and Mobile Web eras and identify the core technological foundations that support this shift. Central to our framework is a conceptual model consisting of three key dimensions: intelligence, interaction, and economics. These dimensions collectively enable the capabilities of AI agents, such as retrieval, recommendation, planning, and collaboration. We analyze the architectural and infrastructural challenges involved in creating scalable agentic systems, including communication protocols, orchestration strategies, and emerging paradigms such as the Agent Attention Economy. We conclude by discussing the potential applications, societal risks, and governance issues posed by agentic systems, and outline research directions for developing open, secure, and intelligent ecosystems shaped by both human intent and autonomous agent behavior. A continuously updated collection of relevant studies for agentic web is available at: https://github.com/SafeRL-Lab/agentic-web.
中文摘要:基于大语言模型的AI智能体正推动互联网向"智能体网络"演进,通过自主交互实现复杂任务,需建立涵盖智能、交互和经济维度的新框架来应对技术架构与社会治理的双重挑战。
English Summary: The emergence of AI agents powered by large language models is driving the transition to an Agentic Web, where autonomous agents perform complex tasks through machine-to-machine interactions, requiring new frameworks to address technological and societal challenges.
Authors:Haowei Lin, Xiangyu Wang, Jianzhu Ma, Yitao Liang
Abstract:
Scaling laws are fundamental mathematical relationships that predict how neural network performance evolves with changes in variables such as model size, dataset size, and computational resources. Traditionally, discovering these laws requires extensive human expertise and manual experimentation. We introduce EvoSLD, an automated framework for Scaling Law Discovery (SLD) that leverages evolutionary algorithms guided by Large Language Models (LLMs) to co-evolve symbolic expressions and their optimization routines. Formulated to handle scaling variables, control variables, and response metrics across diverse experimental settings, EvoSLD searches for parsimonious, universal functional forms that minimize fitting errors on grouped data subsets. Evaluated on five real-world scenarios from recent literature, EvoSLD rediscovers exact human-derived laws in two cases and surpasses them in others, achieving up to orders-of-magnitude reductions in normalized mean squared error on held-out test sets. Compared to baselines like symbolic regression and ablated variants, EvoSLD demonstrates superior accuracy, interpretability, and efficiency, highlighting its potential to accelerate AI research. Code is available at https://github.com/linhaowei1/SLD.
中文: 本文提出的SLDAgent通过协同优化自主发现扩展定律,首次证明人工智能生成的定律在各项任务中均能超越人工推导的对应定律,在预测精度和实际应用方面展现出显著优势。
English: This paper introduces SLDAgent, an evolution-based agent that autonomously discovers scaling laws through co-optimization, demonstrating for the first time that AI-generated laws consistently outperform human-derived counterparts in accuracy and practical utility across diverse tasks.
Authors:Donglu Yang, Liang Zhang, Zihao Yue, Liangyu Chen, Yichen Xu, Wenxuan Wang, Qin Jin
Abstract:
Charts are a fundamental visualization format widely used in data analysis across research and industry. While enabling users to edit charts based on high-level intentions is of great practical value, existing methods primarily rely on natural language instructions, which are often too ambiguous to support fine-grained editing. In this work, we introduce a novel paradigm for multimodal chart editing, where user intent is expressed through a combination of natural language and visual indicators that explicitly highlight the elements to be modified. To support this paradigm, we present Chart$\text{M}^3$, a new benchmark for Multimodal chart editing with Multi-level complexity and Multi-perspective evaluation. Chart$\text{M}^3$ contains 1,000 samples spanning four levels of editing difficulty. Each sample includes triplets in the form of (chart, code, multimodal instructions). To comprehensively evaluate chart editing models, Chart$\text{M}^3$ provides metrics that assess both visual appearance and code correctness. Our benchmark reveals significant limitations in current multimodal large language models (MLLMs), including GPT-4o, particularly in their ability to interpret and act on visual indicators. To address this, we construct Chart$\text{M}^3$-Train, a large-scale training set with 24,000 multimodal chart editing samples. Fine-tuning MLLMs on this dataset leads to substantial improvements, demonstrating the importance of multimodal supervision in building practical chart editing systems. Our datasets, codes, and evaluation tools are available at https://github.com/MLrollIT/ChartM3. %https://github.com/MLrollIT/ChartM3Our datasets, codes, and evaluation tools are available at https://github.com/yaolinli/VCE.
中文: 本文提出了一种结合自然语言与视觉标记的多模态图表编辑新范式,建立了ChartM3基准进行评估,并证明在专业数据集上微调多模态大模型能显著提升其视觉编辑指令的解析能力。
English: This paper introduces a multimodal chart editing paradigm combining natural language and visual indicators, presents the ChartM3 benchmark for evaluation, and demonstrates that fine-tuning MLLMs on a specialized dataset significantly improves performance in interpreting visual editing instructions.
Authors:Nicolas Pinon, Carole Lartizien
Abstract:
Unsupervised anomaly detection (UAD) aims to detect anomalies without labeled data, a necessity in many machine learning applications where anomalous samples are rare or not available. Most state-of-the-art methods fall into two categories: reconstruction-based approaches, which often reconstruct anomalies too well, and decoupled representation learning with density estimators, which can suffer from suboptimal feature spaces. While some recent methods attempt to couple feature learning and anomaly detection, they often rely on surrogate objectives, restrict kernel choices, or introduce approximations that limit their expressiveness and robustness. To address this challenge, we propose a novel method that tightly couples representation learning with an analytically solvable one-class SVM (OCSVM), through a custom loss formulation that directly aligns latent features with the OCSVM decision boundary. The model is evaluated on two tasks: a new benchmark based on MNIST-C, and a challenging brain MRI subtle lesion detection task. Unlike most methods that focus on large, hyperintense lesions at the image level, our approach succeeds to target small, non-hyperintense lesions, while we evaluate voxel-wise metrics, addressing a more clinically relevant scenario. Both experiments evaluate a form of robustness to domain shifts, including corruption types in MNIST-C and scanner/age variations in MRI. Results demonstrate performance and robustness of our proposed mode,highlighting its potential for general UAD and real-world medical imaging applications. The source code is available at https://github.com/Nicolas-Pinon/uad_ocsvm_guided_repr_learning
中文摘要:本文提出了一种新颖的无监督异常检测方法,通过自定义损失函数将表征学习与一类支持向量机紧密结合,在领域偏移基准测试和医学影像任务中检测细微异常方面展现出卓越的性能和鲁棒性。
English Summary: This paper introduces a novel unsupervised anomaly detection method that tightly couples representation learning with a one-class SVM through a custom loss function, demonstrating superior performance and robustness in detecting subtle anomalies across domain-shifted benchmarks and medical imaging tasks.
Authors:Aditya Pujari, Ajita Rattani
Abstract:
The rapid advancement of voice generation technologies has enabled the synthesis of speech that is perceptually indistinguishable from genuine human voices. While these innovations facilitate beneficial applications such as personalized text-to-speech systems and voice preservation, they have also introduced significant risks, including deepfake impersonation scams and synthetic media-driven disinformation campaigns. Recent reports indicate that in 2024, deepfake fraud attempts surged by over 1,300% compared to 2023, underscoring the urgent need for robust audio content authentication. The financial sector has been particularly impacted, with a loss of over 10 million USD to voice scams and individual victims reporting losses exceeding $6,000 from AI-generated deepfake calls. In response, regulators and governments worldwide are enacting measures to improve AI content transparency and traceability, emphasizing the development of forensic tools and watermarking techniques as essential strategies to uphold media integrity.
中文: 语音生成技术虽能合成逼真语音,却也带来深度伪造诈骗等严重风险,2024年欺诈激增超1300%,造成千万美元损失,全球正通过开发认证工具和加强监管来应对。
English: Voice generation technology enables realistic speech synthesis but poses severe risks like deepfake scams, with fraud attempts surging over 1,300% in 2024 and financial losses exceeding $10 million, prompting global efforts to develop authentication tools and regulations.
Authors:Oleg Atamanenko, Anna Chalova, Joseph Coombes, Nikki Cope, Phillip Dang, Zhifeng Deng, Jimmy Du, Michael Ermolenko, Feifan Fan, Yufei Feng, Cheryl Fichter, Pavel Filimonov, Louis Fischer, Kylan Gibbs, Valeria Gusarova, Pavel Karpik, Andreas Assad Kottner, Ian Lee, Oliver Louie, Jasmine Mai, Mikhail Mamontov, Suri Mao, Nurullah Morshed, Igor Poletaev, Florin Radu, Dmytro Semernia, Evgenii Shingarev, Vikram Sivaraja, Peter Skirko, Rinat Takhautdinov, Robert Villahermosa, Jean Wang
Abstract:
We introduce Inworld TTS-1, a set of two Transformer-based autoregressive text-to-speech (TTS) models. Our largest model, TTS-1-Max, has 8.8B parameters and is designed for utmost quality and expressiveness in demanding applications. TTS-1 is our most efficient model, with 1.6B parameters, built for real-time speech synthesis and on-device use cases. By scaling train-time compute and applying a sequential process of pre-training, fine-tuning, and RL-alignment of the speech-language model (SpeechLM) component, both models achieve state-of-the-art performance on a variety of benchmarks, demonstrating exceptional quality relying purely on in-context learning of the speaker's voice. Inworld TTS-1 and TTS-1-Max can generate high-resolution 48 kHz speech with low latency, and support 11 languages with fine-grained emotional control and non-verbal vocalizations through audio markups. We additionally open-source our training and modeling code under an MIT license.
中文: Inworld TTS-1推出两款基于Transformer的语音合成模型,其中88亿参数的TTS-1-Max面向高质量应用,16亿参数的TTS-1适用于实时场景,通过先进训练方法实现顶尖性能,支持48kHz多语言语音合成与精细情感控制。
English: Inworld TTS-1 introduces two Transformer-based TTS models, with the 8.8B-parameter TTS-1-Max for high-quality applications and the 1.6B-parameter TTS-1 for real-time use, both achieving state-of-the-art performance through advanced training and supporting 48kHz multilingual speech with emotional control.
Authors:Zheng Hui, Yijiang River Dong, Ehsan Shareghi, Nigel Collier
Abstract:
As large language models (LLMs) are increasingly deployed in high-risk domains such as law, finance, and medicine, systematically evaluating their domain-specific safety and compliance becomes critical. While prior work has largely focused on improving LLM performance in these domains, it has often neglected the evaluation of domain-specific safety risks. To bridge this gap, we first define domain-specific safety principles for LLMs based on the AMA Principles of Medical Ethics, the ABA Model Rules of Professional Conduct, and the CFA Institute Code of Ethics. Building on this foundation, we introduce Trident-Bench, a benchmark specifically targeting LLM safety in the legal, financial, and medical domains. We evaluated 19 general-purpose and domain-specialized models on Trident-Bench and show that it effectively reveals key safety gaps -- strong generalist models (e.g., GPT, Gemini) can meet basic expectations, whereas domain-specialized models often struggle with subtle ethical nuances. This highlights an urgent need for finer-grained domain-specific safety improvements. By introducing Trident-Bench, our work provides one of the first systematic resources for studying LLM safety in law and finance, and lays the groundwork for future research aimed at reducing the safety risks of deploying LLMs in professionally regulated fields. Code and benchmark will be released at: https://github.com/zackhuiiiii/TRIDENT
中文: 本文提出Trident-Bench基准,用于评估大语言模型在法律、金融和医疗领域的领域特定安全性,发现专业模型在伦理细节上存在明显不足,而通用模型仅能达到基本要求,凸显了细化安全改进的迫切需求。
English: This paper introduces Trident-Bench, a benchmark for evaluating domain-specific safety of large language models in legal, financial, and medical fields, revealing critical gaps where specialized models struggle with ethical nuances despite general-purpose models meeting basic expectations.
Authors:Karan Mirhosseini, Arya Aftab, Alireza Sheikh
Abstract:
In an era of radical technology transformations, technology maps play a crucial role in enhancing decision making. These maps heavily rely on automated methods of technology extraction. This paper introduces Retrieval Augmented Technology Extraction (RATE), a Large Language Model (LLM) based pipeline for automated technology extraction from scientific literature. RATE combines Retrieval Augmented Generation (RAG) with multi-definition LLM-based validation. This hybrid method results in high recall in candidate generation alongside with high precision in candidate filtering. While the pipeline is designed to be general and widely applicable, we demonstrate its use on 678 research articles focused on Brain-Computer Interfaces (BCIs) and Extended Reality (XR) as a case study. Consequently, The validated technology terms by RATE were mapped into a co-occurrence network, revealing thematic clusters and structural features of the research landscape. For the purpose of evaluation, a gold standard dataset of technologies in 70 selected random articles had been curated by the experts. In addition, a technology extraction model based on Bidirectional Encoder Representations of Transformers (BERT) was used as a comparative method. RATE achieved F1-score of 91.27%, Significantly outperforming BERT with F1-score of 53.73%. Our findings highlight the promise of definition-driven LLM methods for technology extraction and mapping. They also offer new insights into emerging trends within the BCI-XR field. The source code is available https://github.com/AryaAftab/RATE
中文: 本文提出RATE框架,通过结合检索增强生成与多定义验证的LLM技术提取方法,在脑机接口与扩展现实案例中实现91.27%的F1值,显著优于BERT模型,为技术图谱构建提供新方案。
English: This paper introduces RATE, an LLM-based pipeline that combines retrieval-augmented generation with multi-definition validation to achieve high-precision automated technology extraction from scientific literature, significantly outperforming BERT with a 91.27% F1-score in BCI-XR case studies.
Authors:Bereket A. Yilma, Luis A. Leiva
Abstract:
Art Therapy (AT) is an established practice that facilitates emotional processing and recovery through creative expression. Recently, Visual Art Recommender Systems (VA RecSys) have emerged to support AT, demonstrating their potential by personalizing therapeutic artwork recommendations. Nonetheless, current VA RecSys rely on visual stimuli for user modeling, limiting their ability to capture the full spectrum of emotional responses during preference elicitation. Previous studies have shown that music stimuli elicit unique affective reflections, presenting an opportunity for cross-domain recommendation (CDR) to enhance personalization in AT. Since CDR has not yet been explored in this context, we propose a family of CDR methods for AT based on music-driven preference elicitation. A large-scale study with 200 users demonstrates the efficacy of music-driven preference elicitation, outperforming the classic visual-only elicitation approach. Our source code, data, and models are available at https://github.com/ArtAICare/Affect-aware-CDR
中文: 艺术疗法通过基于音乐驱动偏好诱导的跨领域推荐方法得到增强,该方法在捕捉情感反应方面优于传统的仅视觉方法。
English: Art therapy is enhanced by cross-domain recommendation methods that use music-driven preference elicitation, which outperforms traditional visual-only approaches in capturing emotional responses.
Authors:Franck Bardol
Abstract:
Large Language Models like GPT-4 adjust their responses not only based on the question asked, but also on how it is emotionally phrased. We systematically vary the emotional tone of 156 prompts - spanning controversial and everyday topics - and analyze how it affects model responses. Our findings show that GPT-4 is three times less likely to respond negatively to a negatively framed question than to a neutral one. This suggests a "rebound" bias where the model overcorrects, often shifting toward neutrality or positivity. On sensitive topics (e.g., justice or politics), this effect is even more pronounced: tone-based variation is suppressed, suggesting an alignment override. We introduce concepts like the "tone floor" - a lower bound in response negativity - and use tone-valence transition matrices to quantify behavior. Visualizations based on 1536-dimensional embeddings confirm semantic drift based on tone. Our work highlights an underexplored class of biases driven by emotional framing in prompts, with implications for AI alignment and trust. Code and data are available at: https://github.com/bardolfranck/llm-responses-viewer
Chinese: GPT-4的回应受提示情感语调显著影响,表现出“反弹”偏差,即在负面措辞上过度修正为中立或积极回应,尤其在敏感话题上更为明显,揭示了AI对齐中一类未被充分探索的偏见。
English: GPT-4's responses are significantly influenced by the emotional tone of prompts, showing a "rebound" bias where it overcorrects negative phrasing by shifting toward neutrality or positivity, especially on sensitive topics, revealing an underexplored class of biases in AI alignment.
Authors:Yukang Cao, Jiahao Lu, Zhisheng Huang, Zhuowen Shen, Chengfeng Zhao, Fangzhou Hong, Zhaoxi Chen, Xin Li, Wenping Wang, Yuan Liu, Ziwei Liu
Abstract:
Reconstructing 4D spatial intelligence from visual observations has long been a central yet challenging task in computer vision, with broad real-world applications. These range from entertainment domains like movies, where the focus is often on reconstructing fundamental visual elements, to embodied AI, which emphasizes interaction modeling and physical realism. Fueled by rapid advances in 3D representations and deep learning architectures, the field has evolved quickly, outpacing the scope of previous surveys. Additionally, existing surveys rarely offer a comprehensive analysis of the hierarchical structure of 4D scene reconstruction. To address this gap, we present a new perspective that organizes existing methods into five progressive levels of 4D spatial intelligence: (1) Level 1 -- reconstruction of low-level 3D attributes (e.g., depth, pose, and point maps); (2) Level 2 -- reconstruction of 3D scene components (e.g., objects, humans, structures); (3) Level 3 -- reconstruction of 4D dynamic scenes; (4) Level 4 -- modeling of interactions among scene components; and (5) Level 5 -- incorporation of physical laws and constraints. We conclude the survey by discussing the key challenges at each level and highlighting promising directions for advancing toward even richer levels of 4D spatial intelligence. To track ongoing developments, we maintain an up-to-date project page: https://github.com/yukangcao/Awesome-4D-Spatial-Intelligence.
Chinese: 本综述提出了一个层次化的4D空间智能重建框架,将现有方法划分为从基础3D属性到物理约束的五个递进层级,同时探讨了当前研究空白和未来发展方向。
English: This survey introduces a hierarchical framework for 4D spatial intelligence reconstruction, organizing methods into five progressive levels from basic 3D attributes to physical constraints, while addressing current gaps and future directions in the field.
Authors:Haoyang Liu, Yijiang Li, Haohan Wang
Abstract:
Gene expression analysis holds the key to many biomedical discoveries, yet extracting insights from raw transcriptomic data remains formidable due to the complexity of multiple large, semi-structured files and the need for extensive domain expertise. Current automation approaches are often limited by either inflexible workflows that break down in edge cases or by fully autonomous agents that lack the necessary precision for rigorous scientific inquiry. GenoMAS charts a different course by presenting a team of LLM-based scientists that integrates the reliability of structured workflows with the adaptability of autonomous agents. GenoMAS orchestrates six specialized LLM agents through typed message-passing protocols, each contributing complementary strengths to a shared analytic canvas. At the heart of GenoMAS lies a guided-planning framework: programming agents unfold high-level task guidelines into Action Units and, at each juncture, elect to advance, revise, bypass, or backtrack, thereby maintaining logical coherence while bending gracefully to the idiosyncrasies of genomic data.
On the GenoTEX benchmark, GenoMAS reaches a Composite Similarity Correlation of 89.13% for data preprocessing and an F$_1$ of 60.48% for gene identification, surpassing the best prior art by 10.61% and 16.85% respectively. Beyond metrics, GenoMAS surfaces biologically plausible gene-phenotype associations corroborated by the literature, all while adjusting for latent confounders. Code is available at https://github.com/Liu-Hy/GenoMAS.
中文: GenoMAS 通过整合结构化工作流与自主代理的LLM科学家团队,解决了当前基因表达分析自动化的局限,在基准测试中表现优异,并能发现生物学上合理的基因-表型关联。
English: GenoMAS introduces a team of LLM-based scientists that combines structured workflows with autonomous agents to overcome the limitations of current automation in gene expression analysis, achieving superior performance on benchmarks and uncovering biologically plausible gene-phenotype associations.
Authors:Weichen Zhang, Yiyou Sun, Pohao Huang, Jiayue Pu, Heyue Lin, Dawn Song
Abstract:
Hallucinations pose critical risks for large language model (LLM)-based agents, often manifesting as hallucinative actions resulting from fabricated or misinterpreted information within the cognitive context. While recent studies have exposed such failures, existing evaluations remain fragmented and lack a principled testbed. In this paper, we present MIRAGE-Bench--Measuring Illusions in Risky AGEnt settings--the first unified benchmark for eliciting and evaluating hallucinations in interactive LLM-agent scenarios. We begin by introducing a three-part taxonomy to address agentic hallucinations: actions that are unfaithful to (i) task instructions, (ii) execution history, or (iii) environment observations. To analyze, we first elicit such failures by performing a systematic audit of existing agent benchmarks, then synthesize test cases using a snapshot strategy that isolates decision points in deterministic and reproducible manners. To evaluate hallucination behaviors, we adopt a fine-grained-level LLM-as-a-Judge paradigm with tailored risk-aware prompts, enabling scalable, high-fidelity assessment of agent actions without enumerating full action spaces. MIRAGE-Bench provides actionable insights on failure modes of LLM agents and lays the groundwork for principled progress in mitigating hallucinations in interactive environments.
中文摘要:MIRAGE-Bench是首个用于评估交互式LLM智能体幻觉的统一基准,通过三要素分类法和系统化测试方法,有效识别任务执行与环境交互中的错误行为。
English Summary: MIRAGE-Bench is the first unified benchmark for evaluating hallucinations in interactive LLM agents, featuring a three-part taxonomy and systematic testing methodology to identify failures in task execution and environment interaction.
Authors:Licai Sun, Xingxun Jiang, Haoyu Chen, Yante Li, Zheng Lian, Biu Liu, Yuan Zong, Wenming Zheng, Jukka M. Leppänen, Guoying Zhao
Abstract:
Current facial emotion recognition systems are predominately trained to predict a fixed set of predefined categories or abstract dimensional values. This constrained form of supervision hinders generalization and applicability, as it reduces the rich and nuanced spectrum of emotions into oversimplified labels or scales. In contrast, natural language provides a more flexible, expressive, and interpretable way to represent emotions, offering a much broader source of supervision. Yet, leveraging semantically rich natural language captions as supervisory signals for facial emotion representation learning remains relatively underexplored, primarily due to two key challenges: 1) the lack of large-scale caption datasets with rich emotional semantics, and 2) the absence of effective frameworks tailored to harness such rich supervision. To this end, we introduce EmoCap100K, a large-scale facial emotion caption dataset comprising over 100,000 samples, featuring rich and structured semantic descriptions that capture both global affective states and fine-grained local facial behaviors. Building upon this dataset, we further propose EmoCapCLIP, which incorporates a joint global-local contrastive learning framework enhanced by a cross-modal guided positive mining module. This design facilitates the comprehensive exploitation of multi-level caption information while accommodating semantic similarities between closely related expressions. Extensive evaluations on over 20 benchmarks covering five tasks demonstrate the superior performance of our method, highlighting the promise of learning facial emotion representations from large-scale semantically rich captions. The code and data will be available at https://github.com/sunlicai/EmoCapCLIP.
中文: 当前面部情绪识别系统受限于过度简化的标签,而本研究提出了包含丰富语义描述的大规模数据集EmoCap100K和采用对比学习的EmoCapCLIP框架,通过利用自然语言监督显著提升了情绪表征能力,在多项基准测试中表现优异。
English: Current facial emotion recognition systems are limited by oversimplified labels, but this research introduces EmoCap100K, a large-scale dataset with rich semantic captions, and EmoCapCLIP, a framework that uses contrastive learning to improve emotion representation, demonstrating superior performance across multiple benchmarks.
Authors:Fang Li
Abstract:
Deep Neural Networks (DNNs) deliver impressive performance but their black-box nature limits deployment in high-stakes domains requiring transparency. We introduce Compositional Function Networks (CFNs), a novel framework that builds inherently interpretable models by composing elementary mathematical functions with clear semantics. Unlike existing interpretable approaches that are limited to simple additive structures, CFNs support diverse compositional patterns -- sequential, parallel, and conditional -- enabling complex feature interactions while maintaining transparency. A key innovation is that CFNs are fully differentiable, allowing efficient training through standard gradient descent. We demonstrate CFNs' versatility across multiple domains, from symbolic regression to image classification with deep hierarchical networks. Our empirical evaluation shows CFNs achieve competitive performance against black-box models (96.24% accuracy on CIFAR-10) while outperforming state-of-the-art interpretable models like Explainable Boosting Machines. By combining the hierarchical expressiveness and efficient training of deep learning with the intrinsic interpretability of well-defined mathematical functions, CFNs offer a powerful framework for applications where both performance and accountability are paramount.
Chinese: 组合函数网络(CFNs)提出了一种本质可解释的框架,通过组合基础数学函数实现与黑盒模型相竞争的性能,同时借助多样化组合模式和可微分训练确保透明度。
English: Compositional Function Networks (CFNs) introduce an inherently interpretable framework that composes elementary mathematical functions to achieve competitive performance with black-box models while ensuring transparency through diverse compositional patterns and differentiable training.
Authors:Shen Li, Liuyi Yao, Wujia Niu, Lan Zhang, Yaliang Li
Abstract:
Large visual-language models (LVLMs) integrate aligned large language models (LLMs) with visual modules to process multimodal inputs. However, the safety mechanisms developed for text-based LLMs do not naturally extend to visual modalities, leaving LVLMs vulnerable to harmful image inputs. To address this cross-modal safety gap, we introduce security tensors - trainable input vectors applied during inference through either the textual or visual modality. These tensors transfer textual safety alignment to visual processing without modifying the model's parameters. They are optimized using a curated dataset containing (i) malicious image-text pairs requiring rejection, (ii) contrastive benign pairs with text structurally similar to malicious queries, with the purpose of being contrastive examples to guide visual reliance, and (iii) general benign samples preserving model functionality. Experimental results demonstrate that both textual and visual security tensors significantly enhance LVLMs' ability to reject diverse harmful visual inputs while maintaining near-identical performance on benign tasks. Further internal analysis towards hidden-layer representations reveals that security tensors successfully activate the language module's textual "safety layers" in visual inputs, thereby effectively extending text-based safety to the visual modality.
中文摘要:安全张量作为可训练的输入向量被引入,能将文本安全机制迁移到大型视觉语言模型的视觉处理中,有效增强其拒绝有害视觉输入的能力,同时保持良性任务性能。
English Summary: Security tensors are introduced as trainable input vectors that transfer textual safety mechanisms to visual processing in large visual-language models, effectively enhancing their ability to reject harmful visual inputs while preserving performance on benign tasks.
Authors:Xinhan Di, Kristin Qi, Pengqian Yu
Abstract:
Recent advances in diffusion-based video generation have enabled photo-realistic short clips, but current methods still struggle to achieve multi-modal consistency when jointly generating whole-body motion and natural speech. Current approaches lack comprehensive evaluation frameworks that assess both visual and audio quality, and there are insufficient benchmarks for region-specific performance analysis. To address these gaps, we introduce the Joint Whole-Body Talking Avatar and Speech Generation Version I(JWB-DH-V1), comprising a large-scale multi-modal dataset with 10,000 unique identities across 2 million video samples, and an evaluation protocol for assessing joint audio-video generation of whole-body animatable avatars. Our evaluation of SOTA models reveals consistent performance disparities between face/hand-centric and whole-body performance, which incidates essential areas for future research. The dataset and evaluation tools are publicly available at https://github.com/deepreasonings/WholeBodyBenchmark.
中文: 本研究推出了JWB-DH-V1数据集和评估框架,旨在解决全身动作与语音联合生成中的不足,并通过性能差异揭示了未来研究的关键方向。
English: The study introduces JWB-DH-V1, a comprehensive dataset and evaluation framework to address gaps in joint whole-body motion and speech generation, revealing performance disparities that highlight key research areas.
Authors:David Ye, Jan Williams, Mars Gao, Stefano Riva, Matteo Tomasetto, David Zoro, J. Nathan Kutz
Abstract:
SHallow REcurrent Decoders (SHRED) provide a deep learning strategy for modeling high-dimensional dynamical systems and/or spatiotemporal data from dynamical system snapshot observations. PySHRED is a Python package that implements SHRED and several of its major extensions, including for robust sensing, reduced order modeling and physics discovery. In this paper, we introduce the version 1.0 release of PySHRED, which includes data preprocessors and a number of cutting-edge SHRED methods specifically designed to handle real-world data that may be noisy, multi-scale, parameterized, prohibitively high-dimensional, and strongly nonlinear. The package is easy to install, thoroughly-documented, supplemented with extensive code examples, and modularly-structured to support future additions. The entire codebase is released under the MIT license and is available at https://github.com/pyshred-dev/pyshred.
中文: PySHRED是一个实现SHRED深度学习框架的Python软件包,用于建模高维动力系统,具备强大的数据处理能力和模块化设计,适用于实际应用场景。
English: PySHRED is a Python package implementing the SHRED deep learning framework for modeling high-dimensional dynamical systems, featuring robust data handling and modular design for real-world applications.
Authors:Likun Tan, Kuan-Wei Huang, Kevin Wu
Abstract:
Hallucinations in large language models pose a critical challenge for applications requiring factual reliability, particularly in high-stakes domains such as finance. This work presents an effective approach for detecting and editing factually incorrect content in model-generated responses based on the provided context. Given a user-defined domain-specific error taxonomy, we construct a synthetic dataset by inserting tagged errors into financial question-answering corpora and then fine-tune four language models, Phi-4, Phi-4-mini, Qwen3-4B, and Qwen3-14B, to detect and edit these factual inaccuracies. Our best-performing model, fine-tuned Phi-4, achieves an 8% improvement in binary F1 score and a 30% gain in overall detection performance compared to OpenAI-o3. Notably, our fine-tuned Phi-4-mini model, despite having only 4 billion parameters, maintains competitive performance with just a 2% drop in binary detection and a 0.1% decline in overall detection compared to OpenAI-o3. Our work provides a practical solution for detecting and editing factual inconsistencies in financial text generation while introducing a generalizable framework that can enhance the trustworthiness and alignment of large language models across diverse applications beyond finance. Our code and data are available at https://github.com/pegasi-ai/shield.
中文: 本研究提出一种基于合成金融数据集微调模型的方法,用于检测和修正大语言模型中的事实性错误,显著提升了性能,并为增强模型可信度提供了可推广的框架。
English: This study introduces a method to detect and edit factual inaccuracies in large language models by fine-tuning models like Phi-4 on a synthetic financial dataset, achieving significant performance gains and offering a generalizable framework for improving model reliability.
Authors:Hongzhi Zhang, Zhonglie Liu, Kun Meng, Jiameng Chen, Jia Wu, Bo Du, Di Lin, Yan Che, Wenbin Hu
Abstract:
Given the vastness of chemical space and the ongoing emergence of previously uncharacterized proteins, zero-shot compound-protein interaction (CPI) prediction better reflects the practical challenges and requirements of real-world drug development. Although existing methods perform adequately during certain CPI tasks, they still face the following challenges: (1) Representation learning from local or complete protein sequences often overlooks the complex interdependencies between subsequences, which are essential for predicting spatial structures and binding properties. (2) Dependence on large-scale or scarce multimodal protein datasets demands significant training data and computational resources, limiting scalability and efficiency. To address these challenges, we propose a novel approach that pretrains protein representations for CPI prediction tasks using subsequence reordering, explicitly capturing the dependencies between protein subsequences. Furthermore, we apply length-variable protein augmentation to ensure excellent pretraining performance on small training datasets. To evaluate the model's effectiveness and zero-shot learning ability, we combine it with various baseline methods. The results demonstrate that our approach can improve the baseline model's performance on the CPI task, especially in the challenging zero-shot scenario. Compared to existing pre-training models, our model demonstrates superior performance, particularly in data-scarce scenarios where training samples are limited. Our implementation is available at https://github.com/Hoch-Zhang/PSRP-CPI.
中文摘要:本研究提出了一种新颖的蛋白质表示预训练方法,通过子序列重排和长度可变增强技术来提升零样本化合物-蛋白质相互作用预测性能,在数据稀缺场景下表现尤为突出。
English Summary: This study introduces a novel protein representation pre-training method using subsequence reordering and length-variable augmentation to enhance zero-shot compound-protein interaction prediction, demonstrating superior performance especially in data-scarce scenarios.
Authors:Minh Hieu Ha, Hung Phan, Tung Duy Doan, Tung Dao, Dao Tran, Huynh Thi Thanh Binh
Abstract:
Multi-objective combinatorial optimization problems (MOCOP) frequently arise in practical applications that require the simultaneous optimization of conflicting objectives. Although traditional evolutionary algorithms can be effective, they typically depend on domain knowledge and repeated parameter tuning, limiting flexibility when applied to unseen MOCOP instances. Recently, integration of Large Language Models (LLMs) into evolutionary computation has opened new avenues for automatic heuristic generation, using their advanced language understanding and code synthesis capabilities. Nevertheless, most existing approaches predominantly focus on single-objective tasks, often neglecting key considerations such as runtime efficiency and heuristic diversity in multi-objective settings. To bridge this gap, we introduce Multi-heuristics for MOCOP via Pareto-Grid-guided Evolution of LLMs (MPaGE), a novel enhancement of the Simple Evolutionary Multiobjective Optimization (SEMO) framework that leverages LLMs and Pareto Front Grid (PFG) technique. By partitioning the objective space into grids and retaining top-performing candidates to guide heuristic generation, MPaGE utilizes LLMs to prioritize heuristics with semantically distinct logical structures during variation, thus promoting diversity and mitigating redundancy within the population. Through extensive evaluations, MPaGE demonstrates superior performance over existing LLM-based frameworks, and achieves competitive results to traditional Multi-objective evolutionary algorithms (MOEAs), with significantly faster runtime. Our code is available at: https://github.com/langkhachhoha/MPaGE.
中文摘要:MPaGE是一种新颖的基于大语言模型的多目标优化框架,通过帕累托网格引导进化自动生成多样化启发式策略,在保持竞争力的同时显著提升了运行效率。
English Summary: MPaGE is a novel LLM-enhanced evolutionary framework that uses Pareto-Grid guidance to automatically generate diverse heuristics for multi-objective combinatorial optimization, achieving competitive performance with significantly faster runtime than traditional methods.
Authors:Kai Ye, YingShi Luan, Zhudi Chen, Guangyue Meng, Pingyang Dai, Liujuan Cao
Abstract:
Referring Image Segmentation (RIS), which aims to segment specific objects based on natural language descriptions, plays an essential role in vision-language understanding. Despite its progress in remote sensing applications, RIS in Low-Altitude Drone (LAD) scenarios remains underexplored. Existing datasets and methods are typically designed for high-altitude and static-view imagery. They struggle to handle the unique characteristics of LAD views, such as diverse viewpoints and high object density. To fill this gap, we present RIS-LAD, the first fine-grained RIS benchmark tailored for LAD scenarios. This dataset comprises 13,871 carefully annotated image-text-mask triplets collected from realistic drone footage, with a focus on small, cluttered, and multi-viewpoint scenes. It highlights new challenges absent in previous benchmarks, such as category drift caused by tiny objects and object drift under crowded same-class objects. To tackle these issues, we propose the Semantic-Aware Adaptive Reasoning Network (SAARN). Rather than uniformly injecting all linguistic features, SAARN decomposes and routes semantic information to different stages of the network. Specifically, the Category-Dominated Linguistic Enhancement (CDLE) aligns visual features with object categories during early encoding, while the Adaptive Reasoning Fusion Module (ARFM) dynamically selects semantic cues across scales to improve reasoning in complex scenes. The experimental evaluation reveals that RIS-LAD presents substantial challenges to state-of-the-art RIS algorithms, and also demonstrates the effectiveness of our proposed model in addressing these challenges. The dataset and code will be publicly released soon at: https://github.com/AHideoKuzeA/RIS-LAD/.
中文: 本文提出了首个面向低空无人机场景的细粒度参考图像分割基准RIS-LAD,并设计了语义感知自适应推理网络SAARN,通过分层路由语义特征有效解决了微小物体和密集场景带来的新挑战。
English: This paper introduces RIS-LAD, the first fine-grained benchmark for referring image segmentation in low-altitude drone scenarios, and proposes SAARN, a semantic-aware adaptive reasoning network that dynamically routes linguistic features to effectively address challenges like tiny objects and crowded scenes.
Authors:Renhang Liu, Chia-Yu Hung, Navonil Majumder, Taylor Gautreaux, Amir Ali Bagherzadeh, Chuan Li, Dorien Herremans, Soujanya Poria
Abstract:
Diffusion and flow-matching models have revolutionized automatic text-to-audio generation in recent times. These models are increasingly capable of generating high quality and faithful audio outputs capturing to speech and acoustic events. However, there is still much room for improvement in creative audio generation that primarily involves music and songs. Recent open lyrics-to-song models, such as, DiffRhythm, ACE-Step, and LeVo, have set an acceptable standard in automatic song generation for recreational use. However, these models lack fine-grained word-level controllability often desired by musicians in their workflows. To the best of our knowledge, our flow-matching-based JAM is the first effort toward endowing word-level timing and duration control in song generation, allowing fine-grained vocal control. To enhance the quality of generated songs to better align with human preferences, we implement aesthetic alignment through Direct Preference Optimization, which iteratively refines the model using a synthetic dataset, eliminating the need or manual data annotations. Furthermore, we aim to standardize the evaluation of such lyrics-to-song models through our public evaluation dataset JAME. We show that JAM outperforms the existing models in terms of the music-specific attributes.
中文:JAM模型通过流匹配和美学对齐技术,在歌曲生成中实现了词级时序与时长控制,其公开评估数据集JAME建立了标准化评测体系,在音乐特定属性上超越了现有模型。
English: The JAM model introduces word-level timing and duration control in song generation using flow-matching and aesthetic alignment, outperforming existing models in music-specific attributes while standardizing evaluation with its public dataset JAME.
Authors:Ziling Wu, Armaghan Moemeni, Praminda Caleb-Solly
Abstract:
Unsupervised object discovery (UOD) aims to detect and segment objects in 2D images without handcrafted annotations. Recent progress in self-supervised representation learning has led to some success in UOD algorithms. However, the absence of ground truth provides existing UOD methods with two challenges: 1) determining if a discovered region is foreground or background, and 2) knowing how many objects remain undiscovered. To address these two problems, previous solutions rely on foreground priors to distinguish if the discovered region is foreground, and conduct one or fixed iterations of discovery. However, the existing foreground priors are heuristic and not always robust, and a fixed number of discoveries leads to under or over-segmentation, since the number of objects in images varies. This paper introduces UnionCut, a robust and well-grounded foreground prior based on min-cut and ensemble methods that detects the union of foreground areas of an image, allowing UOD algorithms to identify foreground objects and stop discovery once the majority of the foreground union in the image is segmented. In addition, we propose UnionSeg, a distilled transformer of UnionCut that outputs the foreground union more efficiently and accurately. Our experiments show that by combining with UnionCut or UnionSeg, previous state-of-the-art UOD methods witness an increase in the performance of single object discovery, saliency detection and self-supervised instance segmentation on various benchmarks. The code is available at https://github.com/YFaris/UnionCut.
中文摘要:本文提出UnionCut,一种基于最小割和集成方法的鲁棒前景先验,通过精确识别前景区域并优化发现迭代次数,有效解决无监督对象发现中的关键挑战,显著提升了多个基准测试的性能。
English Summary: This paper introduces UnionCut, a robust foreground prior using min-cut and ensemble methods to address challenges in unsupervised object discovery by accurately identifying foreground regions and optimizing discovery iterations, significantly enhancing performance across various benchmarks.
Authors:Yilun Qiu, Tianhao Shi, Xiaoyan Zhao, Fengbin Zhu, Yang Zhang, Fuli Feng
Abstract:
Large language models (LLMs) are increasingly integrated into users' daily lives, leading to a growing demand for personalized outputs. Previous work focuses on leveraging a user's own history, overlooking inter-user differences that are crucial for effective personalization. While recent work has attempted to model such differences, the reliance on language-based prompts often hampers the effective extraction of meaningful distinctions. To address these issues, we propose Difference-aware Embedding-based Personalization (DEP), a framework that models inter-user differences in the latent space instead of relying on language prompts. DEP constructs soft prompts by contrasting a user's embedding with those of peers who engaged with similar content, highlighting relative behavioral signals. A sparse autoencoder then filters and compresses both user-specific and difference-aware embeddings, preserving only task-relevant features before injecting them into a frozen LLM. Experiments on personalized review generation show that DEP consistently outperforms baseline methods across multiple metrics. Our code is available at https://github.com/SnowCharmQ/DEP.
中文摘要:DEP框架通过潜在空间中的对比嵌入和稀疏自编码器建模用户间差异,解决了大语言模型个性化中的现有局限,在个性化评论生成任务中表现优于基线方法。
English Summary: The DEP framework addresses limitations in personalizing large language models by modeling inter-user differences in latent space using contrastive embeddings and sparse autoencoders, achieving superior performance in personalized review generation.
Authors:Yuchen Liu, Yaoming Wang, Bowen Shi, Xiaopeng Zhang, Wenrui Dai, Chenglin Li, Hongkai Xiong, Qi Tian
Abstract:
Vision encoders serve as the cornerstone of multimodal understanding. Single-encoder architectures like CLIP exhibit inherent constraints in generalizing across diverse multimodal tasks, while recent multi-encoder fusion methods introduce prohibitive computational overhead to achieve superior performance using complementary visual representations from multiple vision encoders. To address this, we propose a progressive pruning framework, namely Multi-Encoder collaboraTivE tOken pRuning (METEOR), that eliminates redundant visual tokens across the encoding, fusion, and decoding stages for multi-encoder MLLMs. For multi-vision encoding, we discard redundant tokens within each encoder via a rank guided collaborative token assignment strategy. Subsequently, for multi-vision fusion, we combine the visual features from different encoders while reducing cross-encoder redundancy with cooperative pruning. Finally, we propose an adaptive token pruning method in the LLM decoding stage to further discard irrelevant tokens based on the text prompts with dynamically adjusting pruning ratios for specific task demands. To our best knowledge, this is the first successful attempt that achieves an efficient multi-encoder based vision language model with multi-stage pruning strategies. Extensive experiments on 11 benchmarks demonstrate the effectiveness of our proposed approach. Compared with EAGLE, a typical multi-encoder MLLMs, METEOR reduces 76% visual tokens with only 0.3% performance drop in average. The code is available at https://github.com/YuchenLiu98/METEOR.
中文摘要:提出的METEOR框架通过多阶段剪枝策略,在编码、融合和解码环节逐步消除冗余视觉标记,实现了高效的多编码器视觉语言模型,在减少76%视觉标记的同时仅造成0.3%的性能损失。
English Summary: The proposed METEOR framework progressively prunes redundant visual tokens across encoding, fusion, and decoding stages to create an efficient multi-encoder vision-language model that reduces 76% of visual tokens with minimal performance loss.
Authors:Jakob Snel, Seong Joon Oh
Abstract:
Hallucination, the generation of untruthful content, is one of the major concerns regarding foundational models. Detecting hallucinations at the token level is vital for real-time filtering and targeted correction, yet the variation of hallucination signals within token sequences is not fully understood. Leveraging the RAGTruth corpus with token-level annotations and reproduced logits, we analyse how these signals depend on a token's position within hallucinated spans, contributing to an improved understanding of token-level hallucination. Our results show that the first hallucinated token carries a stronger signal and is more detectable than conditional tokens. We release our analysis framework, along with code for logit reproduction and metric computation at https://github.com/jakobsnl/RAGTruth_Xtended.
中文: 大型语言模型常产生幻觉,利用RAGTruth语料库的研究发现,首个幻觉标记的检测率远高于后续标记,这一结构特性在不同模型中均保持一致。
English: Large Language Models often produce hallucinations, and a study using the RAGTruth corpus reveals that the first hallucinated token is significantly more detectable than subsequent ones, a pattern consistent across models.
Authors:Jakob Snel, Seong Joon Oh
Abstract:
Large Language Models (LLMs) hallucinate, and detecting these cases is key to ensuring trust. While many approaches address hallucination detection at the response or span level, recent work explores token-level detection, enabling more fine-grained intervention. However, the distribution of hallucination signal across sequences of hallucinated tokens remains unexplored. We leverage token-level annotations from the RAGTruth corpus and find that the first hallucinated token is far more detectable than later ones. This structural property holds across models, suggesting that first hallucination tokens play a key role in token-level hallucination detection. Our code is available at https://github.com/jakobsnl/RAGTruth_Xtended.
中文: 大型语言模型常产生幻觉,利用RAGTruth语料库的研究发现,首个幻觉标记的检测率远高于后续标记,这一结构特性在不同模型中均保持一致。
English: Large Language Models often produce hallucinations, and a study using the RAGTruth corpus reveals that the first hallucinated token is significantly more detectable than subsequent ones, a pattern consistent across models.
Authors:Chunshi Wang, Bin Zhao, Shuxue Ding
Abstract:
Building footprint extraction holds immense significance in remote sensing image analysis and has great value in urban planning, land use, environmental protection and disaster assessment. Despite the progress made by conventional and deep learning approaches in this field, they continue to encounter significant challenges. This paper introduces a novel plug-and-play attention module, Split Coordinate Attention (SCA), which ingeniously captures spatially remote interactions by employing two spatial range of pooling kernels, strategically encoding each channel along x and y planes, and separately performs a series of split operations for each feature group, thus enabling more efficient semantic feature extraction. By inserting into a 2D CNN to form an effective SCANet, our SCANet outperforms recent SOTA methods on the public Wuhan University (WHU) Building Dataset and Massachusetts Building Dataset in terms of various metrics. Particularly SCANet achieves the best IoU, 91.61% and 75.49% for the two datasets. Our code is available at https://github.com/AiEson/SCANet
Chinese: 本文提出了一种新颖的分裂坐标注意力(SCA)模块,通过有效捕捉空间交互和语义特征,在基准数据集上实现了最先进的建筑轮廓提取性能。
English: This paper introduces a novel Split Coordinate Attention (SCA) module that enhances building footprint extraction by efficiently capturing spatial interactions and semantic features, achieving state-of-the-art performance on benchmark datasets.
Authors:Yang Chen, Yufan Shen, Wenxuan Huang, Sheng Zhou, Qunshu Lin, Xinyu Cai, Zhi Yu, Jiajun Bu, Botian Shi, Yu Qiao
Abstract:
Multimodal Large Language Models (MLLMs) exhibit impressive performance across various visual tasks. Subsequent investigations into enhancing their visual reasoning abilities have significantly expanded their performance envelope. However, a critical bottleneck in the advancement of MLLMs toward deep visual reasoning is their heavy reliance on curated image-text supervision. To solve this problem, we introduce a novel framework, ``Reasoning-Rendering-Visual-Feedback'' (RRVF), that enables MLLMs to learn complex visual reasoning from only raw images. This framework builds on the ``Asymmetry of Verification'' principle, i.e., verifying the rendered output against the source image is substantially easier than performing deep visual reasoning to generate a faithful, structured representation such as code. We demonstrate that this relative ease provides an ideal reward signal for optimization via Reinforcement Learning (RL), thereby reducing reliance on image-text supervision. RRVF implements a closed-loop iterative process encompassing reasoning, rendering, and visual feedback components, enabling the model to perform complex reasoning, including self-correction through multi-turn interactions. This process is optimized end-to-end using the GRPO algorithm. Extensive evaluations are conducted on image-to-code generation across two diverse domains: data charts and web interfaces. The RRVF-trained model not only outperforms existing similarly sized open-source MLLMs and supervised fine-tuning baselines but also exhibits superior generalization. Notably, the model outperforms the more advanced MLLM used to generate visual feedback during training. Code is available at https://github.com/L-O-I/RRVF.
Chinese: RRVF框架通过推理、渲染和视觉反馈的闭环迭代过程,使多模态大语言模型能够直接从原始图像中学习复杂视觉推理,减少了对人工标注图像-文本数据的依赖,并在图像到代码生成任务中展现出卓越性能。
English: The RRVF framework enables Multimodal Large Language Models to learn complex visual reasoning directly from raw images by leveraging an iterative process of reasoning, rendering, and visual feedback, reducing reliance on curated image-text supervision and demonstrating superior performance in image-to-code generation tasks.
Authors:Kangcheng Bin, Chen Chen, Ting Hu, Jiahao Qi, Ping Zhong
Abstract:
Multimodal fusion has become a key enabler for UAV-based object detection, as each modality provides complementary cues for robust feature extraction. However, due to significant differences in resolution, field of view, and sensing characteristics across modalities, accurate registration is a prerequisite before fusion. Despite its importance, there is currently no publicly available benchmark specifically designed for multimodal registration in UAV-based aerial scenarios, which severely limits the development and evaluation of advanced registration methods under real-world conditions. To bridge this gap, we present ATR-UMMIM, the first benchmark dataset specifically tailored for multimodal image registration in UAV-based applications. This dataset includes 7,969 triplets of raw visible, infrared, and precisely registered visible images captured covers diverse scenarios including flight altitudes from 80m to 300m, camera angles from 0° to 75°, and all-day, all-year temporal variations under rich weather and illumination conditions. To ensure high registration quality, we design a semi-automated annotation pipeline to introduce reliable pixel-level ground truth to each triplet. In addition, each triplet is annotated with six imaging condition attributes, enabling benchmarking of registration robustness under real-world deployment settings. To further support downstream tasks, we provide object-level annotations on all registered images, covering 11 object categories with 77,753 visible and 78,409 infrared bounding boxes. We believe ATR-UMMIM will serve as a foundational benchmark for advancing multimodal registration, fusion, and perception in real-world UAV scenarios. The datatset can be download from https://github.com/supercpy/ATR-UMMIM
中文: ATR-UMMIM是首个专为无人机多模态图像配准设计的基准数据集,包含7,969组可见光与红外图像三元组,提供精确的像素级标注和多样化的真实场景条件,旨在推动配准与感知技术的发展。
English: The ATR-UMMIM dataset is the first benchmark specifically designed for multimodal image registration in UAV applications, providing 7,969 triplets of visible and infrared images with precise pixel-level annotations and diverse real-world conditions to advance registration and perception technologies.
Authors:Zhuoer Yin, Calvin Yeung, Tomohiro Suzuki, Ryota Tanaka, Keisuke Fujii
Abstract:
Recent transformer based approaches have demonstrated impressive performance in solving real-world 3D human pose estimation problems. Albeit these approaches achieve fruitful results on benchmark datasets, they tend to fall short of sports scenarios where human movements are more complicated than daily life actions, as being hindered by motion blur, occlusions, and domain shifts. Moreover, due to the fact that critical motions in a sports game often finish in moments of time (e.g., shooting), the ability to focus on momentary actions is becoming a crucial factor in sports analysis, where current methods appear to struggle with instantaneous scenarios. To overcome these limitations, we introduce KASportsFormer, a novel transformer based 3D pose estimation framework for sports that incorporates a kinematic anatomy-informed feature representation and integration module. In which the inherent kinematic motion information is extracted with the Bone Extractor (BoneExt) and Limb Fuser (LimbFus) modules and encoded in a multimodal manner. This improved the capability of comprehending sports poses in short videos. We evaluate our method through two representative sports scene datasets: SportsPose and WorldPose. Experimental results show that our proposed method achieves state-of-the-art results with MPJPE errors of 58.0mm and 34.3mm, respectively. Our code and models are available at: https://github.com/jw0r1n/KASportsFormer
中文: 近期基于Transformer的方法在常规三维人体姿态估计中表现出色,但在运动场景中因复杂动作和瞬时行为受限;为此提出的KASportsFormer通过融合运动学解剖特征,在体育数据集上取得了最优性能。
English: Recent transformer-based methods excel in general 3D human pose estimation but struggle with sports scenarios due to complex movements and momentary actions, prompting the development of KASportsFormer which integrates kinematic anatomy features to achieve state-of-the-art results on sports datasets.
Authors:Zeyu Huang, Wei Meng, Quan Liu, Kun Chen, Li Ma
Abstract:
Spiking neural networks offer low energy consumption due to their event-driven nature. Beyond binary spike outputs, their intrinsic floating-point dynamics merit greater attention. Neuronal threshold levels and reset modes critically determine spike count and timing. Hard reset cause information loss, while soft reset apply uniform treatment to neurons. To address these issues, we design an adaptive reset neuron that establishes relationships between inputs, outputs, and reset, while integrating a simple yet effective threshold adjustment strategy. Experimental results demonstrate that our method achieves excellent performance while maintaining lower energy consumption. In particular, it attains state-of-the-art accuracy on Tiny-ImageNet and CIFAR10-DVS. Codes are available at https://github.com/2ephyrus/AR-LIF.
Chinese: 该自适应重置神经元通过建立输入-输出-重置关系并整合阈值调节策略,解决了脉冲神经网络中的信息丢失和统一处理问题,在Tiny-ImageNet等数据集上以低能耗实现了最优精度。
English: The proposed adaptive reset neuron addresses information loss and uniform treatment issues in spiking neural networks by establishing input-output-reset relationships and incorporating a threshold adjustment strategy, achieving state-of-the-art accuracy on datasets like Tiny-ImageNet with low energy consumption.
Authors:Yue Zhu, Haiwen Diao, Shang Gao, Jiazuo Yu, Jiawen Zhu, Yunzhi Zhuge, Shuai Hao, Xu Jia, Lu Zhang, Ying Zhang, Huchuan Lu
Abstract:
Low-Rank Adaptation (LoRA) and its variants have delivered strong capability in Parameter-Efficient Transfer Learning (PETL) by minimizing trainable parameters and benefiting from reparameterization. However, their projection matrices remain unrestricted during training, causing high representation redundancy and diminishing the effectiveness of feature adaptation in the resulting subspaces. While existing methods mitigate this by manually adjusting the rank or implicitly applying channel-wise masks, they lack flexibility and generalize poorly across various datasets and architectures. Hence, we propose ReSoRA, a method that explicitly models redundancy between mapping subspaces and adaptively Regularizes Subspace redundancy of Low-Rank Adaptation. Specifically, it theoretically decomposes the low-rank submatrices into multiple equivalent subspaces and systematically applies de-redundancy constraints to the feature distributions across different projections. Extensive experiments validate that our proposed method consistently facilitates existing state-of-the-art PETL methods across various backbones and datasets in vision-language retrieval and standard visual classification benchmarks. Besides, as a training supervision, ReSoRA can be seamlessly integrated into existing approaches in a plug-and-play manner, with no additional inference costs. Code is publicly available at: https://github.com/Lucenova/ReSoRA.
中文:ReSoRA提出了一种自适应正则化方法,通过显式建模低秩适配投影子空间中的冗余并施加去冗余约束,在保持零推理成本的同时显著提升了跨视觉语言检索与分类任务的特征适应效能。
English: ReSoRA introduces an adaptive regularization method that explicitly reduces redundancy in Low-Rank Adaptation's projection subspaces, enhancing feature adaptation efficiency without increasing inference costs across diverse vision-language and classification tasks.
Authors:Andong Li, Tong Lei, Zhihang Sun, Rilin Chen, Erwei Yin, Xiaodong Li, Chengshi Zheng
Abstract:
Despite the rapid development of neural vocoders in recent years, they usually suffer from some intrinsic challenges like opaque modeling, and parameter-performance trade-off. In this study, we propose an innovative time-frequency (T-F) domain-based neural vocoder to resolve the above-mentioned challenges. To be specific, we bridge the connection between the classical signal range-null decomposition (RND) theory and vocoder task, and the reconstruction of target spectrogram can be decomposed into the superimposition between the range-space and null-space, where the former is enabled by a linear domain shift from the original mel-scale domain to the target linear-scale domain, and the latter is instantiated via a learnable network for further spectral detail generation. Accordingly, we propose a novel dual-path framework, where the spectrum is hierarchically encoded/decoded, and the cross- and narrow-band modules are elaborately devised for efficient sub-band and sequential modeling. Comprehensive experiments are conducted on the LJSpeech and LibriTTS benchmarks. Quantitative and qualitative results show that while enjoying lightweight network parameters, the proposed approach yields state-of-the-art performance among existing advanced methods. Our code and the pretrained model weights are available at https://github.com/Andong-Li-speech/RNDVoC.
中文摘要:本研究提出了一种创新的时频域神经声码器,通过结合经典信号分解理论与双路径框架,在保持轻量参数的同时实现了最先进的语音合成性能,有效解决了传统模型透明度低和参数性能难以兼顾的问题。
English Summary: This study introduces a novel time-frequency domain neural vocoder that overcomes traditional challenges like opaque modeling and parameter-performance trade-offs by integrating classical signal decomposition theory with a dual-path framework for efficient spectral reconstruction.
Authors:Haowen Li, Zhenfeng Fan, Zhang Wen, Zhengzhou Zhu, Yunjin Li
Abstract:
Image composition has advanced significantly with large-scale pre-trained T2I diffusion models. Despite progress in same-domain composition, cross-domain composition remains under-explored. The main challenges are the stochastic nature of diffusion models and the style gap between input images, leading to failures and artifacts. Additionally, heavy reliance on text prompts limits practical applications. This paper presents the first cross-domain image composition method that does not require text prompts, allowing natural stylization and seamless compositions. Our method is efficient and robust, preserving the diffusion prior, as it involves minor steps for backward inversion and forward denoising without training the diffuser. Our method also uses a simple multilayer perceptron network to integrate CLIP features from foreground and background, manipulating diffusion with a local cross-attention strategy. It effectively preserves foreground content while enabling stable stylization without a pre-stylization network. Finally, we create a benchmark dataset with diverse contents and styles for fair evaluation, addressing the lack of testing datasets for cross-domain image composition. Our method outperforms state-of-the-art techniques in both qualitative and quantitative evaluations, significantly improving the LPIPS score by 30.5% and the CSD metric by 18.1%. We believe our method will advance future research and applications. Code and benchmark at https://github.com/sherlhw/AIComposer.
中文摘要:本文提出了首个无需文本提示的跨域图像合成方法,通过多层感知器和局部交叉注意力高效融合前景与背景风格,在定性和定量评估中均显著优于现有技术,并创建了基准数据集以推动该领域研究。
English Summary: This paper introduces the first text-prompt-free cross-domain image composition method that efficiently integrates foreground and background styles using a multilayer perceptron and local cross-attention, achieving superior performance with significant metric improvements.
Authors:Anxiao Song, Shujie Cui, Jianli Bai, Ke Cheng, Yulong Shen, Giovanni Russello
Abstract:
In light of increasing privacy concerns and stringent legal regulations, using secure multiparty computation (MPC) to enable collaborative GBDT model training among multiple data owners has garnered significant attention. Despite this, existing MPC-based GBDT frameworks face efficiency challenges due to high communication costs and the computation burden of non-linear operations, such as division and sigmoid calculations. In this work, we introduce Guard-GBDT, an innovative framework tailored for efficient and privacy-preserving GBDT training on vertical datasets. Guard-GBDT bypasses MPC-unfriendly division and sigmoid functions by using more streamlined approximations and reduces communication overhead by compressing the messages exchanged during gradient aggregation. We implement a prototype of Guard-GBDT and extensively evaluate its performance and accuracy on various real-world datasets. The results show that Guard-GBDT outperforms state-of-the-art HEP-XGB (CIKM'21) and SiGBDT (ASIA CCS'24) by up to $2.71\times$ and $12.21 \times$ on LAN network and up to $2.7\times$ and $8.2\times$ on WAN network. Guard-GBDT also achieves comparable accuracy with SiGBDT and plaintext XGBoost (better than HEP-XGB ), which exhibits a deviation of $\pm1\%$ to $\pm2\%$ only. Our implementation code is provided at https://github.com/XidianNSS/Guard-GBDT.git.
中文: Guard-GBDT是一种高效且保护隐私的GBDT训练框架,通过简化近似替换MPC不友好操作并降低通信成本,在保持相近准确度的同时,相比现有方法实现了显著的性能提升。
English: Guard-GBDT is an efficient and privacy-preserving framework for GBDT training that replaces MPC-unfriendly operations with streamlined approximations and reduces communication costs, achieving significant speed improvements over existing methods while maintaining comparable accuracy.
Authors:Yuzhong Zhao, Yue Liu, Junpeng Liu, Jingye Chen, Xun Wu, Yaru Hao, Tengchao Lv, Shaohan Huang, Lei Cui, Qixiang Ye, Fang Wan, Furu Wei
Abstract:
Recent advancements, such as Group Relative Policy Optimization (GRPO), have enhanced the reasoning capabilities of large language models by optimizing the arithmetic mean of token-level rewards. However, GRPO suffers from unstable policy updates when processing tokens with outlier importance-weighted rewards, which manifests as extreme importance sampling ratios during training, i.e., the ratio between the sampling probabilities assigned to a token by the current and old policies. In this work, we propose Geometric-Mean Policy Optimization (GMPO), a stabilized variant of GRPO. Instead of optimizing the arithmetic mean, GMPO maximizes the geometric mean of token-level rewards, which is inherently less sensitive to outliers and maintains a more stable range of importance sampling ratio. In addition, we provide comprehensive theoretical and experimental analysis to justify the design and stability benefits of GMPO. Beyond improved stability, GMPO-7B outperforms GRPO by an average of 4.1% on multiple mathematical benchmarks and 1.4% on multimodal reasoning benchmark, including AIME24, AMC, MATH500, OlympiadBench, Minerva, and Geometry3K. Code is available at https://github.com/callsys/GMPO.
中文摘要:GMPO通过将GRPO中的算术均值替换为词级奖励的几何均值,有效抑制异常值影响,提升策略优化稳定性,并在数学推理基准测试中显著提高性能。
English Summary: GMPO enhances the stability of GRPO by replacing the arithmetic mean with the geometric mean of token-level rewards, which reduces sensitivity to outliers and improves performance on mathematical reasoning tasks.
Authors:Yuzhong Zhao, Yue Liu, Junpeng Liu, Jingye Chen, Xun Wu, Yaru Hao, Tengchao Lv, Shaohan Huang, Lei Cui, Qixiang Ye, Fang Wan, Furu Wei
Abstract:
Group Relative Policy Optimization (GRPO) has significantly enhanced the reasoning capability of large language models by optimizing the arithmetic mean of token-level rewards. Unfortunately, GRPO is observed to suffer from unstable policy updates when facing tokens with outlier importance-weighted rewards, which manifest as extreme importance sampling ratios during training. In this study, we propose Geometric-Mean Policy Optimization (GMPO), with the aim to improve the stability of GRPO through suppressing token reward outliers. Instead of optimizing the arithmetic mean, GMPO maximizes the geometric mean of token-level rewards, which is inherently less sensitive to outliers and maintains a more stable range of importance sampling ratio. GMPO is plug-and-play-simply replacing GRPO's arithmetic mean with the geometric mean of token-level rewards, as the latter is inherently less sensitive to outliers. GMPO is theoretically plausible-analysis reveals that both GMPO and GRPO are weighted forms of the policy gradient while the former enjoys more stable weights, which consequently benefits policy optimization and performance. Experiments on multiple mathematical reasoning benchmarks show that GMPO-7B improves the average Pass@1 of GRPO by up to 4.1%, outperforming many state-of-the-art approaches. Code is available at https://github.com/callsys/GMPO.
中文摘要:GMPO通过将GRPO中的算术均值替换为词级奖励的几何均值,有效抑制异常值影响,提升策略优化稳定性,并在数学推理基准测试中显著提高性能。
English Summary: GMPO enhances the stability of GRPO by replacing the arithmetic mean with the geometric mean of token-level rewards, which reduces sensitivity to outliers and improves performance on mathematical reasoning tasks.
Authors:Ao Li, Yuxiang Duan, Jinghui Zhang, Congbo Ma, Yutong Xie, Gustavo Carneiro, Mohammad Yaqub, Hu Wang
Abstract:
Large Vision-Language Models (LVLMs) have advanced multimodal learning but face high computational costs due to the large number of visual tokens, motivating token pruning to improve inference efficiency. The key challenge lies in identifying which tokens are truly important. Most existing approaches rely on attention-based criteria to estimate token importance. However, they inherently suffer from certain limitations, such as positional bias. In this work, we explore a new perspective on token importance based on token transitions in LVLMs. We observe that the transition of token representations provides a meaningful signal of semantic information. Based on this insight, we propose TransPrune, a training-free and efficient token pruning method. Specifically, TransPrune progressively prunes tokens by assessing their importance through a combination of Token Transition Variation (TTV)-which measures changes in both the magnitude and direction of token representations-and Instruction-Guided Attention (IGA), which measures how strongly the instruction attends to image tokens via attention. Extensive experiments demonstrate that TransPrune achieves comparable multimodal performance to original LVLMs, such as LLaVA-v1.5 and LLaVA-Next, across eight benchmarks, while reducing inference TFLOPs by more than half. Moreover, TTV alone can serve as an effective criterion without relying on attention, achieving performance comparable to attention-based methods. The code will be made publicly available upon acceptance of the paper at https://github.com/liaolea/TransPrune.
Chinese: TransPrune是一种无需训练的令牌剪枝方法,通过结合令牌转换变化和指令引导注意力来评估重要性,逐步剪除冗余令牌,在保持多模态性能的同时将推理计算量减半。
English: TransPrune is a training-free token pruning method that enhances inference efficiency in Large Vision-Language Models by progressively removing tokens based on Token Transition Variation and Instruction-Guided Attention, achieving comparable performance with over 50% reduction in computational costs.
Authors:Yanyin Guo, Runxuan An, Junwei Li, Zhiyuan Zhang
Abstract:
Traditional ship detection methods primarily rely on single-modal approaches, such as visible or infrared images, which limit their application in complex scenarios involving varying lighting conditions and heavy fog. To address this issue, we explore the advantages of short-wave infrared (SWIR) and long-wave infrared (LWIR) in ship detection and propose a novel single-stage image fusion detection algorithm called LSFDNet. This algorithm leverages feature interaction between the image fusion and object detection subtask networks, achieving remarkable detection performance and generating visually impressive fused images. To further improve the saliency of objects in the fused images and improve the performance of the downstream detection task, we introduce the Multi-Level Cross-Fusion (MLCF) module. This module combines object-sensitive fused features from the detection task and aggregates features across multiple modalities, scales, and tasks to obtain more semantically rich fused features. Moreover, we utilize the position prior from the detection task in the Object Enhancement (OE) loss function, further increasing the retention of object semantics in the fused images. The detection task also utilizes preliminary fused features from the fusion task to complement SWIR and LWIR features, thereby enhancing detection performance. Additionally, we have established a Nearshore Ship Long-Short Wave Registration (NSLSR) dataset to train effective SWIR and LWIR image fusion and detection networks, bridging a gap in this field. We validated the superiority of our proposed single-stage fusion detection algorithm on two datasets. The source code and dataset are available at https://github.com/Yanyin-Guo/LSFDNet
中文摘要:针对传统单模态船舶检测方法在复杂场景下的局限性,本研究提出LSFDNet单阶段融合检测算法,通过结合短波与长波红外成像、多级交叉融合模块和物体增强损失函数,在新建立的近岸船舶数据集上验证了其优越性能。
English Summary: To overcome the limitations of traditional single-modal ship detection methods in complex conditions, this study introduces LSFDNet, a novel single-stage fusion detection algorithm that integrates short-wave and long-wave infrared imaging with a multi-level cross-fusion module and object enhancement loss, achieving superior performance validated on a newly established dataset.
Authors:Duc-Tai Dinh, Duc Anh Khoa Dinh
Abstract:
We present ZSE-Cap (Zero-Shot Ensemble for Captioning), our 4th place system in Event-Enriched Image Analysis (EVENTA) shared task on article-grounded image retrieval and captioning. Our zero-shot approach requires no finetuning on the competition's data. For retrieval, we ensemble similarity scores from CLIP, SigLIP, and DINOv2. For captioning, we leverage a carefully engineered prompt to guide the Gemma 3 model, enabling it to link high-level events from the article to the visual content in the image. Our system achieved a final score of 0.42002, securing a top-4 position on the private test set, demonstrating the effectiveness of combining foundation models through ensembling and prompting. Our code is available at https://github.com/ductai05/ZSE-Cap.
中文: 我们的ZSE-Cap系统在EVENTA竞赛中获得第四名,通过集成CLIP、SigLIP和DINOv2进行图像检索,并采用精心设计的提示词驱动Gemma 3模型生成事件关联描述,在无需微调的情况下取得0.42002的最终得分。
English: Our ZSE-Cap system ranked 4th in the EVENTA competition by combining CLIP, SigLIP, and DINOv2 for image retrieval and using engineered prompts with Gemma 3 for event-aware captioning, achieving a 0.42002 score without task-specific fine-tuning.
Authors:Chieh-Yun Chen, Min Shi, Gong Zhang, Humphrey Shi
Abstract:
Text-to-Image (T2I) generative models have revolutionized content creation but remain highly sensitive to prompt phrasing, often requiring users to repeatedly refine prompts multiple times without clear feedback. While techniques such as automatic prompt engineering, controlled text embeddings, denoising, and multi-turn generation mitigate these issues, they offer limited controllability, or often necessitate additional training, restricting the generalization abilities. Thus, we introduce T2I-Copilot, a training-free multi-agent system that leverages collaboration between (Multimodal) Large Language Models to automate prompt phrasing, model selection, and iterative refinement. This approach significantly simplifies prompt engineering while enhancing generation quality and text-image alignment compared to direct generation. Specifically, T2I-Copilot consists of three agents: (1) Input Interpreter, which parses the input prompt, resolves ambiguities, and generates a standardized report; (2) Generation Engine, which selects the appropriate model from different types of T2I models and organizes visual and textual prompts to initiate generation; and (3) Quality Evaluator, which assesses aesthetic quality and text-image alignment, providing scores and feedback for potential regeneration. T2I-Copilot can operate fully autonomously while also supporting human-in-the-loop intervention for fine-grained control. On GenAI-Bench, using open-source generation models, T2I-Copilot achieves a VQA score comparable to commercial models RecraftV3 and Imagen 3, surpasses FLUX1.1-pro by 6.17% at only 16.59% of its cost, and outperforms FLUX.1-dev and SD 3.5 Large by 9.11% and 6.36%. Code will be released at: https://github.com/SHI-Labs/T2I-Copilot.
中文: T2I-Copilot是一种免训练的多智能体系统,通过多模态大语言模型协作自动优化提示词和选择模型,在显著提升生成质量与图文对齐度的同时大幅降低成本。
English: T2I-Copilot is a training-free multi-agent system that automates prompt engineering and model selection through collaboration between multimodal large language models, significantly improving generation quality and text-image alignment while reducing costs.
Authors:Yili Li, Gang Xiong, Gaopeng Gou, Xiangyan Qu, Jiamin Zhuang, Zhen Li, Junzheng Shi
Abstract:
Text-to-video retrieval essentially aims to train models to align visual content with textual descriptions accurately. Due to the impressive general multimodal knowledge demonstrated by image-text pretrained models such as CLIP, existing work has primarily focused on extending CLIP knowledge for video-text tasks. However, videos typically contain richer information than images. In current video-text datasets, textual descriptions can only reflect a portion of the video content, leading to partial misalignment in video-text matching. Therefore, directly aligning text representations with video representations can result in incorrect supervision, ignoring the inequivalence of information. In this work, we propose T2VParser to extract multiview semantic representations from text and video, achieving adaptive semantic alignment rather than aligning the entire representation. To extract corresponding representations from different modalities, we introduce Adaptive Decomposition Tokens, which consist of a set of learnable tokens shared across modalities. The goal of T2VParser is to emphasize precise alignment between text and video while retaining the knowledge of pretrained models. Experimental results demonstrate that T2VParser achieves accurate partial alignment through effective cross-modal content decomposition. The code is available at https://github.com/Lilidamowang/T2VParser.
中文: 本研究提出T2VParser方法,通过提取多视角语义表征并采用自适应分解标记,在保持预训练模型知识的同时实现视频与文本的精准局部对齐,有效解决跨模态检索中的信息不对等问题。
English: This study introduces T2VParser, which addresses partial misalignment in video-text retrieval by extracting multiview semantic representations and using Adaptive Decomposition Tokens for precise cross-modal alignment while preserving pretrained model knowledge.
Authors:Binxiong Li, Yuefei Wang, Binyu Zhao, Heyang Gao, Benhan Yang, Quanzhou Luo, Xue Li, Xu Xiang, Yujie Liu, Huijie Tang
Abstract:
This study introduces the Multi-Scale Weight-Based Pairwise Coarsening and Contrastive Learning (MPCCL) model, a novel approach for attributed graph clustering that effectively bridges critical gaps in existing methods, including long-range dependency, feature collapse, and information loss. Traditional methods often struggle to capture high-order graph features due to their reliance on low-order attribute information, while contrastive learning techniques face limitations in feature diversity by overemphasizing local neighborhood structures. Similarly, conventional graph coarsening methods, though reducing graph scale, frequently lose fine-grained structural details. MPCCL addresses these challenges through an innovative multi-scale coarsening strategy, which progressively condenses the graph while prioritizing the merging of key edges based on global node similarity to preserve essential structural information. It further introduces a one-to-many contrastive learning paradigm, integrating node embeddings with augmented graph views and cluster centroids to enhance feature diversity, while mitigating feature masking issues caused by the accumulation of high-frequency node weights during multi-scale coarsening. By incorporating a graph reconstruction loss and KL divergence into its self-supervised learning framework, MPCCL ensures cross-scale consistency of node representations. Experimental evaluations reveal that MPCCL achieves a significant improvement in clustering performance, including a remarkable 15.24% increase in NMI on the ACM dataset and notable robust gains on smaller-scale datasets such as Citeseer, Cora and DBLP.
中文: MPCCL模型通过多尺度粗化和一对多对比学习解决了图聚类中的关键难题,在多个数据集上实现了显著的性能提升。
English: The MPCCL model introduces multi-scale coarsening and one-to-many contrastive learning to overcome limitations in graph clustering, achieving significant performance improvements across multiple datasets.
Authors:Liu Zhang, Oscar Mickelin, Sheng Xu, Amit Singer
Abstract:
Since Pearson [Philosophical Transactions of the Royal Society of London. A, 185 (1894), pp. 71-110] first applied the method of moments (MM) for modeling data as a mixture of one-dimensional Gaussians, moment-based estimation methods have proliferated. Among these methods, the generalized method of moments (GMM) improves the statistical efficiency of MM by weighting the moments appropriately. However, the computational complexity and storage complexity of MM and GMM grow exponentially with the dimension, making these methods impractical for high-dimensional data or when higher-order moments are required. Such computational bottlenecks are more severe in GMM since it additionally requires estimating a large weighting matrix. To overcome these bottlenecks, we propose the diagonally-weighted GMM (DGMM), which achieves a balance among statistical efficiency, computational complexity, and numerical stability. We apply DGMM to study the parameter estimation problem for weakly separated heteroscedastic low-rank Gaussian mixtures and design a computationally efficient and numerically stable algorithm that obtains the DGMM estimator without explicitly computing or storing the moment tensors. We implement the proposed algorithm and empirically validate the advantages of DGMM: in numerical studies, DGMM attains smaller estimation errors while requiring substantially shorter runtime than MM and GMM. The code and data will be available upon publication at https://github.com/liu-lzhang/dgmm.
Chinese: 自皮尔逊提出高斯混合模型的矩方法以来,其计算复杂度随数据维度指数增长,为此我们提出对角加权广义矩估计方法(DGMM),在保证统计效率的同时显著提升计算速度并降低估计误差。
English: Since Pearson introduced the method of moments for Gaussian mixtures, its computational demands have grown exponentially with data dimensions, leading to the proposal of diagonally-weighted GMM (DGMM), which balances efficiency and stability while reducing runtime and estimation errors.
Authors:Camilo Tamayo-Rousseau, Yunjia Zhao, Yiqun Zhang, Randall Balestriero
Abstract:
Self-attention mechanisms are foundational to Transformer architectures, supporting their impressive success in a wide range of tasks. While there are many self-attention variants, their robustness to noise and spurious correlations has not been well studied. This study evaluates Softmax, Sigmoid, Linear, Doubly Stochastic, and Cosine attention within Vision Transformers under different data corruption scenarios. Through testing across the CIFAR-10, CIFAR-100, and Imagenette datasets, we show that Doubly Stochastic attention is the most robust. It consistently outperformed the next best mechanism by $0.1\%-5.1\%$ when training data, or both training and testing data, were corrupted. Our findings inform self-attention selection in contexts with imperfect data. The code used is available at https://github.com/ctamayor/NeurIPS-Robustness-ViT.
Chinese: 研究表明,在多种数据集的不同数据损坏场景下,双随机注意力机制是视觉变换器中最稳健的自注意力方法,始终以0.1%-5.1%的优势优于其他机制。
English: This study demonstrates that Doubly Stochastic attention is the most robust self-attention mechanism in Vision Transformers, consistently outperforming others by 0.1%-5.1% under various data corruption scenarios across multiple datasets.
Authors:Alexandru Brateanu, Raul Balmez, Ciprian Orhei, Codruta Ancuti, Cosmin Ancuti
Abstract:
Low-light image enhancement (LLIE) is a fundamental yet challenging task due to the presence of noise, loss of detail, and poor contrast in images captured under insufficient lighting conditions. Recent methods often rely solely on pixel-level transformations of RGB images, neglecting the rich contextual information available from multiple visual modalities. In this paper, we present ModalFormer, the first large-scale multimodal framework for LLIE that fully exploits nine auxiliary modalities to achieve state-of-the-art performance. Our model comprises two main components: a Cross-modal Transformer (CM-T) designed to restore corrupted images while seamlessly integrating multimodal information, and multiple auxiliary subnetworks dedicated to multimodal feature reconstruction. Central to the CM-T is our novel Cross-modal Multi-headed Self-Attention mechanism (CM-MSA), which effectively fuses RGB data with modality-specific features--including deep feature embeddings, segmentation information, geometric cues, and color information--to generate information-rich hybrid attention maps. Extensive experiments on multiple benchmark datasets demonstrate ModalFormer's state-of-the-art performance in LLIE. Pre-trained models and results are made available at https://github.com/albrateanu/ModalFormer.
中文: ModalFormer是首个用于低光照图像增强的大规模多模态框架,通过跨模态转换器和多模态特征融合机制整合九种辅助模态,将RGB数据与分割、几何等上下文特征有效结合,实现了最先进的增强性能。
English: ModalFormer is a pioneering multimodal framework for low-light image enhancement that integrates nine auxiliary modalities through a Cross-modal Transformer and specialized attention mechanisms, achieving state-of-the-art results by effectively combining RGB data with contextual features like segmentation and geometry.
Authors:Hengyu Liu, Tianyi Li, Yuqiang He, Kristian Torp, Yushuai Li, Christian S. Jensen
Abstract:
Location-tracking data from the Automatic Identification System, much of which is publicly available, plays a key role in a range of maritime safety and monitoring applications. However, the data suffers from missing values that hamper downstream applications. Imputing the missing values is challenging because the values of different heterogeneous attributes are updated at diverse rates, resulting in the occurrence of multi-scale dependencies among attributes. Existing imputation methods that assume similar update rates across attributes are unable to capture and exploit such dependencies, limiting their imputation accuracy. We propose MH-GIN, a Multi-scale Heterogeneous Graph-based Imputation Network that aims improve imputation accuracy by capturing multi-scale dependencies. Specifically, MH-GIN first extracts multi-scale temporal features for each attribute while preserving their intrinsic heterogeneous characteristics. Then, it constructs a multi-scale heterogeneous graph to explicitly model dependencies between heterogeneous attributes to enable more accurate imputation of missing values through graph propagation. Experimental results on two real-world datasets find that MH-GIN is capable of an average 57% reduction in imputation errors compared to state-of-the-art methods, while maintaining computational efficiency. The source code and implementation details of MH-GIN are publicly available https://github.com/hyLiu1994/MH-GIN.
中文摘要:MH-GIN是一种基于多尺度异构图的新型网络,通过捕捉不同更新频率属性间的依赖关系,显著提升了海上位置数据缺失值填补的准确性,在保持计算效率的同时比现有方法降低57%的误差。
English Summary: MH-GIN is a novel multi-scale heterogeneous graph-based network that significantly improves maritime location data imputation accuracy by capturing dependencies between attributes with different update rates, achieving 57% lower errors than existing methods while remaining computationally efficient.
Authors:Zheng Wei, Hongtao Wu, Lvmin Zhang, Xian Xu, Yefeng Zheng, Pan Hui, Maneesh Agrawala, Huamin Qu, Anyi Rao
Abstract:
Effective communication between directors and cinematographers is fundamental in film production, yet traditional approaches relying on visual references and hand-drawn storyboards often lack the efficiency and precision necessary during pre-production. We present CineVision, an AI-driven platform that integrates scriptwriting with real-time visual pre-visualization to bridge this communication gap. By offering dynamic lighting control, style emulation based on renowned filmmakers, and customizable character design, CineVision enables directors to convey their creative vision with heightened clarity and rapidly iterate on scene composition. In a 24-participant lab study, CineVision yielded shorter task times and higher usability ratings than two baseline methods, suggesting a potential to ease early-stage communication and accelerate storyboard drafts under controlled conditions. These findings underscore CineVision's potential to streamline pre-production processes and foster deeper creative synergy among filmmaking teams, particularly for new collaborators. Our code and demo are available at https://github.com/TonyHongtaoWu/CineVision.
中文: CineVision 是一个人工智能驱动的平台,通过将剧本创作与实时可视化预览相结合,提供动态灯光控制和风格模拟功能,有效改善导演与摄影师之间的沟通,提升前期制作效率和创意协作。
English: CineVision is an AI-driven platform that enhances director-cinematographer communication by integrating scriptwriting with real-time visual pre-visualization, featuring dynamic lighting and style emulation to streamline pre-production and improve creative synergy.
Authors:Peng Liu, Bianca Güttner, Yutong Su, Chenyang Li, Jinjing Xu, Mingyang Liu, Zhe Min, Andrey Zhylka, Jasper Smit, Karin Olthof, Matteo Fusaglia, Rudi Apolle, Matthias Miederer, Laura Frohneberger, Carina Riediger, Jügen Weitz, Fiona Kolbinger, Stefanie Speidel, Micha Pfeiffer
Abstract:
Non-rigid registration is essential for Augmented Reality guided laparoscopic liver surgery by fusing preoperative information, such as tumor location and vascular structures, into the limited intraoperative view, thereby enhancing surgical navigation. A prerequisite is the accurate prediction of intraoperative liver deformation which remains highly challenging due to factors such as large deformation caused by pneumoperitoneum, respiration and tool interaction as well as noisy intraoperative data, and limited field of view due to occlusion and constrained camera movement. To address these challenges, we introduce PIVOTS, a Preoperative to Intraoperative VOlume-To-Surface registration neural network that directly takes point clouds as input for deformation prediction. The geometric feature extraction encoder allows multi-resolution feature extraction, and the decoder, comprising novel deformation aware cross attention modules, enables pre- and intraoperative information interaction and accurate multi-level displacement prediction. We train the neural network on synthetic data simulated from a biomechanical simulation pipeline and validate its performance on both synthetic and real datasets. Results demonstrate superior registration performance of our method compared to baseline methods, exhibiting strong robustness against high amounts of noise, large deformation, and various levels of intraoperative visibility. We publish the training and test sets as evaluation benchmarks and call for a fair comparison of liver registration methods with volume-to-surface data. Code and datasets are available here https://github.com/pengliu-nct/PIVOTS.
中文: PIVOTS是一种用于非刚性配准的神经网络,通过点云预测肝脏变形,在腹腔镜手术中对噪声和大变形表现出强大的鲁棒性。
English: PIVOTS is a neural network for non-rigid registration that predicts liver deformation using point clouds, achieving robust performance against noise and large deformations in laparoscopic surgery.
Authors:Stepan Dergachev, Konstantin Yakovlev
Abstract:
Decentralized multi-agent navigation under uncertainty is a complex task that arises in numerous robotic applications. It requires collision avoidance strategies that account for both kinematic constraints, sensing and action execution noise. In this paper, we propose a novel approach that integrates the Model Predictive Path Integral (MPPI) with a probabilistic adaptation of Optimal Reciprocal Collision Avoidance. Our method ensures safe and efficient multi-agent navigation by incorporating probabilistic safety constraints directly into the MPPI sampling process via a Second-Order Cone Programming formulation. This approach enables agents to operate independently using local noisy observations while maintaining safety guarantees. We validate our algorithm through extensive simulations with differential-drive robots and benchmark it against state-of-the-art methods, including ORCA-DD and B-UAVC. Results demonstrate that our approach outperforms them while achieving high success rates, even in densely populated environments. Additionally, validation in the Gazebo simulator confirms its practical applicability to robotic platforms. A source code is available at http://github.com/PathPlanning/MPPI-Collision-Avoidance.
Chinese: 本文提出了一种将模型预测路径积分与概率优化互惠避障相结合的新方法,通过概率安全约束在不确定条件下实现安全高效的多智能体导航,并在仿真中验证了其优于现有方法的性能。
English: This paper introduces a novel method combining Model Predictive Path Integral with a probabilistic version of Optimal Reciprocal Collision Avoidance, ensuring safe and efficient multi-agent navigation under uncertainty through probabilistic safety constraints and outperforming existing approaches in simulations.
Authors:Qiaosi Yi, Shuai Li, Rongyuan Wu, Lingchen Sun, Yuhui Wu, Lei Zhang
Abstract:
Impressive results on real-world image super-resolution (Real-ISR) have been achieved by employing pre-trained stable diffusion (SD) models. However, one critical issue of such methods lies in their poor reconstruction of image fine structures, such as small characters and textures, due to the aggressive resolution reduction of the VAE (eg., 8$\times$ downsampling) in the SD model. One solution is to employ a VAE with a lower downsampling rate for diffusion; however, adapting its latent features with the pre-trained UNet while mitigating the increased computational cost poses new challenges. To address these issues, we propose a Transfer VAE Training (TVT) strategy to transfer the 8$\times$ downsampled VAE into a 4$\times$ one while adapting to the pre-trained UNet. Specifically, we first train a 4$\times$ decoder based on the output features of the original VAE encoder, then train a 4$\times$ encoder while keeping the newly trained decoder fixed. Such a TVT strategy aligns the new encoder-decoder pair with the original VAE latent space while enhancing image fine details. Additionally, we introduce a compact VAE and compute-efficient UNet by optimizing their network architectures, reducing the computational cost while capturing high-resolution fine-scale features. Experimental results demonstrate that our TVT method significantly improves fine-structure preservation, which is often compromised by other SD-based methods, while requiring fewer FLOPs than state-of-the-art one-step diffusion models. The official code can be found at https://github.com/Joyies/TVT.
Chinese: 提出的转移VAE训练(TVT)策略通过将8倍下采样VAE转换为4倍版本并优化网络架构,有效提升了真实图像超分辨率中精细结构的重建能力,同时降低了计算成本。
English: The proposed Transfer VAE Training (TVT) strategy enhances fine-structure preservation in real-world image super-resolution by converting an 8× downsampling VAE to a 4× version and optimizing network architectures to reduce computational costs.
Authors:Dingkun Liu, Zhu Chen, Jingwei Luo, Shijie Lian, Dongrui Wu
Abstract:
Brain-computer interfaces (BCIs) enable direct communication between the brain and external devices. Recent EEG foundation models aim to learn generalized representations across diverse BCI paradigms. However, these approaches overlook fundamental paradigm-specific neurophysiological distinctions, limiting their generalization ability. Importantly, in practical BCI deployments, the specific paradigm such as motor imagery (MI) for stroke rehabilitation or assistive robotics, is generally determined prior to data acquisition. This paper proposes MIRepNet, the first EEG foundation model tailored for the MI paradigm. MIRepNet comprises a high-quality EEG preprocessing pipeline incorporating a neurophysiologically-informed channel template, adaptable to EEG headsets with arbitrary electrode configurations. Furthermore, we introduce a hybrid pretraining strategy that combines self-supervised masked token reconstruction and supervised MI classification, facilitating rapid adaptation and accurate decoding on novel downstream MI tasks with fewer than 30 trials per class. Extensive evaluations across five public MI datasets demonstrated that MIRepNet consistently achieved state-of-the-art performance, significantly outperforming both specialized and generalized EEG models. Our code will be available on GitHub\footnote{https://github.com/staraink/MIRepNet}.
Chinese: 本文提出了首个专为运动想象范式设计的EEG基础模型MIRepNet,它采用神经生理学启发的预处理流程和混合预训练策略,在多个数据集上以极少数据需求实现了最先进的性能表现。
English: This paper introduces MIRepNet, the first EEG foundation model specifically designed for motor imagery paradigms, featuring a neurophysiologically-informed preprocessing pipeline and hybrid pretraining that achieves state-of-the-art performance across multiple datasets with minimal data requirements.
Authors:Lang Yu, Zhangyang Gao, Cheng Tan, Qin Chen, Jie Zhou, Liang He
Abstract:
SE(3)-based generative models have shown great promise in protein geometry modeling and effective structure design. However, the field currently lacks a modularized benchmark to enable comprehensive investigation and fair comparison of different methods. In this paper, we propose Protein-SE(3), a new benchmark based on a unified training framework, which comprises protein scaffolding tasks, integrated generative models, high-level mathematical abstraction, and diverse evaluation metrics. Recent advanced generative models designed for protein scaffolding, from multiple perspectives like DDPM (Genie1 and Genie2), Score Matching (FrameDiff and RfDiffusion) and Flow Matching (FoldFlow and FrameFlow) are integrated into our framework. All integrated methods are fairly investigated with the same training dataset and evaluation metrics. Furthermore, we provide a high-level abstraction of the mathematical foundations behind the generative models, enabling fast prototyping of future algorithms without reliance on explicit protein structures. Accordingly, we release the first comprehensive benchmark built upon unified training framework for SE(3)-based protein structure design, which is publicly accessible at https://github.com/BruthYU/protein-se3.
中文:本文提出了Protein-SE(3)基准,为基于SE(3)的蛋白质结构生成模型建立了统一训练框架下的模块化评估体系,整合了多种先进方法并提供了高层数学抽象以支持未来算法快速开发。
English: The paper introduces Protein-SE(3), a modular benchmark for fair comparison of SE(3)-based generative models in protein structure design, integrating diverse methods and providing mathematical abstraction for future algorithm development.
Authors:Ruizi Yang, Xiaolu Liu, Junbo Chen, Jianke Zhu
Abstract:
High-definition (HD) maps are essential for autonomous driving, as they provide precise road information for downstream tasks. Recent advances highlight the potential of temporal modeling in addressing challenges like occlusions and extended perception range. However, existing methods either fail to fully exploit temporal information or incur substantial computational overhead in handling extended sequences. To tackle these challenges, we propose MambaMap, a novel framework that efficiently fuses long-range temporal features in the state space to construct online vectorized HD maps. Specifically, MambaMap incorporates a memory bank to store and utilize information from historical frames, dynamically updating BEV features and instance queries to improve robustness against noise and occlusions. Moreover, we introduce a gating mechanism in the state space, selectively integrating dependencies of map elements in high computational efficiency. In addition, we design innovative multi-directional and spatial-temporal scanning strategies to enhance feature extraction at both BEV and instance levels. These strategies significantly boost the prediction accuracy of our approach while ensuring robust temporal consistency. Extensive experiments on the nuScenes and Argoverse2 datasets demonstrate that our proposed MambaMap approach outperforms state-of-the-art methods across various splits and perception ranges. Source code will be available at https://github.com/ZiziAmy/MambaMap.
中文摘要:MambaMap是一种创新框架,通过内存库和门控机制高效融合长时序特征来构建稳健的在线矢量化高精地图,在精度和一致性方面均优于现有最优方法。
English Summary: MambaMap is a novel framework that efficiently fuses long-range temporal features using a memory bank and gating mechanism to construct robust online vectorized HD maps, outperforming state-of-the-art methods in accuracy and consistency.
Authors:Minh Hoang Nguyen, Thuat Thien Nguyen, Minh Nhat Ta
Abstract:
News recommendation systems play a vital role in mitigating information overload by delivering personalized news content. A central challenge is to effectively model both multi-view news representations and the dynamic nature of user interests, which often span both short- and long-term preferences. Existing methods typically rely on single-view features of news articles (e.g., titles or categories) or fail to comprehensively capture user preferences across time scales. In this work, we propose Co-NAML-LSTUR, a hybrid news recommendation framework that integrates NAML for attentive multi-view news modeling and LSTUR for capturing both long- and short-term user representations. Our model also incorporates BERT-based word embeddings to enhance semantic feature extraction. We evaluate Co-NAML-LSTUR on two widely used benchmarks, MIND-small and MIND-large. Experimental results show that Co-NAML-LSTUR achieves substantial improvements over most state-of-the-art baselines on MIND-small and MIND-large, respectively. These results demonstrate the effectiveness of combining multi-view news representations with dual-scale user modeling. The implementation of our model is publicly available at https://github.com/MinhNguyenDS/Co-NAML-LSTUR.
Chinese: 本文提出Co-NAML-LSTUR混合新闻推荐框架,通过结合多视角新闻编码与分层用户建模来捕捉用户短期和长期兴趣,在基准测试中显著优于现有基线模型,同时针对有限数据资源进行了优化设计。
English: This paper introduces Co-NAML-LSTUR, a hybrid news recommendation framework that integrates multi-view news encoding with hierarchical user modeling to capture both short- and long-term preferences, demonstrating superior performance over existing baselines on benchmark datasets while being optimized for limited data resources.
Authors:Minh Hoang Nguyen, Thuat Thien Nguyen, Minh Nhat Ta, Tung Le, Huy Tien Nguyen
Abstract:
News recommendation systems play a critical role in alleviating information overload by delivering personalized content. A key challenge lies in jointly modeling multi-view representations of news articles and capturing the dynamic, dual-scale nature of user interests-encompassing both short- and long-term preferences. Prior methods often rely on single-view features or insufficiently model user behavior across time. In this work, we introduce Co-NAML-LSTUR, a hybrid news recommendation framework that integrates NAML for attentive multi-view news encoding and LSTUR for hierarchical user modeling, designed for training on limited data resources. Our approach leverages BERT-based embeddings to enhance semantic representation. We evaluate Co-NAML-LSTUR on two widely used benchmarks, MIND-small and MIND-large. Results show that our model significantly outperforms strong baselines, achieving improvements over NRMS by 1.55% in AUC and 1.15% in MRR, and over NAML by 2.45% in AUC and 1.71% in MRR. These findings highlight the effectiveness of our efficiency-focused hybrid model, which combines multi-view news modeling with dual-scale user representations for practical, resource-limited resources rather than a claim to absolute state-of-the-art (SOTA). The implementation of our model is publicly available at https://github.com/MinhNguyenDS/Co-NAML-LSTUR
Chinese: 本文提出Co-NAML-LSTUR混合新闻推荐框架,通过结合多视角新闻编码与分层用户建模来捕捉用户短期和长期兴趣,在基准测试中显著优于现有基线模型,同时针对有限数据资源进行了优化设计。
English: This paper introduces Co-NAML-LSTUR, a hybrid news recommendation framework that integrates multi-view news encoding with hierarchical user modeling to capture both short- and long-term preferences, demonstrating superior performance over existing baselines on benchmark datasets while being optimized for limited data resources.
Authors:Kele Shao, Keda Tao, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, Huan Wang
Abstract:
Multimodal large language models (MLLMs) have made remarkable strides, largely driven by their ability to process increasingly long and complex contexts, such as high-resolution images, extended video sequences, and lengthy audio input. While this ability significantly enhances MLLM capabilities, it introduces substantial computational challenges, primarily due to the quadratic complexity of self-attention mechanisms with numerous input tokens. To mitigate these bottlenecks, token compression has emerged as an auspicious and critical approach, efficiently reducing the number of tokens during both training and inference. In this paper, we present the first systematic survey and synthesis of the burgeoning field of multimodal long context token compression. Recognizing that effective compression strategies are deeply tied to the unique characteristics and redundancies of each modality, we categorize existing approaches by their primary data focus, enabling researchers to quickly access and learn methods tailored to their specific area of interest: (1) image-centric compression, which addresses spatial redundancy in visual data; (2) video-centric compression, which tackles spatio-temporal redundancy in dynamic sequences; and (3) audio-centric compression, which handles temporal and spectral redundancy in acoustic signals. Beyond this modality-driven categorization, we further dissect methods based on their underlying mechanisms, including transformation-based, similarity-based, attention-based, and query-based approaches. By providing a comprehensive and structured overview, this survey aims to consolidate current progress, identify key challenges, and inspire future research directions in this rapidly evolving domain. We also maintain a public repository to continuously track and update the latest advances in this promising area.
中文: 多模态大语言模型处理长上下文时面临计算瓶颈,催生了按模态和机制分类的令牌压缩方法,旨在提升效率并指引未来研究方向。
English: Multimodal large language models face computational challenges from long contexts, prompting token compression methods categorized by modality and mechanism to enhance efficiency and guide future research.
Authors:Li Jinfu, Song Hong, Xia Jianghan, Lin Yucong, Wang Ting, Shao Long, Fan Jingfan, Yang Jian
Abstract:
While illumination changes inevitably affect the quality of infrared and visible image fusion, many outstanding methods still ignore this factor and directly merge the information from source images, leading to modality bias in the fused results. To this end, we propose a dynamic multi-level image fusion network called MoCTEFuse, which applies an illumination-gated Mixture of Chiral Transformer Experts (MoCTE) to adaptively preserve texture details and object contrasts in balance. MoCTE consists of high- and low-illumination expert subnetworks, each built upon the Chiral Transformer Fusion Block (CTFB). Guided by the illumination gating signals, CTFB dynamically switches between the primary and auxiliary modalities as well as assigning them corresponding weights with its asymmetric cross-attention mechanism. Meanwhile, it is stacked at multiple stages to progressively aggregate and refine modality-specific and cross-modality information. To facilitate robust training, we propose a competitive loss function that integrates illumination distributions with three levels of sub-loss terms. Extensive experiments conducted on the DroneVehicle, MSRS, TNO and RoadScene datasets show MoCTEFuse's superior fusion performance. Finally, it achieves the best detection mean Average Precision (mAP) of 70.93% on the MFNet dataset and 45.14% on the DroneVehicle dataset. The code and model are released at https://github.com/Bitlijinfu/MoCTEFuse.
中文摘要:提出的MoCTEFuse网络通过光照门控的专家变换器动态平衡红外与可见光图像融合,采用竞争性损失函数在多个数据集上实现了优越性能。
English Summary: The proposed MoCTEFuse network dynamically balances infrared and visible image fusion through illumination-gated transformer experts, achieving superior performance across multiple datasets with a competitive loss function.
Authors:Yaozong Zheng, Bineng Zhong, Qihua Liang, Shengping Zhang, Guorong Li, Xianxian Li, Rongrong Ji
Abstract:
We propose a universal video-level modality-awareness tracking model with online dense temporal token learning (called {\modaltracker}). It is designed to support various tracking tasks, including RGB, RGB+Thermal, RGB+Depth, and RGB+Event, utilizing the same model architecture and parameters. Specifically, our model is designed with three core goals: \textbf{Video-level Sampling}. We expand the model's inputs to a video sequence level, aiming to see a richer video context from an near-global perspective. \textbf{Video-level Association}. Furthermore, we introduce two simple yet effective online dense temporal token association mechanisms to propagate the appearance and motion trajectory information of target via a video stream manner. \textbf{Modality Scalable}. We propose two novel gated perceivers that adaptively learn cross-modal representations via a gated attention mechanism, and subsequently compress them into the same set of model parameters via a one-shot training manner for multi-task inference. This new solution brings the following benefits: (i) The purified token sequences can serve as temporal prompts for the inference in the next video frames, whereby previous information is leveraged to guide future inference. (ii) Unlike multi-modal trackers that require independent training, our one-shot training scheme not only alleviates the training burden, but also improves model representation. Extensive experiments on visible and multi-modal benchmarks show that our {\modaltracker} achieves a new \textit{SOTA} performance. The code will be available at https://github.com/GXNU-ZhongLab/ODTrack.
中文: 提出的通用视频级模态感知跟踪模型{\modaltracker}采用在线密集时间令牌学习和一次性训练方案,以同一架构处理多种跟踪任务,在各类基准测试中均实现了最优性能。
English: The proposed universal video-level modality-awareness tracking model, called {\modaltracker}, employs online dense temporal token learning and a one-shot training scheme to handle multiple tracking tasks with the same architecture, achieving state-of-the-art performance across various benchmarks.
Authors:Fei Kong, Jinhao Duan, Kaidi Xu, Zhenhua Guo, Xiaofeng Zhu, Xiaoshuang Shi
Abstract:
Real-world applications, such as autonomous driving and humanoid robot manipulation, require precise spatial perception. However, it remains underexplored how Vision-Language Models (VLMs) recognize spatial relationships and perceive spatial movement. In this work, we introduce a spatial evaluation pipeline and construct a corresponding benchmark. Specifically, we categorize spatial understanding into two main types: absolute spatial understanding, which involves querying the absolute spatial position (e.g., left, right) of an object within an image, and 3D spatial understanding, which includes movement and rotation. Notably, our dataset is entirely synthetic, enabling the generation of test samples at a low cost while also preventing dataset contamination. We conduct experiments on multiple state-of-the-art VLMs and observe that there is significant room for improvement in their spatial understanding abilities. Explicitly, in our experiments, humans achieve near-perfect performance on all tasks, whereas current VLMs attain human-level performance only on the two simplest tasks. For the remaining tasks, the performance of VLMs is distinctly lower than that of humans. In fact, the best-performing Vision-Language Models even achieve near-zero scores on multiple tasks. The dataset and code are available on https://github.com/kong13661/LRR-Bench.
Chinese: 本研究引入了一个合成基准来评估视觉语言模型的空间理解能力,发现尽管人类在所有任务中表现出色,但现有模型明显落后,尤其在复杂空间推理方面。
English: This study introduces a synthetic benchmark to evaluate Vision-Language Models' spatial understanding, revealing that while humans excel across tasks, current VLMs significantly lag, especially in complex spatial reasoning.
Authors:Zeyu Xi, Haoying Sun, Yaofei Wu, Junchi Yan, Haoran Zhang, Lifang Wu, Liang Wang, Changwen Chen
Abstract:
Existing sports video captioning methods often focus on the action yet overlook player identities, limiting their applicability. Although some methods integrate extra information to generate identity-aware descriptions, the player identities are sometimes incorrect because the extra information is independent of the video content. This paper proposes a player-centric multimodal prompt generation network for identity-aware sports video captioning (LLM-IAVC), which focuses on recognizing player identities from a visual perspective. Specifically, an identity-related information extraction module (IRIEM) is designed to extract player-related multimodal embeddings. IRIEM includes a player identification network (PIN) for extracting visual features and player names, and a bidirectional semantic interaction module (BSIM) to link player features with video content for mutual enhancement. Additionally, a visual context learning module (VCLM) is designed to capture the key video context information. Finally, by integrating the outputs of the above modules as the multimodal prompt for the large language model (LLM), it facilitates the generation of descriptions with player identities. To support this work, we construct a new benchmark called NBA-Identity, a large identity-aware basketball video captioning dataset with 9,726 videos covering 9 major event types. The experimental results on NBA-Identity and VC-NBA-2022 demonstrate that our proposed model achieves advanced performance. Code and dataset are publicly available at https://github.com/Zeyu1226-mt/LLM-IAVC.
Chinese: 本文提出了一种以球员为中心的多模态提示生成网络(LLM-IAVC),通过提取视频中的身份相关信息来生成准确的带身份描述,在新建的NBA-Identity数据集上验证了其先进性能。
English: This paper introduces a player-centric multimodal prompt generation network (LLM-IAVC) that extracts identity-related information from sports videos to generate accurate identity-aware captions, validated on the new NBA-Identity dataset with state-of-the-art results.
Authors:Cheng Huang, Fan Gao, Yutong Liu, Yadi Liu, Xiaoli Ma, Ye Aung Moe, Yuhan Zhang, Yao Ma, Hao Wang, Xiangxiang Wang, Yongbin Yu
Abstract:
Insider trading violations, particularly delayed disclosures of Form 4 filings, remain a persistent challenge for financial market surveillance. Despite regulatory requirements such as the two-business-day rule of the Securities and Exchange Commission (SEC), enforcement is limited by the lack of large-scale, labeled datasets and task-specific benchmarks. In this paper, we introduce the Insider Filing Delay (IFD) dataset, the first and largest publicly available resource for insider disclosure behavior, comprising over one million Form 4 transactions spanning two decades (2002 to 2025), with structured annotations on delay status, insider roles, governance factors, and firm-level financial indicators. IFD enables the first large-scale formulation of strategic disclosure violation detection as a binary classification task grounded in regulatory compliance. To demonstrate the utility of IFD, we propose MaBoost, a hybrid framework combining a Mamba-based state space encoder with XGBoost, achieving high accuracy and interpretability in identifying high-risk behavioral patterns. Experiments across statistical baselines, deep learning models, and large language models confirm that MaBoost outperforms prior approaches, achieving an F1 score of up to 99.47 percent under constrained regulatory settings. IFD provides a realistic, reproducible, and behavior-rich dataset for developing AI models in financial compliance, regulatory forensics, and interpretable time series classification. All data and codes are available at: https://github.com/CH-YellowOrange/MaBoost-and-IFD.
Chinese: 本文介绍了内幕交易申报延迟(IFD)数据集,这是最大的公开内幕披露行为资源,并提出了MaBoost混合框架,在检测策略性披露违规方面实现了高精度,以高达99.47%的F1分数超越了现有方法。
English: The paper introduces the Insider Filing Delay (IFD) dataset, the largest public resource for insider trading disclosure behavior, and proposes MaBoost, a hybrid framework that achieves high accuracy in detecting strategic disclosure violations, outperforming existing methods with an F1 score up to 99.47%.
Authors:Yuhong Zhang, Liyao Wang, Han Wang, Danni Wu, Zuzeng Lin, Feng Wang, Li Song
Abstract:
Animation colorization plays a vital role in animation production, yet existing methods struggle to achieve color accuracy and temporal consistency. To address these challenges, we propose \textbf{AnimeColor}, a novel reference-based animation colorization framework leveraging Diffusion Transformers (DiT). Our approach integrates sketch sequences into a DiT-based video diffusion model, enabling sketch-controlled animation generation. We introduce two key components: a High-level Color Extractor (HCE) to capture semantic color information and a Low-level Color Guider (LCG) to extract fine-grained color details from reference images. These components work synergistically to guide the video diffusion process. Additionally, we employ a multi-stage training strategy to maximize the utilization of reference image color information. Extensive experiments demonstrate that AnimeColor outperforms existing methods in color accuracy, sketch alignment, temporal consistency, and visual quality. Our framework not only advances the state of the art in animation colorization but also provides a practical solution for industrial applications. The code will be made publicly available at \href{https://github.com/IamCreateAI/AnimeColor}{https://github.com/IamCreateAI/AnimeColor}.
中文: AnimeColor框架采用扩散变换器,结合高级颜色提取器和低级颜色引导器,通过基于参考的指导和多阶段训练实现了卓越的动画上色效果,在色彩准确性和时间一致性方面表现出优越性能。
English: The proposed AnimeColor framework utilizes Diffusion Transformers with a High-level Color Extractor and Low-level Color Guider to achieve superior animation colorization through reference-based guidance and multi-stage training, demonstrating enhanced performance in color accuracy and temporal consistency.
Authors:Daulet Toibazar, Kesen Wang, Sherif Mohamed, Abdulaziz Al-Badawi, Abdulrahman Alfulayt, Pedro J. Moreno
Abstract:
Vision-language models (VLMs) extend the conventional large language models by integrating visual data, enabling richer multimodal reasoning and significantly broadens the practical applications of AI. However, including visual inputs also brings new challenges in maintaining data quality. Empirical evidence consistently shows that carefully curated and representative training examples often yield superior results compared to simply increasing the quantity of data. Inspired by this observation, we introduce a streamlined data filtration framework that employs a compact VLM, fine-tuned on a high-quality image-caption annotated dataset. This model effectively evaluates and filters potential training samples based on caption and image quality and alignment. Unlike previous approaches, which typically add auxiliary filtration modules on top of existing full-scale VLMs, our method exclusively utilizes the inherent evaluative capability of a purpose-built small VLM. This strategy eliminates the need for extra modules and reduces training overhead. Our lightweight model efficiently filters out inaccurate, noisy web data, improving image-text alignment and caption linguistic fluency. Experimental results show that datasets underwent high-precision filtration using our compact VLM perform on par with, or even surpass, larger and noisier datasets gathered through high-volume web crawling. Thus, our method provides a lightweight yet robust solution for building high-quality vision-language training corpora. \\ \textbf{Availability and implementation:} Our compact VLM filtration model, training data, utility scripts, and Supplementary data (Appendices) are freely available at https://github.com/daulettoibazar/Compact_VLM_Filter.
中文: 本研究提出了一种轻量级数据过滤框架,利用紧凑型视觉语言模型过滤网络噪声数据以提升训练质量,在降低计算开销的同时,其过滤后的数据集性能可媲美甚至优于大规模采集数据。
English: This study introduces a lightweight data filtration framework using a compact vision-language model to enhance training data quality by filtering noisy web data, achieving performance comparable to or better than larger datasets while reducing computational overhead.
Authors:Jingxi Liao, Shijie Hao, Richang Hong, Meng Wang
Abstract:
Low-light image enhancement (LLIE) aims to improve the visual quality of images captured under poor lighting conditions. In supervised LLIE research, there exists a significant yet often overlooked inconsistency between the overall brightness of an enhanced image and its ground truth counterpart, referred to as brightness mismatch in this study. Brightness mismatch negatively impact supervised LLIE models by misleading model training. However, this issue is largely neglected in current research. In this context, we propose the GT-mean loss, a simple yet effective loss function directly modeling the mean values of images from a probabilistic perspective. The GT-mean loss is flexible, as it extends existing supervised LLIE loss functions into the GT-mean form with minimal additional computational costs. Extensive experiments demonstrate that the incorporation of the GT-mean loss results in consistent performance improvements across various methods and datasets.
中文: 本研究针对弱光图像增强中常被忽视的亮度失配问题,提出了一种灵活的GT-mean损失函数,能以最小计算成本持续提升各种方法和数据集的性能表现。
English: This study addresses the overlooked issue of brightness mismatch in supervised low-light image enhancement (LLIE) by proposing a flexible GT-mean loss function that consistently improves performance across methods and datasets with minimal computational overhead.
Authors:Kesen Wang, Daulet Toibazar, Abdulrahman Alfulayt, Abdulaziz S. Albadawi, Ranya A. Alkahtani, Asma A. Ibrahim, Haneen A. Alhomoud, Sherif Mohamed, Pedro J. Moreno
Abstract:
Document Understanding (DU) in long-contextual scenarios with complex layouts remains a significant challenge in vision-language research. Although Large Vision-Language Models (LVLMs) excel at short-context DU tasks, their performance declines in long-context settings. A key limitation is the scarcity of fine-grained training data, particularly for low-resource languages such as Arabic. Existing state-of-the-art techniques rely heavily on human annotation, which is costly and inefficient. We propose a fully automated, multi-agent interactive framework to generate long-context questions efficiently. Our approach efficiently generates high-quality single- and multi-page questions for extensive English and Arabic documents, covering hundreds of pages across diverse domains. This facilitates the development of LVLMs with enhanced long-context understanding ability. Experimental results in this work have shown that our generated English and Arabic questions (\textbf{AraEngLongBench}) are quite challenging to major open- and close-source LVLMs. The code and data proposed in this work can be found in https://github.com/wangk0b/Multi_Agentic_QA_Long_Doc.git. Sample Question and Answer (QA) pairs and structured system prompts can be found in the Appendix.
中文: 本文提出了一种全自动多智能体交互框架,能够高效生成长篇英文和阿拉伯文文档的复杂问答数据,解决了细粒度训练数据稀缺问题,显著提升了大规模视觉语言模型的长文档理解能力。
English: This paper introduces an automated multi-agent framework that generates challenging long-context questions for English and Arabic documents, addressing the scarcity of training data and enhancing large vision-language models' document understanding capabilities.
Authors:Zeyi Liu, Songqiao Hu, Pengyu Han, Jiaming Liu, Xiao He
Abstract:
In recent years, online learning has attracted increasing attention due to its adaptive capability to process streaming and non-stationary data. To facilitate algorithm development and practical deployment in this area, we introduce Awesome-OL, an extensible Python toolkit tailored for online learning research. Awesome-OL integrates state-of-the-art algorithm, which provides a unified framework for reproducible comparisons, curated benchmark datasets, and multi-modal visualization. Built upon the scikit-multiflow open-source infrastructure, Awesome-OL emphasizes user-friendly interactions without compromising research flexibility or extensibility. The source code is publicly available at: https://github.com/liuzy0708/Awesome-OL.
中文:Awesome-OL 是一款专为在线学习研究设计的 Python 工具包,集成了先进算法、基准数据集和可视化工具,以支持可重复比较和灵活部署。
English: Awesome-OL is a Python toolkit designed for online learning research, integrating advanced algorithms, benchmark datasets, and visualization tools to support reproducible comparisons and flexible deployment.
Authors:Baiyu Chen, Wilson Wongso, Xiaoqian Hu, Yue Tan, Flora Salim
Abstract:
This paper presents the technical solution developed by team CRUISE for the KDD Cup 2025 Meta Comprehensive RAG Benchmark for Multi-modal, Multi-turn (CRAG-MM) challenge. The challenge aims to address a critical limitation of modern Vision Language Models (VLMs): their propensity to hallucinate, especially when faced with egocentric imagery, long-tail entities, and complex, multi-hop questions. This issue is particularly problematic in real-world applications where users pose fact-seeking queries that demand high factual accuracy across diverse modalities. To tackle this, we propose a robust, multi-stage framework that prioritizes factual accuracy and truthfulness over completeness. Our solution integrates a lightweight query router for efficiency, a query-aware retrieval and summarization pipeline, a dual-pathways generation and a post-hoc verification. This conservative strategy is designed to minimize hallucinations, which incur a severe penalty in the competition's scoring metric. Our approach achieved 3rd place in Task 1, demonstrating the effectiveness of prioritizing answer reliability in complex multi-modal RAG systems. Our implementation is available at https://github.com/Breezelled/KDD-Cup-2025-Meta-CRAG-MM .
中文: 本文介绍了CRUISE团队针对KDD Cup 2025多模态对话基准挑战提出的解决方案,该方案通过查询路由、检索、双路径生成和后验验证的多阶段框架,优先保证事实准确性,有效减少了视觉语言模型的幻觉问题,最终获得任务第三名。
English: This paper introduces CRUISE team's third-place winning solution for the KDD Cup 2025 CRAG-MM challenge, featuring a multi-stage framework that prioritizes factual accuracy through query routing, retrieval, dual-path generation, and verification to minimize hallucinations in Vision Language Models.
Authors:Shizuka Akahori, Shotaro Teruya, Pragyan Shrestha, Yuichi Yoshii, Satoshi Iizuka, Akira Ikumi, Hiromitsu Tsuge, Itaru Kitahara
Abstract:
This study proposes a reconstruction-based framework for detecting medial epicondyle avulsion in elbow ultrasound images, trained exclusively on normal cases. Medial epicondyle avulsion, commonly observed in baseball players, involves bone detachment and deformity, often appearing as discontinuities in bone contour. Therefore, learning the structure and continuity of normal bone is essential for detecting such abnormalities. To achieve this, we propose a masked autoencoder-based, structure-aware reconstruction framework that learns the continuity of normal bone structures. Even in the presence of avulsion, the model attempts to reconstruct the normal structure, resulting in large reconstruction errors at the avulsion site. For evaluation, we constructed a novel dataset comprising normal and avulsion ultrasound images from 16 baseball players, with pixel-level annotations under orthopedic supervision. Our method outperformed existing approaches, achieving a pixel-wise AUC of 0.965 and an image-wise AUC of 0.967. The dataset is publicly available at: https://github.com/Akahori000/Ultrasound-Medial-Epicondyle-Avulsion-Dataset.
本研究提出了一种基于掩码自编码器的重建框架,通过仅学习正常骨骼结构来检测肘部超声图像中的肱骨内上髁撕脱,在新公开数据集上取得了超过0.96的AUC值。
This study introduces a reconstruction-based method using masked autoencoders to detect medial epicondyle avulsion in elbow ultrasound by learning normal bone structures, achieving high accuracy with AUC scores over 0.96 on a new public dataset.
Authors:Shizuka Akahori, Shotaro Teruya, Pragyan Shrestha, Yuichi Yoshii, Satoshi Iizuka, Akira Ikumi, Hiromitsu Tsuge, Itaru Kitahara
Abstract:
This study proposes a reconstruction-based framework for detecting medial epicondyle avulsion in elbow ultrasound images, trained exclusively on normal cases. Medial epicondyle avulsion, commonly observed in baseball players, involves bone detachment and deformity, often appearing as discontinuities in bone contour. Therefore, learning the structure and continuity of normal bone is essential for detecting such abnormalities. To achieve this, we propose a masked autoencoder-based, structure-aware reconstruction framework that learns the continuity of normal bone structures. Even in the presence of avulsion, the model attempts to reconstruct the normal structure, resulting in large reconstruction errors at the avulsion site. For evaluation, we constructed a novel dataset comprising normal and avulsion ultrasound images from 16 baseball players, with pixel-level annotations under orthopedic supervision. Our method outperformed existing approaches, achieving a pixel-wise AUC of 0.965 and an image-wise AUC of 0.967. The dataset is publicly available at: https://github.com/Akahori000/Ultrasound-Medial-Epicondyle-Avulsion-Dataset.
本研究提出了一种基于掩码自编码器的重建框架,通过仅学习正常骨骼结构来检测肘部超声图像中的肱骨内上髁撕脱,在新公开数据集上取得了超过0.96的AUC值。
This study introduces a reconstruction-based method using masked autoencoders to detect medial epicondyle avulsion in elbow ultrasound by learning normal bone structures, achieving high accuracy with AUC scores over 0.96 on a new public dataset.
Authors:Haoyue Li, Di Wu
Abstract:
Hyperspectral image denoising faces the challenge of multi-dimensional coupling of spatially non-uniform noise and spectral correlation interference. Existing deep learning methods mostly focus on RGB images and struggle to effectively handle the unique spatial-spectral characteristics and complex noise distributions of hyperspectral images (HSI). This paper proposes an HSI denoising framework, Hybrid-Domain Synergistic Transformer Network (HDST), based on frequency domain enhancement and multiscale modeling, achieving three-dimensional collaborative processing of spatial, frequency and channel domains. The method innovatively integrates three key mechanisms: (1) introducing an FFT preprocessing module with multi-band convolution to extract cross-band correlations and decouple spectral noise components; (2) designing a dynamic cross-domain attention module that adaptively fuses spatial domain texture features and frequency domain noise priors through a learnable gating mechanism; (3) building a hierarchical architecture where shallow layers capture global noise statistics using multiscale atrous convolution, and deep layers achieve detail recovery through frequency domain postprocessing. Experiments on both real and synthetic datasets demonstrate that HDST significantly improves denoising performance while maintaining computational efficiency, validating the effectiveness of the proposed method. This research provides new insights and a universal framework for addressing complex noise coupling issues in HSI and other high-dimensional visual data. The code is available at https://github.com/lhy-cn/HDST-HSIDenoise.
中文摘要:本文提出HDST混合域协同Transformer网络,通过频域增强与多尺度建模实现空间、频率和通道的三维协同处理,在真实与合成数据集上均显著提升了高光谱图像去噪性能。
English Summary: This paper introduces HDST, a hybrid-domain transformer network that effectively denoises hyperspectral images by synergistically processing spatial, frequency, and channel domains through frequency enhancement and multiscale modeling, demonstrating superior performance on both real and synthetic datasets.
Authors:Ran Xu, Yuchen Zhuang, Yue Yu, Haoyu Wang, Wenqi Shi, Carl Yang
Abstract:
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by integrating external knowledge retrieved at inference time. While RAG demonstrates strong performance on benchmarks largely derived from general-domain corpora like Wikipedia, its effectiveness under realistic, diverse retrieval scenarios remains underexplored. We evaluated RAG systems using MassiveDS, a large-scale datastore with mixture of knowledge, and identified critical limitations: retrieval mainly benefits smaller models, rerankers add minimal value, and no single retrieval source consistently excels. Moreover, current LLMs struggle to route queries across heterogeneous knowledge sources. These findings highlight the need for adaptive retrieval strategies before deploying RAG in real-world settings. Our code and data can be found at https://github.com/ritaranx/RAG_in_the_Wild.
中文: RAG通过外部知识增强大语言模型,但在多样化现实场景中效果有限,如对大模型提升不足、跨异构知识源的查询路由困难,需开发自适应检索策略才能实际应用。
English: RAG enhances LLMs with external knowledge but faces limitations in diverse real-world scenarios, such as limited benefits for larger models and poor query routing across heterogeneous sources, necessitating adaptive strategies before deployment.
Authors:Yin Xie, Kaicheng Yang, Xiang An, Kun Wu, Yongle Zhao, Weimo Deng, Zimin Ran, Yumeng Wang, Ziyong Feng, Roy Miles, Ismail Elezi, Jiankang Deng
Abstract:
Learning visual representations is foundational for a broad spectrum of downstream tasks. Although recent vision-language contrastive models, such as CLIP and SigLIP, have achieved impressive zero-shot performance via large-scale vision-language alignment, their reliance on global representations constrains their effectiveness for dense prediction tasks, such as grounding, OCR, and segmentation. To address this gap, we introduce Region-Aware Cluster Discrimination (RICE), a novel method that enhances region-level visual and OCR capabilities. We first construct a billion-scale candidate region dataset and propose a Region Transformer layer to extract rich regional semantics. We further design a unified region cluster discrimination loss that jointly supports object and OCR learning within a single classification framework, enabling efficient and scalable distributed training on large-scale data. Extensive experiments show that RICE consistently outperforms previous methods on tasks, including segmentation, dense detection, and visual perception for Multimodal Large Language Models (MLLMs). The pre-trained models have been released at https://github.com/deepglint/MVT.
中文: RICE是一种新颖方法,通过构建大规模区域数据集并采用统一的聚类判别损失,提升了区域级视觉和OCR能力,在分割和检测任务中持续优于先前方法。
English: RICE is a novel method that enhances region-level vision and OCR capabilities by constructing a large-scale region dataset and employing a unified cluster discrimination loss, consistently outperforming previous methods in segmentation and detection tasks.
Authors:Lehan Wang, Hualiang Wang, Chubin Ou, Lushi Chen, Yunyi Liang, Xiaomeng Li
Abstract:
Cardiovascular disease (CVD) remains the leading cause of death worldwide, requiring urgent development of effective risk assessment methods for timely intervention. While current research has introduced non-invasive and efficient approaches to predict CVD risk from retinal imaging with deep learning models, the commonly used fundus photographs and Optical Coherence Tomography (OCT) fail to capture detailed vascular features critical for CVD assessment compared with OCT angiography (OCTA) images. Moreover, existing methods typically classify CVD risk only as high or low, without providing a deeper analysis on CVD-related blood factor conditions, thus limiting prediction accuracy and clinical utility. As a result, we propose a novel multi-purpose paradigm of CVD risk assessment that jointly performs CVD risk and CVD-related condition prediction, aligning with clinical experiences. Based on this core idea, we introduce OCTA-CVD, the first OCTA dataset for CVD risk assessment, and a Vessel-Aware Mamba-based Prediction model with Informative Enhancement (VAMPIRE) based on OCTA enface images. Our proposed model aims to extract crucial vascular characteristics through two key components: (1) a Mamba-Based Directional (MBD) Module that captures fine-grained vascular trajectory features and (2) an Information-Enhanced Morphological (IEM) Module that incorporates comprehensive vessel morphology knowledge. Experimental results demonstrate that our method can surpass standard classification backbones, OCTA-based detection methods, and ophthalmologic foundation models. Our codes and the collected OCTA-CVD dataset are available at https://github.com/xmed-lab/VAMPIRE.
中文摘要:本研究提出了一种基于OCTA成像的新型心血管疾病风险评估多用途范式,通过VAMPIRE模型提取关键血管特征,在预测精度上超越了现有方法。
English Summary: This study introduces a novel multi-purpose paradigm for cardiovascular disease (CVD) risk assessment using OCTA imaging, featuring the VAMPIRE model that extracts detailed vascular features through specialized modules to surpass existing methods in prediction accuracy.
Authors:Liu junkang, Yuanyuan Liu, Fanhua Shang, Hongying Liu, Jin Liu, Wei Feng
Abstract:
For federated learning (FL) algorithms such as FedSAM, their generalization capability is crucial for real-word applications. In this paper, we revisit the generalization problem in FL and investigate the impact of data heterogeneity on FL generalization. We find that FedSAM usually performs worse than FedAvg in the case of highly heterogeneous data, and thus propose a novel and effective federated learning algorithm with Stochastic Weight Averaging (called \texttt{FedSWA}), which aims to find flatter minima in the setting of highly heterogeneous data. Moreover, we introduce a new momentum-based stochastic controlled weight averaging FL algorithm (\texttt{FedMoSWA}), which is designed to better align local and global models.
Theoretically, we provide both convergence analysis and generalization bounds for \texttt{FedSWA} and \texttt{FedMoSWA}. We also prove that the optimization and generalization errors of \texttt{FedMoSWA} are smaller than those of their counterparts, including FedSAM and its variants. Empirically, experimental results on CIFAR10/100 and Tiny ImageNet demonstrate the superiority of the proposed algorithms compared to their counterparts. Open source code at: https://github.com/junkangLiu0/FedSWA.
中文摘要:本文针对联邦学习中数据高度异构的问题,提出了FedSWA和FedMoSWA两种新算法,通过随机权重平均技术寻找更平坦的最小值,在理论和实验上均证明其比现有方法具有更优的泛化性能。
English Summary: This paper proposes two novel federated learning algorithms, FedSWA and FedMoSWA, which utilize stochastic weight averaging to find flatter minima and improve generalization performance under highly heterogeneous data conditions, demonstrating superior theoretical and empirical results compared to existing methods.
Authors:Padmavathi Moorthy
Abstract:
Precise fare prediction is crucial in ride-hailing platforms and urban mobility systems. This study examines three machine learning models-Graph Attention Networks (GAT), XGBoost, and TimesNet to evaluate their predictive capabilities for taxi fares using a real-world dataset comprising over 55 million records. Both raw (noisy) and denoised versions of the dataset are analyzed to assess the impact of data quality on model performance. The study evaluated the models along multiple axes, including predictive accuracy, calibration, uncertainty estimation, out-of-distribution (OOD) robustness, and feature sensitivity. We also explore pre-processing strategies, including KNN imputation, Gaussian noise injection, and autoencoder-based denoising. The study reveals critical differences between classical and deep learning models under realistic conditions, offering practical guidelines for building robust and scalable models in urban fare prediction systems.
中文: 本研究通过超过5500万条真实数据评估了GAT、XGBoost和TimesNet三种机器学习模型在出租车费预测中的表现,从准确性、鲁棒性和数据质量多维度对比分析,为城市交通系统提供了实用的建模指导。
English: This study evaluates three machine learning models—GAT, XGBoost, and TimesNet—for taxi fare prediction using a large real-world dataset, analyzing their performance across accuracy, robustness, and data quality while providing practical guidelines for urban mobility systems.
Authors:Hao-Yu Hou, Chun-Yi Lee, Motoharu Sonogashira, Yasutomo Kawanishi
Abstract:
The ability to abstract complex 3D environments into simplified and structured representations is crucial across various domains. 3D semantic scene graphs (SSGs) achieve this by representing objects as nodes and their interrelationships as edges, facilitating high-level scene understanding. Existing methods for 3D SSG generation, however, face significant challenges, including high computational demands and non-incremental processing that hinder their suitability for real-time open-world applications. To address this issue, we propose FROSS (Faster-than-Real-Time Online 3D Semantic Scene Graph Generation), an innovative approach for online and faster-than-real-time 3D SSG generation that leverages the direct lifting of 2D scene graphs to 3D space and represents objects as 3D Gaussian distributions. This framework eliminates the dependency on precise and computationally-intensive point cloud processing. Furthermore, we extend the Replica dataset with inter-object relationship annotations, creating the ReplicaSSG dataset for comprehensive evaluation of FROSS. The experimental results from evaluations on ReplicaSSG and 3DSSG datasets show that FROSS can achieve superior performance while operating significantly faster than prior 3D SSG generation methods. Our implementation and dataset are publicly available at https://github.com/Howardkhh/FROSS.
中文摘要:FROSS提出了一种超实时生成3D语义场景图的新方法,通过将2D场景图转换为3D高斯表示来避免密集点云处理,在速度和性能上均优于现有方法。
English Summary: FROSS introduces a faster-than-real-time method for generating 3D semantic scene graphs by converting 2D graphs into 3D Gaussian representations, eliminating intensive point cloud processing and outperforming existing approaches in speed and performance.
Authors:Mizanur Rahman, Md Tahmid Rahman Laskar, Shafiq Joty, Enamul Hoque
Abstract:
Automated data visualization plays a crucial role in simplifying data interpretation, enhancing decision-making, and improving efficiency. While large language models (LLMs) have shown promise in generating visualizations from natural language, the absence of comprehensive benchmarks limits the rigorous evaluation of their capabilities. We introduce Text2Vis, a benchmark designed to assess text-to-visualization models, covering 20+ chart types and diverse data science queries, including trend analysis, correlation, outlier detection, and predictive analytics. It comprises 1,985 samples, each with a data table, natural language query, short answer, visualization code, and annotated charts. The queries involve complex reasoning, conversational turns, and dynamic data retrieval. We benchmark 11 open-source and closed-source models, revealing significant performance gaps, highlighting key challenges, and offering insights for future advancements. To close this gap, we propose the first cross-modal actor-critic agentic framework that jointly refines the textual answer and visualization code, increasing GPT-4o`s pass rate from 26% to 42% over the direct approach and improving chart quality. We also introduce an automated LLM-based evaluation framework that enables scalable assessment across thousands of samples without human annotation, measuring answer correctness, code execution success, visualization readability, and chart accuracy. We release Text2Vis at https://github.com/vis-nlp/Text2Vis.
中文: 本文提出Text2Vis基准,用于全面评估文本到可视化模型在多种图表类型和数据查询中的表现,揭示了性能差距并提出智能体框架,显著提升GPT-4o性能并实现自动化评估。
English: This paper introduces Text2Vis, a comprehensive benchmark for evaluating text-to-visualization models across diverse chart types and data queries, revealing performance gaps and proposing an agentic framework that significantly improves GPT-4o's performance while enabling automated evaluation.
Authors:Cesar Kadir Torrico Villanueva, Jiaxin Cindy Tu, Mihir Tripathy, Connor Lane, Rishab Iyer, Paul S. Scotti
Abstract:
We present MedARC's team solution to the Algonauts 2025 challenge. Our pipeline leveraged rich multimodal representations from various state-of-the-art pretrained models across video (V-JEPA2), speech (Whisper), text (Llama 3.2), vision-text (InternVL3), and vision-text-audio (Qwen2.5-Omni). These features extracted from the models were linearly projected to a latent space, temporally aligned to the fMRI time series, and finally mapped to cortical parcels through a lightweight encoder comprising a shared group head plus subject-specific residual heads. We trained hundreds of model variants across hyperparameter settings, validated them on held-out movies and assembled ensembles targeted to each parcel in each subject. Our final submission achieved a mean Pearson's correlation of 0.2085 on the test split of withheld out-of-distribution movies, placing our team in fourth place for the competition. We further discuss a last-minute optimization that would have raised us to second place. Our results highlight how combining features from models trained in different modalities, using a simple architecture consisting of shared-subject and single-subject components, and conducting comprehensive model selection and ensembling improves generalization of encoding models to novel movie stimuli. All code is available on GitHub.
中文: MedARC团队在Algonauts 2025挑战赛中通过整合多模态预训练模型特征,采用共享组头与个体残差头的轻量编码器架构,结合大规模超参数调优与集成学习,在测试集上获得0.2085平均皮尔逊相关系数,最终位列第四。
English: MedARC's fourth-place solution for the Algonauts 2025 challenge combined multimodal features from state-of-the-art models using a lightweight encoder with shared and subject-specific components, achieving strong generalization through extensive hyperparameter tuning and ensemble methods.
Authors:Chengyu Zheng, Jin Huang, Honghua Chen, Mingqiang Wei
Abstract:
Recent research leveraging large-scale pretrained diffusion models has demonstrated the potential of using diffusion features to establish semantic correspondences in images. Inspired by advancements in diffusion-based techniques, we propose a novel zero-shot method for refining point cloud registration algorithms. Our approach leverages correspondences derived from depth images to enhance point feature representations, eliminating the need for a dedicated training dataset. Specifically, we first project the point cloud into depth maps from multiple perspectives and extract implicit knowledge from a pretrained diffusion network as depth diffusion features. These features are then integrated with geometric features obtained from existing methods to establish more accurate correspondences between point clouds. By leveraging these refined correspondences, our approach achieves significantly improved registration accuracy. Extensive experiments demonstrate that our method not only enhances the performance of existing point cloud registration techniques but also exhibits robust generalization capabilities across diverse datasets. Codes are available at https://github.com/zhengcy-lambo/RARE.git.
中文: 本文提出一种零样本方法,通过将预训练模型的深度扩散特征与几何特征相结合来优化点云配准,无需训练数据即可显著提升配准精度和泛化能力。
English: This paper introduces a zero-shot method that refines point cloud registration by integrating depth diffusion features from pretrained models with geometric features, achieving enhanced accuracy and generalization without requiring training data.
Authors:Qingqing Fang, Wenxi Lv, Qinliang Su
Abstract:
Visual anomaly detection has been widely used in industrial inspection and medical diagnosis. Existing methods typically demand substantial training samples, limiting their utility in zero-/few-shot scenarios. While recent efforts have leveraged CLIP's zero-shot recognition capability for this task, they often ignore optimizing visual features to focus on local anomalies, reducing their efficacy. In this work, we propose AF-CLIP (Anomaly-Focused CLIP) by dramatically enhancing its visual representations to focus on local defects. Our approach introduces a lightweight adapter that emphasizes anomaly-relevant patterns in visual features, simultaneously optimizing both class-level features for image classification and patch-level features for precise localization. To capture anomalies of different sizes and improve detection accuracy, prior to the adapter, we develop a multi-scale spatial aggregation mechanism to effectively consolidate neighborhood context. Complementing these visual enhancements, we design learnable textual prompts that generically characterize normal and abnormal states. After optimization on auxiliary datasets using a composite objective function, AF-CLIP demonstrates strong zero-shot detection capability. Our method is also extended to few-shot scenarios by extra memory banks. Experimental results across diverse industrial and medical datasets demonstrate the effectiveness and generalization of our proposed method. Code is available at https://github.com/Faustinaqq/AF-CLIP.
中文摘要:本研究提出的AF-CLIP方法通过轻量级适配器和多尺度聚合机制增强CLIP的视觉表征能力,在零样本场景下能更精准地检测局部异常,在工业和医疗数据集上均展现出优异性能。
English Summary: The proposed AF-CLIP method enhances CLIP's visual representations through a lightweight adapter and multi-scale aggregation to better detect local anomalies in zero-shot scenarios, demonstrating strong performance across industrial and medical datasets.
Authors:Zimin Chen, Yue Pan, Siyu Lu, Jiayi Xu, Claire Le Goues, Martin Monperrus, He Ye
Abstract:
Language model (LM) agents, such as SWE-agent and OpenHands, have made progress toward automated issue resolution. However, existing approaches are often limited to Python-only issues and rely on pre-constructed containers in SWE-bench with reproduced issues, restricting their applicability to real-world and work for multi-language repositories. We present Prometheus, designed to resolve real-world issues beyond benchmark settings. Prometheus is a multi-agent system that transforms an entire code repository into a unified knowledge graph to guide context retrieval for issue resolution. Prometheus encodes files, abstract syntax trees, and natural language text into a graph of typed nodes and five general edge types to support multiple programming languages. Prometheus uses Neo4j for graph persistence, enabling scalable and structured reasoning over large codebases. Integrated by the DeepSeek-V3 model, Prometheus resolves 28.67% and 13.7% of issues on SWE-bench Lite and SWE-bench Multilingual, respectively, with an average API cost of $0.23 and $0.38 per issue. Prometheus resolves 10 unique issues not addressed by prior work and is the first to demonstrate effectiveness across seven programming languages. Moreover, it shows the ability to resolve real-world GitHub issues in the LangChain and OpenHands repositories. We have open-sourced Prometheus at: https://github.com/Pantheon-temple/Prometheus
中文:Prometheus是一个多智能体系统,通过将整个代码库转化为统一知识图谱来解决多编程语言的实际问题,在基准测试中表现优异,并成功处理了真实GitHub仓库中的问题。
English: Prometheus is a multi-agent system that converts entire code repositories into unified knowledge graphs to resolve real-world issues across multiple programming languages, achieving significant success on benchmark tests while demonstrating practical effectiveness on real GitHub repositories.
Authors:Qing Xu, Yanming Chen, Yue Li, Ziyu Liu, Zhenye Lou, Yixuan Zhang, Xiangjian He
Abstract:
Medical image segmentation plays an important role in computer-aided diagnosis. Traditional convolution-based U-shape segmentation architectures are usually limited by the local receptive field. Existing vision transformers have been widely applied to diverse medical segmentation frameworks due to their superior capabilities of capturing global contexts. Despite the advantage, the real-world application of vision transformers is challenged by their non-linear self-attention mechanism, requiring huge computational costs. To address this issue, the selective state space model (SSM) Mamba has gained recognition for its adeptness in modeling long-range dependencies in sequential data, particularly noted for its efficient memory costs. In this paper, we propose MambaVesselNet++, a Hybrid CNN-Mamba framework for medical image segmentation. Our MambaVesselNet++ is comprised of a hybrid image encoder (Hi-Encoder) and a bifocal fusion decoder (BF-Decoder). In Hi-Encoder, we first devise the texture-aware layer to capture low-level semantic features by leveraging convolutions. Then, we utilize Mamba to effectively model long-range dependencies with linear complexity. The Bi-Decoder adopts skip connections to combine local and global information of the Hi-Encoder for the accurate generation of segmentation masks. Extensive experiments demonstrate that MambaVesselNet++ outperforms current convolution-based, transformer-based, and Mamba-based state-of-the-arts across diverse medical 2D, 3D, and instance segmentation tasks. The code is available at https://github.com/CC0117/MambaVesselNet.
Chinese: MambaVesselNet++ 是一种混合CNN-Mamba框架,通过创新的编码器-解码器设计有效整合局部纹理特征与全局依赖关系,在线性计算复杂度下实现了医学图像分割的卓越性能。
English: MambaVesselNet++ is a hybrid CNN-Mamba framework that effectively integrates local texture features and global dependencies through its innovative encoder-decoder design, achieving superior performance in medical image segmentation with linear computational complexity.
Authors:Parsa Vares, Ãloi Durant, Jun Pang, Nicolas Médoc, Mohammad Ghoniem
Abstract:
Thompson Sampling (TS) and its variants are powerful Multi-Armed Bandit algorithms used to balance exploration and exploitation strategies in active learning. Yet, their probabilistic nature often turns them into a "black box", hindering debugging and trust. We introduce TS-Insight, a visual analytics tool explicitly designed to shed light on the internal decision mechanisms of Thompson Sampling-based algorithms, for model developers. It comprises multiple plots, tracing for each arm the evolving posteriors, evidence counts, and sampling outcomes, enabling the verification, diagnosis, and explainability of exploration/exploitation dynamics. This tool aims at fostering trust and facilitating effective debugging and deployment in complex binary decision-making scenarios especially in sensitive domains requiring interpretable decision-making.
中文: TS-Insight是一款可视化分析工具,通过多图展示汤普森采样算法的内部决策机制,增强信任并促进在敏感领域中的有效调试。
English: TS-Insight is a visual analytics tool that reveals the internal decision mechanisms of Thompson Sampling algorithms through multiple plots, enhancing trust and enabling effective debugging in sensitive domains.
Authors:Xiaohua Feng, Jiaming Zhang, Fengyuan Yu, Chengye Wang, Li Zhang, Kaixiang Li, Yuyuan Li, Chaochao Chen, Jianwei Yin
Abstract:
With the rapid advancement of generative models, associated privacy concerns have attracted growing attention. To address this, researchers have begun adapting machine unlearning techniques from traditional classification models to generative settings. Although notable progress has been made in this area, a unified framework for systematically organizing and integrating existing work is still lacking. The substantial differences among current studies in terms of unlearning objectives and evaluation protocols hinder the objective and fair comparison of various approaches. While some studies focus on specific types of generative models, they often overlook the commonalities and systematic characteristics inherent in Generative Model Unlearning (GenMU). To bridge this gap, we provide a comprehensive review of current research on GenMU and propose a unified analytical framework for categorizing unlearning objectives, methodological strategies, and evaluation metrics. In addition, we explore the connections between GenMU and related techniques, including model editing, reinforcement learning from human feedback, and controllable generation. We further highlight the potential practical value of unlearning techniques in real-world applications. Finally, we identify key challenges and outline future research directions aimed at laying a solid foundation for further advancements in this field. We consistently maintain the related open-source materials at https://github.com/caxLee/Generative-model-unlearning-survey.
中文: 本文提出生成模型遗忘的统一分析框架,系统梳理了目标、方法和评估指标,并探讨了实际应用价值与未来研究方向。
English: This review introduces a unified framework for generative model unlearning, categorizing objectives, methods, and evaluations while highlighting practical applications and future challenges.
Authors:Drandreb Earl O. Juanico, Rowel O. Atienza, Jeffrey Kenneth Go
Abstract:
We propose Reverse Contrast Attention (RCA), a plug-in method that enhances object localization in vision-language transformers without retraining. RCA reweights final-layer attention by suppressing extremes and amplifying mid-level activations to let semantically relevant but subdued tokens guide predictions. We evaluate it on Open Vocabulary Referring Object Detection (OV-RefOD), introducing FitAP, a confidence-free average precision metric based on IoU and box area. RCA improves FitAP in 11 out of 15 open-source VLMs, with gains up to $+26.6\%$. Effectiveness aligns with attention sharpness and fusion timing; while late-fusion models benefit consistently, models like $\texttt{DeepSeek-VL2}$ also improve, pointing to capacity and disentanglement as key factors. RCA offers both interpretability and performance gains for multimodal transformers. Codes and dataset are available from https://github.com/earl-juanico/rca
Chinese Summary: 反向对比注意力(RCA)是一种无需重新训练的即插即用方法,通过重新加权注意力机制增强中层激活来提升视觉语言Transformer中的目标定位能力,在开放词汇参考目标检测任务中最高可实现26.6%的性能提升。
English Summary: Reverse Contrast Attention (RCA) is a plug-in method that improves object localization in vision-language transformers by reweighting attention to enhance mid-level activations, achieving performance gains up to 26.6% on Open Vocabulary Referring Object Detection without requiring retraining.
Authors:X. Feng, S. Hu, X. Li, D. Zhang, M. Wu, J. Zhang, X. Chen, K. Huang
Abstract:
Vision-language tracking aims to locate the target object in the video sequence using a template patch and a language description provided in the initial frame. To achieve robust tracking, especially in complex long-term scenarios that reflect real-world conditions as recently highlighted by MGIT, it is essential not only to characterize the target features but also to utilize the context features related to the target. However, the visual and textual target-context cues derived from the initial prompts generally align only with the initial target state. Due to their dynamic nature, target states are constantly changing, particularly in complex long-term sequences. It is intractable for these cues to continuously guide Vision-Language Trackers (VLTs). Furthermore, for the text prompts with diverse expressions, our experiments reveal that existing VLTs struggle to discern which words pertain to the target or the context, complicating the utilization of textual cues. In this work, we present a novel tracker named ATCTrack, which can obtain multimodal cues Aligned with the dynamic target states through comprehensive Target-Context feature modeling, thereby achieving robust tracking. Specifically, (1) for the visual modality, we propose an effective temporal visual target-context modeling approach that provides the tracker with timely visual cues. (2) For the textual modality, we achieve precise target words identification solely based on textual content, and design an innovative context words calibration method to adaptively utilize auxiliary context words. (3) We conduct extensive experiments on mainstream benchmarks and ATCTrack achieves a new SOTA performance. The code and models will be released at: https://github.com/XiaokunFeng/ATCTrack.
中文: ATCTrack是一种新型视觉语言跟踪器,通过全面的目标-上下文特征建模,动态对齐多模态线索与变化的目标状态,从而在复杂长期场景中实现鲁棒跟踪。
English: ATCTrack is a novel vision-language tracker that enhances robustness in complex long-term scenarios by dynamically aligning multimodal cues with evolving target states through comprehensive target-context feature modeling.
Authors:Guanting Dong, Hangyu Mao, Kai Ma, Licheng Bao, Yifei Chen, Zhongyuan Wang, Zhongxia Chen, Jiazhen Du, Huiyang Wang, Fuzheng Zhang, Guorui Zhou, Yutao Zhu, Ji-Rong Wen, Zhicheng Dou
Abstract:
Large-scale reinforcement learning with verifiable rewards (RLVR) has demonstrated its effectiveness in harnessing the potential of large language models (LLMs) for single-turn reasoning tasks. In realistic reasoning scenarios, LLMs can often utilize external tools to assist in task-solving processes. However, current RL algorithms inadequately balance the models' intrinsic long-horizon reasoning capabilities and their proficiency in multi-turn tool interactions. To bridge this gap, we propose Agentic Reinforced Policy Optimization (ARPO), a novel agentic RL algorithm tailored for training multi-turn LLM-based agents. Through preliminary experiments, we observe that LLMs tend to exhibit highly uncertain behavior, characterized by an increase in the entropy distribution of generated tokens, immediately following interactions with external tools. Motivated by this observation, ARPO incorporates an entropy-based adaptive rollout mechanism, dynamically balancing global trajectory sampling and step-level sampling, thereby promoting exploration at steps with high uncertainty after tool usage. By integrating an advantage attribution estimation, ARPO enables LLMs to internalize advantage differences in stepwise tool-use interactions. Our experiments across 13 challenging benchmarks in computational reasoning, knowledge reasoning, and deep search domains demonstrate ARPO's superiority over trajectory-level RL algorithms. Remarkably, ARPO achieves improved performance using only half of the tool-use budget required by existing methods, offering a scalable solution for aligning LLM-based agents with real-time dynamic environments. Our code and datasets are released at https://github.com/dongguanting/ARPO
中文: ARPO是一种创新的强化学习算法,通过基于熵的自适应执行机制动态平衡探索与利用,显著提升大语言模型在多轮工具交互中的表现,在多个推理基准测试中以更少的工具使用预算实现了更优性能。
English: ARPO is a novel reinforcement learning algorithm that enhances large language models' performance in multi-turn tool interactions by dynamically balancing exploration and exploitation through an entropy-based adaptive rollout mechanism, achieving superior results with reduced tool-use budgets across various reasoning benchmarks.
Authors:Wenjie Zhu, Yabin Zhang, Xin Jin, Wenjun Zeng, Lei Zhang
Abstract:
Out-of-distribution (OOD) detection is crucial for building reliable machine learning models. Although negative prompt tuning has enhanced the OOD detection capabilities of vision-language models, these tuned models often suffer from reduced generalization performance on unseen classes and styles. To address this challenge, we propose a novel method called Knowledge Regularized Negative Feature Tuning (KR-NFT), which integrates an innovative adaptation architecture termed Negative Feature Tuning (NFT) and a corresponding knowledge-regularization (KR) optimization strategy. Specifically, NFT applies distribution-aware transformations to pre-trained text features, effectively separating positive and negative features into distinct spaces. This separation maximizes the distinction between in-distribution (ID) and OOD images. Additionally, we introduce image-conditional learnable factors through a lightweight meta-network, enabling dynamic adaptation to individual images and mitigating sensitivity to class and style shifts. Compared to traditional negative prompt tuning, NFT demonstrates superior efficiency and scalability. To optimize this adaptation architecture, the KR optimization strategy is designed to enhance the discrimination between ID and OOD sets while mitigating pre-trained knowledge forgetting. This enhances OOD detection performance on trained ID classes while simultaneously improving OOD detection on unseen ID datasets. Notably, when trained with few-shot samples from ImageNet dataset, KR-NFT not only improves ID classification accuracy and OOD detection but also significantly reduces the FPR95 by 5.44\% under an unexplored generalization setting with unseen ID categories. Codes can be found at \href{https://github.com/ZhuWenjie98/KRNFT}.
中文: 提出的KR-NFT方法通过分布感知变换分离正负特征,并结合知识正则化优化策略,有效提升了视觉语言模型的分布外检测能力,同时增强了已知类别的分类准确性和对未知类别的泛化性能。
English: The proposed KR-NFT method enhances OOD detection in vision-language models by separating positive and negative features through distribution-aware transformations and a knowledge-regularized optimization strategy, improving both ID classification and generalization to unseen categories.
Authors:Yanrui Yu, Tianfei Zhou, Jiaxin Sun, Lianpeng Qiao, Lizhong Ding, Ye Yuan, Guoren Wang
Abstract:
In modern urban environments, camera networks generate massive amounts of operational footage -- reaching petabytes each day -- making scalable video analytics essential for efficient processing. Many existing approaches adopt an SQL-based paradigm for querying such large-scale video databases; however, this constrains queries to rigid patterns with predefined semantic categories, significantly limiting analytical flexibility. In this work, we explore a language-driven video analytics paradigm aimed at enabling flexible and efficient querying of high-volume video data driven by natural language. Particularly, we build \textsc{Lava}, a system that accepts natural language queries and retrieves traffic targets across multiple levels of granularity and arbitrary categories. \textsc{Lava} comprises three main components: 1) a multi-armed bandit-based efficient sampling method for video segment-level localization;
2) a video-specific open-world detection module for object-level retrieval; and 3) a long-term object trajectory extraction scheme for temporal object association, yielding complete trajectories for object-of-interests. To support comprehensive evaluation, we further develop a novel benchmark by providing diverse, semantically rich natural language predicates and fine-grained annotations for multiple videos. Experiments on this benchmark demonstrate that \textsc{Lava} improves $F_1$-scores for selection queries by $\mathbf{14\%}$, reduces MPAE for aggregation queries by $\mathbf{0.39}$, and achieves top-$k$ precision of $\mathbf{86\%}$, while processing videos $ \mathbf{9.6\times} $ faster than the most accurate baseline. Our code and dataset are available at https://github.com/yuyanrui/LAVA.
中文: 本文提出Lava系统,通过自然语言查询实现大规模视频数据的灵活高效分析,在查询精度和处理速度上显著优于现有方法。
English: This paper introduces Lava, a language-driven video analytics system that enables flexible and efficient querying of large-scale video data using natural language, significantly outperforming existing methods in accuracy and processing speed.
Authors:Guiping Cao, Xiangyuan Lan, Wenjian Huang, Jianguo Zhang, Dongmei Jiang, Yaowei Wang
Abstract:
Popular transformer detectors have achieved promising performance through query-based learning using attention mechanisms. However, the roles of existing decoder query types (e.g., content query and positional query) are still underexplored. These queries are generally predefined with a fixed number (fixed-query), which limits their flexibility. We find that the learning of these fixed-query is impaired by Recurrent Opposing inTeractions (ROT) between two attention operations: Self-Attention (query-to-query) and Cross-Attention (query-to-encoder), thereby degrading decoder efficiency. Furthermore, "query ambiguity" arises when shared-weight decoder layers are processed with both one-to-one and one-to-many label assignments during training, violating DETR's one-to-one matching principle. To address these challenges, we propose DS-Det, a more efficient detector capable of detecting a flexible number of objects in images. Specifically, we reformulate and introduce a new unified Single-Query paradigm for decoder modeling, transforming the fixed-query into flexible. Furthermore, we propose a simplified decoder framework through attention disentangled learning: locating boxes with Cross-Attention (one-to-many process), deduplicating predictions with Self-Attention (one-to-one process), addressing "query ambiguity" and "ROT" issues directly, and enhancing decoder efficiency. We further introduce a unified PoCoo loss that leverages box size priors to prioritize query learning on hard samples such as small objects. Extensive experiments across five different backbone models on COCO2017 and WiderPerson datasets demonstrate the general effectiveness and superiority of DS-Det. The source codes are available at https://github.com/Med-Process/DS-Det/.
中文摘要:提出的DS-Det通过引入灵活单查询范式和解耦注意力学习,解决了Transformer检测器中查询歧义和循环对抗交互的问题,实现了更优的目标检测性能。
English Summary: The proposed DS-Det addresses limitations in transformer detectors by introducing a flexible single-query paradigm and disentangled attention learning to resolve query ambiguity and recurrent opposing interactions, achieving superior object detection performance.
Authors:Peng Cai, Qiang Li, Kaicheng Yang, Dong Guo, Jia Li, Nan Zhou, Xiang An, Ninghua Yang, Jiankang Deng
Abstract:
Document image rectification aims to eliminate geometric deformation in photographed documents to facilitate text recognition. However, existing methods often neglect the significance of foreground elements, which provide essential geometric references and layout information for document image correction. In this paper, we introduce Foreground-Centric Network (ForCenNet) to eliminate geometric distortions in document images. Specifically, we initially propose a foreground-centric label generation method, which extracts detailed foreground elements from an undistorted image. Then we introduce a foreground-centric mask mechanism to enhance the distinction between readable and background regions. Furthermore, we design a curvature consistency loss to leverage the detailed foreground labels to help the model understand the distorted geometric distribution. Extensive experiments demonstrate that ForCenNet achieves new state-of-the-art on four real-world benchmarks, such as DocUNet, DIR300, WarpDoc, and DocReal. Quantitative analysis shows that the proposed method effectively undistorts layout elements, such as text lines and table borders. The resources for further comparison are provided at https://github.com/caipeng328/ForCenNet.
中文: 本文提出的ForCenNet通过聚焦前景元素并采用曲率一致性损失,有效消除文档图像的几何变形,在多个基准测试中达到了最先进的性能水平。
English: This paper introduces ForCenNet, a foreground-centric network that effectively eliminates geometric distortions in document images by leveraging detailed foreground elements and a curvature consistency loss, achieving state-of-the-art performance on multiple benchmarks.
Authors:Tianxiang Chen, Zhentao Tan, Xiaofan Bo, Yue Wu, Tao Gong, Qi Chu, Jieping Ye, Nenghai Yu
Abstract:
Effectively handling long contexts is challenging for Large Language Models (LLMs) due to the rarity of long texts, high computational demands, and substantial forgetting of short-context abilities. Recent approaches have attempted to construct long contexts for instruction tuning, but these methods often require LLMs or human interventions, which are both costly and limited in length and diversity. Also, the drop in short-context performances of present long-context LLMs remains significant. In this paper, we introduce Flora, an effortless (human/LLM-free) long-context construction strategy. Flora can markedly enhance the long-context performance of LLMs by arbitrarily assembling short instructions based on categories and instructing LLMs to generate responses based on long-context meta-instructions. This enables Flora to produce contexts of arbitrary length and scale with rich diversity, while only slightly compromising short-context performance. Experiments on Llama3-8B-Instruct and QwQ-32B show that LLMs enhanced by Flora excel in three long-context benchmarks while maintaining strong performances in short-context tasks. Our data-construction code is available at \href{https://github.com/txchen-USTC/Flora}{https://github.com/txchen-USTC/Flora}.
Chinese: Flora 是一种无需人工或大语言模型干预的高效策略,通过组合短指令并基于元指令生成响应,显著提升大语言模型的长上下文处理能力,在基准测试中表现优异,同时对短上下文性能影响甚微。
English: Flora is an efficient, human/LLM-free strategy that enhances LLMs' long-context performance by assembling short instructions and using meta-instructions, achieving superior results in benchmarks with minimal impact on short-context abilities.
Authors:Tianxiang Chen, Zhentao Tan, Xiaofan Bo, Yue Wu, Tao Gong, Qi Chu, Jieping Ye, Nenghai Yu
Abstract:
Effectively handling long contexts is challenging for Large Language Models (LLMs) due to the rarity of long texts, high computational demands, and substantial forgetting of short-context abilities. Recent approaches have attempted to construct long contexts for instruction tuning, but these methods often require LLMs or human interventions, which are both costly and limited in length and diversity. Also, the drop in short-context performances of present long-context LLMs remains significant. In this paper, we introduce Flora, an effortless (human/LLM-free) long-context construction strategy. Flora can markedly enhance the long-context performance of LLMs by arbitrarily assembling short instructions based on categories and instructing LLMs to generate responses based on long-context meta-instructions. This enables Flora to produce contexts of arbitrary length and scale with rich diversity, while only slightly compromising short-context performance. Experiments on Llama3-8B-Instruct and QwQ-32B show that LLMs enhanced by Flora excel in three long-context benchmarks while maintaining strong performances in short-context tasks. Our data-construction code is available at \href{https://github.com/txchen-USTC/Flora}{https://github.com/txchen-USTC/Flora}.
Chinese: Flora 是一种无需人工或大语言模型干预的高效策略,通过组合短指令并基于元指令生成响应,显著提升大语言模型的长上下文处理能力,在基准测试中表现优异,同时对短上下文性能影响甚微。
English: Flora is an efficient, human/LLM-free strategy that enhances LLMs' long-context performance by assembling short instructions and using meta-instructions, achieving superior results in benchmarks with minimal impact on short-context abilities.
Authors:Kanglin Qu, Pan Gao, Qun Dai, Yuanhao Sun
Abstract:
The attention mechanism has become a dominant operator in point cloud learning, but its quadratic complexity leads to limited inter-point interactions, hindering long-range dependency modeling between objects. Due to excellent long-range modeling capability with linear complexity, the selective state space model (S6), as the core of Mamba, has been exploited in point cloud learning for long-range dependency interactions over the entire point cloud. Despite some significant progress, related works still suffer from imperfect point cloud serialization and lack of locality learning. To this end, we explore a state space model-based point cloud network termed HydraMamba to address the above challenges. Specifically, we design a shuffle serialization strategy, making unordered point sets better adapted to the causal nature of S6. Meanwhile, to overcome the deficiency of existing techniques in locality learning, we propose a ConvBiS6 layer, which is capable of capturing local geometries and global context dependencies synergistically. Besides, we propose MHS6 by extending the multi-head design to S6, further enhancing its modeling capability. HydraMamba achieves state-of-the-art results on various tasks at both object-level and scene-level. The code is available at https://github.com/Point-Cloud-Learning/HydraMamba.
中文:HydraMamba通过引入随机序列化策略和ConvBiS6层,有效提升了点云学习中长程依赖建模和局部几何特征捕获能力,在多项任务中取得了领先性能。
English: HydraMamba introduces a shuffle serialization strategy and a ConvBiS6 layer to enhance point cloud learning by improving long-range dependency modeling and local geometry capture, achieving state-of-the-art results across tasks.
Authors:Seunghun Lee, Jiwan Seo, Minwoo Choi, Kiljoon Han, Jaehoon Jeong, Zane Durante, Ehsan Adeli, Sang Hyun Park, Sunghoon Im
Abstract:
In this paper, we present Latest Object Memory Management (LOMM) for temporally consistent video instance segmentation that significantly improves long-term instance tracking. At the core of our method is Latest Object Memory (LOM), which robustly tracks and continuously updates the latest states of objects by explicitly modeling their presence in each frame. This enables consistent tracking and accurate identity management across frames, enhancing both performance and reliability through the VIS process. Moreover, we introduce Decoupled Object Association (DOA), a strategy that separately handles newly appearing and already existing objects. By leveraging our memory system, DOA accurately assigns object indices, improving matching accuracy and ensuring stable identity consistency, even in dynamic scenes where objects frequently appear and disappear. Extensive experiments and ablation studies demonstrate the superiority of our method over traditional approaches, setting a new benchmark in VIS. Notably, our LOMM achieves state-of-the-art AP score of 54.0 on YouTube-VIS 2022, a dataset known for its challenging long videos. Project page: https://seung-hun-lee.github.io/projects/LOMM/
Chinese: 本文提出最新对象记忆管理(LOMM)方法,通过最新对象记忆系统和解耦对象关联策略,显著提升视频实例分割中的长期目标跟踪性能,在YouTube-VIS 2022数据集上以54.0 AP分数创下最新纪录。
English: This paper introduces Latest Object Memory Management (LOMM), a novel method for video instance segmentation that enhances long-term object tracking through a Latest Object Memory system and Decoupled Object Association strategy, achieving state-of-the-art performance with a 54.0 AP score on YouTube-VIS 2022.
Authors:Lin Ren, Guohui Xiao, Guilin Qi, Yishuai Geng, Haohan Xue
Abstract:
Answer Set Programming (ASP) is a powerful paradigm for non-monotonic reasoning. Recently, large language models (LLMs) have demonstrated promising capabilities in logical reasoning. Despite this potential, current evaluations of LLM capabilities in ASP are often limited. Existing works normally employ overly simplified ASP programs, do not support negation, disjunction, or multiple answer sets. Furthermore, there is a lack of benchmarks that introduce tasks specifically designed for ASP solving. To bridge this gap, we introduce ASPBench, a comprehensive ASP benchmark, including three ASP specific tasks: ASP entailment, answer set verification, and answer set computation. Our extensive evaluations on ASPBench reveal that while 14 state-of-the-art LLMs, including \emph{deepseek-r1}, \emph{o4-mini}, and \emph{gemini-2.5-flash-thinking}, perform relatively well on the first two simpler tasks, they struggle with answer set computation, which is the core of ASP solving. These findings offer insights into the current limitations of LLMs in ASP solving. This highlights the need for new approaches that integrate symbolic reasoning capabilities more effectively. The code and dataset are available at https://github.com/HomuraT/ASPBench.
中文: ASPBench这一新基准测试表明,尽管大语言模型在ASP蕴含和验证等简单任务上表现尚可,但在核心的答案集计算任务上存在明显不足,凸显了加强符号推理能力融合的必要性。
English: ASPBench, a new benchmark for Answer Set Programming, reveals that while large language models handle simpler tasks like entailment and verification, they struggle with the core task of answer set computation, highlighting the need for better integration of symbolic reasoning.
Authors:Yinzhou Tang, Huandong Wang, Xiaochen Fan, Yong Li
Abstract:
The vulnerability of cities to natural disasters has increased with urbanization and climate change, making it more important to predict human mobility in the disaster scenarios for downstream tasks including location-based early disaster warning and pre-allocating rescue resources, etc. However, existing human mobility prediction models are mainly designed for normal scenarios, and fail to adapt to disaster scenarios due to the shift of human mobility patterns under disaster. To address this issue, we introduce \textbf{DisasterMobLLM}, a mobility prediction framework for disaster scenarios that can be integrated into existing deep mobility prediction methods by leveraging LLMs to model the mobility intention and transferring the common knowledge of how different disasters affect mobility intentions between cities. This framework utilizes a RAG-Enhanced Intention Predictor to forecast the next intention, refines it with an LLM-based Intention Refiner, and then maps the intention to an exact location using an Intention-Modulated Location Predictor. Extensive experiments illustrate that DisasterMobLLM can achieve a 32.8\% improvement in terms of Acc@1 and a 35.0\% improvement in terms of the F1-score of predicting immobility compared to the baselines. The code is available at https://github.com/tsinghua-fib-lab/DisasterMobLLM.
中文总结:DisasterMobLLM是一种创新框架,通过利用大语言模型模拟移动意图并迁移跨城市灾害知识,显著提升了自然灾害场景下人类移动预测的准确性。
English Summary: DisasterMobLLM is a novel framework that leverages large language models to significantly improve human mobility prediction during natural disasters by modeling mobility intentions and transferring cross-city disaster knowledge.
Authors:Liyang Wang, Shiqian Wu, Shun Fang, Qile Zhu, Jiaxin Wu, Sos Again
Abstract:
Moving target detection is a challenging computer vision task aimed at generating accurate segmentation maps in diverse in-the-wild color videos captured by static cameras. If backgrounds and targets can be simultaneously extracted and recombined, such synthetic data can significantly enrich annotated in-the-wild datasets and enhance the generalization ability of deep models. Quaternion-based RPCA (QRPCA) is a promising unsupervised paradigm for color image processing. However, in color video processing, Quaternion Singular Value Decomposition (QSVD) incurs high computational costs, and rank-1 quaternion matrix fails to yield rank-1 color channels. In this paper, we reduce the computational complexity of QSVD to o(1) by utilizing a quaternion Riemannian manifold. Furthermor, we propose the universal QRPCA (uQRPCA) framework, which achieves a balance in simultaneously segmenting targets and recovering backgrounds from color videos. Moreover, we expand to uQRPCA+ by introducing the Color Rank-1 Batch (CR1B) method to further process and obtain the ideal low-rank background across color channels. Experiments demonstrate our uQRPCA+ achieves State Of The Art (SOTA) performance on moving target detection and background recovery tasks compared to existing open-source methods. Our implementation is publicly available on GitHub at https://github.com/Ruchtech/uQRPCA
中文: 本文提出的uQRPCA+框架通过四元数黎曼流形降低计算复杂度,并采用颜色秩一批处理方法,在彩色视频运动目标检测和背景恢复任务中实现了最先进的性能。
English: This paper introduces the uQRPCA+ framework, which leverages quaternion Riemannian manifolds to reduce computational complexity and employs the Color Rank-1 Batch method to achieve state-of-the-art performance in moving target detection and background recovery from color videos.
Authors:Zhaoliang Zheng, Xu Han, Yuxin Bao, Yun Zhang, Johnson Liu, Zonglin Meng, Xin Xia, Jiaqi Ma
Abstract:
Cooperative Driving Automation (CDA) has garnered increasing research attention, yet the role of intelligent infrastructure remains insufficiently explored. Existing solutions offer limited support for addressing long-tail challenges, real-synthetic data fusion, and heterogeneous sensor management. This paper introduces CDA-SimBoost, a unified framework that constructs infrastructure-centric simulation environments from real-world data. CDA-SimBoost consists of three main components: a Digital Twin Builder for generating high-fidelity simulator assets based on sensor and HD map data, OFDataPip for processing both online and offline data streams, and OpenCDA-InfraX, a high-fidelity platform for infrastructure-focused simulation. The system supports realistic scenario construction, rare event synthesis, and scalable evaluation for CDA research. With its modular architecture and standardized benchmarking capabilities, CDA-SimBoost bridges real-world dynamics and virtual environments, facilitating reproducible and extensible infrastructure-driven CDA studies. All resources are publicly available at https://github.com/zhz03/CDA-SimBoost
中文摘要:本文提出CDA-SimBoost这一以智能基础设施为核心的统一仿真框架,通过数字孪生构建和高保真仿真平台,有效连接真实驾驶场景与虚拟环境,为协同驾驶自动化研究提供可扩展的标准化测试方案。
English Summary: This paper introduces CDA-SimBoost, a unified infrastructure-centric simulation framework that bridges real-world data with virtual environments to address challenges in Cooperative Driving Automation research, featuring modular components for digital twin construction and scalable evaluation.
Authors:Faruk Alpay, Hamdi Alakkad, Bugra Kilictas, Taylan Alpay
Abstract:
We develop an operator algebraic framework for infinite games with a continuum of agents and prove that regret based learning dynamics governed by a noncommutative continuity equation converge to a unique quantal response equilibrium under mild regularity assumptions. The framework unifies functional analysis, coarse geometry and game theory by assigning to every game a von Neumann algebra that represents collective strategy evolution. A reflective regret operator within this algebra drives the flow of strategy distributions and its fixed point characterises equilibrium. We introduce the ordinal folding index, a computable ordinal valued metric that measures the self referential depth of the dynamics, and show that it bounds the transfinite time needed for convergence, collapsing to zero on coarsely amenable networks. The theory yields new invariant subalgebra rigidity results, establishes existence and uniqueness of envy free and maximin share allocations in continuum economies, and links analytic properties of regret flows with empirical stability phenomena in large language models. These contributions supply a rigorous mathematical foundation for large scale multi agent systems and demonstrate the utility of ordinal metrics for equilibrium selection.
中文摘要:本研究提出了一个针对连续体智能体无限博弈的算子代数框架,证明了基于遗憾的学习动态会收敛到唯一的量子响应均衡,并为大规模多智能体系统建立了严格的数学基础。
English Summary: This study introduces an operator algebraic framework for infinite games with a continuum of agents, demonstrating that regret-based learning converges to a unique quantal response equilibrium and providing new mathematical foundations for large-scale multi-agent systems.
Authors:Maria Emilia Mazzolenis, Ruirui Zhang
Abstract:
Large language models (LLMs) are increasingly applied in task-oriented dialogue (TOD) systems but often struggle with long, conditional workflows that involve external tool calls and depend on user-specific information. We present Workflow Adherence via Runtime Parallel Personalization, or WARPP, a training-free, modular framework that combines multi-agent orchestration with runtime personalization to improve workflow adherence in LLM-based systems. By dynamically pruning conditional branches based on user attributes, the framework reduces reasoning overhead and narrows tool selection at runtime. WARPP deploys a parallelized architecture where a dedicated Personalizer agent operates alongside modular, domain-specific agents to dynamically tailor execution paths in real time. The framework is evaluated across five representative user intents of varying complexity within three domains: banking, flights, and healthcare. Our evaluation leverages synthetic datasets and LLM-powered simulated users to test scenarios with conditional dependencies. Our results demonstrate that WARPP outperforms both the non-personalized method and the ReAct baseline, achieving increasingly larger gains in parameter fidelity and tool accuracy as intent complexity grows, while also reducing average token usage, without any additional training.
中文摘要:WARPP框架通过多智能体协同与实时个性化相结合,无需额外训练即可动态优化基于大语言模型的任务型对话系统的工作流执行效果。
English summary: The WARPP framework enhances LLM-based task-oriented dialogue systems by integrating multi-agent orchestration with runtime personalization, dynamically optimizing workflow execution without requiring additional training.
Authors:Chenchen Zhao, Zhengyuan Shi, Xiangyu Wen, Chengjie Liu, Yi Liu, Yunhao Zhou, Yuxiang Zhao, Hefei Feng, Yinan Zhu, Gwok-Waa Wan, Xin Cheng, Weiyu Chen, Yongqi Fu, Chujie Chen, Chenhao Xue, Guangyu Sun, Ying Wang, Yibo Lin, Jun Yang, Ning Xu, Xi Wang, Qiang Xu
Abstract:
The emergence of multimodal large language models (MLLMs) presents promising opportunities for automation and enhancement in Electronic Design Automation (EDA). However, comprehensively evaluating these models in circuit design remains challenging due to the narrow scope of existing benchmarks. To bridge this gap, we introduce MMCircuitEval, the first multimodal benchmark specifically designed to assess MLLM performance comprehensively across diverse EDA tasks. MMCircuitEval comprises 3614 meticulously curated question-answer (QA) pairs spanning digital and analog circuits across critical EDA stages - ranging from general knowledge and specifications to front-end and back-end design. Derived from textbooks, technical question banks, datasheets, and real-world documentation, each QA pair undergoes rigorous expert review for accuracy and relevance. Our benchmark uniquely categorizes questions by design stage, circuit type, tested abilities (knowledge, comprehension, reasoning, computation), and difficulty level, enabling detailed analysis of model capabilities and limitations. Extensive evaluations reveal significant performance gaps among existing LLMs, particularly in back-end design and complex computations, highlighting the critical need for targeted training datasets and modeling approaches. MMCircuitEval provides a foundational resource for advancing MLLMs in EDA, facilitating their integration into real-world circuit design workflows. Our benchmark is available at https://github.com/cure-lab/MMCircuitEval.
中文:MMCircuitEval基准的推出旨在全面评估多模态大语言模型在电子设计自动化中的表现,揭示了显著的性能差距,并为提升这些模型在电路设计中的应用提供了基础资源。
English: The MMCircuitEval benchmark is introduced to comprehensively evaluate multimodal large language models in Electronic Design Automation, revealing significant performance gaps and providing a foundational resource for advancing these models in circuit design.
Authors:Xingyu Su, Xiner Li, Yuchao Lin, Ziqian Xie, Degui Zhi, Shuiwang Ji
Abstract:
We consider controllable DNA sequence design, where sequences are generated by conditioning on specific biological properties. While language models (LMs) such as GPT and BERT have achieved remarkable success in natural language generation, their application to DNA sequence generation remains largely underexplored. In this work, we introduce ATGC-Gen, an Automated Transformer Generator for Controllable Generation, which leverages cross-modal encoding to integrate diverse biological signals. ATGC-Gen is instantiated with both decoder-only and encoder-only transformer architectures, allowing flexible training and generation under either autoregressive or masked recovery objectives. We evaluate ATGC-Gen on representative tasks including promoter and enhancer sequence design, and further introduce a new dataset based on ChIP-Seq experiments for modeling protein binding specificity. Our experiments demonstrate that ATGC-Gen can generate fluent, diverse, and biologically relevant sequences aligned with the desired properties. Compared to prior methods, our model achieves notable improvements in controllability and functional relevance, highlighting the potential of language models in advancing programmable genomic design. The source code is released at (https://github.com/divelab/AIRS/blob/main/OpenBio/ATGC_Gen).
中文: 本研究提出ATGC-Gen这一基于Transformer的可控DNA序列生成模型,通过跨模态编码整合生物信号,在生成生物相关性序列方面较现有方法展现出更优的可控性和功能适配性。
English: This work introduces ATGC-Gen, a transformer-based model for controllable DNA sequence generation that integrates biological signals through cross-modal encoding, demonstrating superior performance in producing biologically relevant sequences compared to prior methods.
Authors:Jovana Kondic, Pengyuan Li, Dhiraj Joshi, Zexue He, Shafiq Abedin, Jennifer Sun, Ben Wiesel, Eli Schwartz, Ahmed Nassar, Bo Wu, Assaf Arbelle, Aude Oliva, Dan Gutfreund, Leonid Karlinsky, Rogerio Feris
Abstract:
Chart-to-code reconstruction -- the task of recovering executable plotting scripts from chart images -- provides important insights into a model's ability to ground data visualizations in precise, machine-readable form. Yet many existing multimodal benchmarks largely focus primarily on answering questions about charts or summarizing them. To bridge this gap, we present ChartGen, a fully-automated pipeline for code-guided synthetic chart generation. Starting from seed chart images, ChartGen (i) prompts a vision-language model (VLM) to reconstruct each image into a python script, and (ii) iteratively augments that script with a code-oriented large language model (LLM). Using ChartGen, we create 222.5K unique chart-image code pairs from 13K seed chart images, and present an open-source synthetic chart dataset covering 27 chart types, 11 plotting libraries, and multiple data modalities (image, code, text, CSV, DocTags). From this corpus, we curate a held-out chart-to-code evaluation subset of 4.3K chart image-code pairs, and evaluate six open-weight VLMs (3B - 26B parameters), highlighting substantial room for progress. We release the pipeline, prompts, and the dataset to help accelerate efforts towards robust chart understanding and vision-conditioned code generation: https://github.com/SD122025/ChartGen/
中文:本文提出了ChartGen,一个自动化生成图表-代码对的流程,旨在填补多模态基准在图表到代码重建任务上的空白,通过创建包含多种图表类型和绘图库的数据集,并评估了多个视觉语言模型的性能。
English: This paper introduces ChartGen, an automated pipeline that generates synthetic chart-image code pairs to advance chart-to-code reconstruction, addressing a gap in multimodal benchmarks by creating a comprehensive dataset and evaluating several vision-language models.
Authors:Xuehui Wang, Zhenyu Wu, JingJing Xie, Zichen Ding, Bowen Yang, Zehao Li, Zhaoyang Liu, Qingyun Li, Xuan Dong, Zhe Chen, Weiyun Wang, Xiangyu Zhao, Jixuan Chen, Haodong Duan, Tianbao Xie, Chenyu Yang, Shiqian Su, Yue Yu, Yuan Huang, Yiqian Liu, Xiao Zhang, Yanting Zhang, Xiangyu Yue, Weijie Su, Xizhou Zhu, Wei Shen, Jifeng Dai, Wenhai Wang
Abstract:
We introduce MMBench-GUI, a hierarchical benchmark for evaluating GUI automation agents across Windows, macOS, Linux, iOS, Android, and Web platforms. It comprises four levels: GUI Content Understanding, Element Grounding, Task Automation, and Task Collaboration, covering essential skills for GUI agents. In addition, we propose a novel Efficiency-Quality Area (EQA) metric to assess GUI agent execution efficiency in online automation scenarios. Through MMBench-GUI, we identify accurate visual grounding as a critical determinant of overall task success, emphasizing the substantial benefits of modular frameworks that integrate specialized grounding modules. Furthermore, to achieve reliable GUI automation, an agent requires strong task planning and cross-platform generalization abilities, with long-context memory, a broad action space, and long-term reasoning playing a critical role. More important, task efficiency remains a critically underexplored dimension, and all models suffer from substantial inefficiencies, with excessive redundant steps even when tasks are ultimately completed. The integration of precise localization, effective planning, and early stopping strategies is indispensable to enable truly efficient and scalable GUI automation. Our benchmark code, evaluation data, and running environment will be publicly available at https://github.com/open-compass/MMBench-GUI.
中文:MMBench-GUI是一个跨平台的GUI自动化代理分层评测基准,包含四个技能层级和创新的EQA效率评估指标,强调精准视觉定位、任务规划与效率优化对实现可靠自动化的重要作用。
English: MMBench-GUI is a cross-platform hierarchical benchmark for evaluating GUI automation agents, featuring four skill levels and a novel EQA metric to assess efficiency, while highlighting the critical roles of visual grounding, task planning, and efficiency optimization for reliable automation.
Authors:Chen Zhu, Wangbo Zhao, Huiwen Zhang, Samir Khaki, Yuhao Zhou, Weidong Tang, Shuo Wang, Zhihang Yuan, Yuzhang Shang, Xiaojiang Peng, Kai Wang, Dawei Yang
Abstract:
Vision Transformers (ViTs) have emerged as a foundational model in computer vision, excelling in generalization and adaptation to downstream tasks. However, deploying ViTs to support diverse resource constraints typically requires retraining multiple, size-specific ViTs, which is both time-consuming and energy-intensive. To address this issue, we propose an efficient ViT adaptation framework that enables a single adaptation process to generate multiple models of varying sizes for deployment on platforms with various resource constraints. Our approach comprises two stages. In the first stage, we enhance a pre-trained ViT with a nested elastic architecture that enables structural flexibility across MLP expansion ratio, number of attention heads, embedding dimension, and network depth. To preserve pre-trained knowledge and ensure stable adaptation, we adopt a curriculum-based training strategy that progressively increases elasticity. In the second stage, we design a lightweight router to select submodels according to computational budgets and downstream task demands. Initialized with Pareto-optimal configurations derived via a customized NSGA-II algorithm, the router is then jointly optimized with the backbone. Extensive experiments on multiple benchmarks demonstrate the effectiveness and versatility of EA-ViT. The code is available at https://github.com/zcxcf/EA-ViT.
中文摘要:本文提出EA-ViT框架,通过嵌套弹性架构和渐进式课程训练,仅需一次适配即可将预训练视觉Transformer转换为多种尺寸的部署模型,有效解决了传统方法需为不同资源限制重复训练多个模型的问题。
English Summary: The paper introduces EA-ViT, an efficient adaptation framework that transforms a single pre-trained Vision Transformer into multiple scalable models through a nested elastic architecture and curriculum training, eliminating the need for retraining separate models for different resource constraints.
Authors:Minghao Tang, Shiyu Ni, Jiafeng Guo, Keping Bi
Abstract:
Retrieval-augmented generation (RAG) has been widely adopted to augment large language models (LLMs) with external knowledge for knowledge-intensive tasks. However, its effectiveness is often undermined by the presence of noisy (i.e., low-quality) retrieved passages. Enhancing LLMs' robustness to such noise is critical for improving the reliability of RAG systems. Recent advances have equipped LLMs with strong reasoning and self-reflection capabilities, allowing them to identify and correct errors in their reasoning process. Inspired by this ability, we propose Passage Injection-a simple yet effective method that explicitly incorporates retrieved passages into LLMs' reasoning process, aiming to enhance the model's ability to recognize and resist noisy passages. We validate Passage Injection under general RAG settings using BM25 as the retriever. Experiments on four reasoning-enhanced LLMs across four factual QA datasets demonstrate that Passage Injection significantly improves overall RAG performance. Further analysis on two noisy retrieval settings-random noise, where the model is provided irrelevant passages, and counterfactual noise, where it is given misleading passages-shows that Passage Injection consistently improves robustness. Controlled experiments confirm that Passage Injection can also effectively leverage helpful passages. These findings suggest that incorporating passages in LLMs' reasoning process is a promising direction for building more robust RAG systems. The code can be found \href{here}{https://github.com/mh-tang/Passage-Injection}.
Chinese: 段落注入是一种创新方法,通过将检索到的段落显式融入大语言模型的推理过程,有效增强其对噪声信息的鲁棒性,从而提升检索增强生成系统的整体性能。
English: Passage Injection is a novel method that explicitly integrates retrieved passages into large language models' reasoning process to enhance their robustness against noisy information and improve overall performance in retrieval-augmented generation systems.
Authors:Minghao Tang, Shiyu Ni, Jiafeng Guo, Keping Bi
Abstract:
Retrieval-augmented generation (RAG) has been widely adopted to augment large language models (LLMs) with external knowledge for knowledge-intensive tasks. However, its effectiveness is often undermined by the presence of noisy (i.e., low-quality) retrieved passages. Enhancing LLMs' robustness to such noise is critical for improving the reliability of RAG systems. Recent advances have equipped LLMs with strong reasoning and self-reflection capabilities, allowing them to identify and correct errors in their reasoning process. Inspired by this ability, we propose Passage Injection-a simple yet effective method that explicitly incorporates retrieved passages into LLMs' reasoning process, aiming to enhance the model's ability to recognize and resist noisy passages. We validate Passage Injection under general RAG settings using BM25 as the retriever. Experiments on four reasoning-enhanced LLMs across four factual QA datasets demonstrate that Passage Injection significantly improves overall RAG performance. Further analysis on two noisy retrieval settings-random noise, where the model is provided irrelevant passages, and counterfactual noise, where it is given misleading passages-shows that Passage Injection consistently improves robustness. Controlled experiments confirm that Passage Injection can also effectively leverage helpful passages. These findings suggest that incorporating passages in LLMs' reasoning process is a promising direction for building more robust RAG systems. The code can be found \href{here}{https://github.com/Trustworthy-Information-Access/Passage-Injection}.
Chinese: 段落注入是一种创新方法,通过将检索到的段落显式融入大语言模型的推理过程,有效增强其对噪声信息的鲁棒性,从而提升检索增强生成系统的整体性能。
English: Passage Injection is a novel method that explicitly integrates retrieved passages into large language models' reasoning process to enhance their robustness against noisy information and improve overall performance in retrieval-augmented generation systems.
Authors:Muhammad Ibrahim, Naveed Akhtar, Haitian Wang, Saeed Anwar, Ajmal Mian
Abstract:
Fusion of LiDAR and RGB data has the potential to enhance outdoor 3D object detection accuracy. To address real-world challenges in outdoor 3D object detection, fusion of LiDAR and RGB input has started gaining traction. However, effective integration of these modalities for precise object detection task still remains a largely open problem. To address that, we propose a MultiStream Detection (MuStD) network, that meticulously extracts task-relevant information from both data modalities. The network follows a three-stream structure. Its LiDAR-PillarNet stream extracts sparse 2D pillar features from the LiDAR input while the LiDAR-Height Compression stream computes Bird's-Eye View features. An additional 3D Multimodal stream combines RGB and LiDAR features using UV mapping and polar coordinate indexing. Eventually, the features containing comprehensive spatial, textural and geometric information are carefully fused and fed to a detection head for 3D object detection. Our extensive evaluation on the challenging KITTI Object Detection Benchmark using public testing server at https://www.cvlibs.net/datasets/kitti/eval_object_detail.php?&result=d162ec699d6992040e34314d19ab7f5c217075e0 establishes the efficacy of our method by achieving new state-of-the-art or highly competitive results in different categories while remaining among the most efficient methods. Our code will be released through MuStD GitHub repository at https://github.com/IbrahimUWA/MuStD.git
Chinese: 提出的多流检测(MuStD)网络通过三流架构有效融合激光雷达与RGB数据,在KITTI基准测试中实现了最优的三维物体检测性能,同时保持高效运行。
English: The proposed MultiStream Detection (MuStD) network effectively fuses LiDAR and RGB data through a three-stream architecture to achieve state-of-the-art 3D object detection performance on the KITTI benchmark while maintaining high efficiency.
Authors:Guoping Xu, Yan Dai, Hengrui Zhao, Ying Zhang, Jie Deng, Weiguo Lu, You Zhang
Abstract:
Purpose: Accurate tumor segmentation is vital for adaptive radiation therapy (ART) but remains time-consuming and user-dependent. Segment Anything Model 2 (SAM2) shows promise for prompt-based segmentation but struggles with tumor accuracy. We propose prior knowledge-based augmentation strategies to enhance SAM2 for ART.
Methods: Two strategies were introduced to improve SAM2: (1) using prior MR images and annotations as contextual inputs, and (2) improving prompt robustness via random bounding box expansion and mask erosion/dilation. The resulting model, SAM2-Aug, was fine-tuned and tested on the One-Seq-Liver dataset (115 MRIs from 31 liver cancer patients), and evaluated without retraining on Mix-Seq-Abdomen (88 MRIs, 28 patients) and Mix-Seq-Brain (86 MRIs, 37 patients).
Results: SAM2-Aug outperformed convolutional, transformer-based, and prompt-driven models across all datasets, achieving Dice scores of 0.86(liver), 0.89(abdomen), and 0.90(brain). It demonstrated strong generalization across tumor types and imaging sequences, with improved performance in boundary-sensitive metrics.
Conclusions: Incorporating prior images and enhancing prompt diversity significantly boosts segmentation accuracy and generalizability. SAM2-Aug offers a robust, efficient solution for tumor segmentation in ART. Code and models will be released at https://github.com/apple1986/SAM2-Aug.
中文摘要:本研究通过整合先验知识和增强提示多样性,提升了Segment Anything Model 2(SAM2)在自适应放疗肿瘤分割中的性能,使其在多个数据集上展现出卓越的准确性和泛化能力。
English Summary: The study enhances the Segment Anything Model 2 (SAM2) for tumor segmentation in adaptive radiation therapy by incorporating prior knowledge and improving prompt robustness, resulting in superior performance across diverse datasets.
Authors:An Xiang, Zixuan Huang, Xitong Gao, Kejiang Ye, Cheng-zhong Xu
Abstract:
Industrial anomaly detection for 2D objects has gained significant attention and achieved progress in anomaly detection (AD) methods. However, identifying 3D depth anomalies using only 2D information is insufficient. Despite explicitly fusing depth information into RGB images or using point cloud backbone networks to extract depth features, both approaches struggle to adequately represent 3D information in multimodal scenarios due to the disparities among different modal information. Additionally, due to the scarcity of abnormal samples in industrial data, especially in multimodal scenarios, it is necessary to perform anomaly generation to simulate real-world abnormal samples. Therefore, we propose a novel unified multimodal anomaly detection framework to address these issues. Our contributions consist of 3 key aspects. (1) We extract visible depth information from 3D point cloud data simply and use 2D RGB images to represent appearance, which disentangles depth and appearance to support unified anomaly generation. (2) Benefiting from the flexible input representation, the proposed Multi-Scale Gaussian Anomaly Generator and Unified Texture Anomaly Generator can generate richer anomalies in RGB and depth. (3) All modules share parameters for both RGB and depth data, effectively bridging 2D and 3D anomaly detection. Subsequent modules can directly leverage features from both modalities without complex fusion. Experiments show our method outperforms state-of-the-art (SOTA) on MVTec-3D AD and Eyecandies datasets. Code available at: https://github.com/Xantastic/BridgeNet
中文: 本文提出了一种统一的多模态异常检测框架,通过分离深度与外观特征实现高效异常生成,在三维工业数据集上超越了现有最优方法。
English: This paper introduces a unified multimodal anomaly detection framework that disentangles depth and appearance features to enable effective anomaly generation and outperforms state-of-the-art methods on 3D industrial datasets.
Authors:Yifan Zhang
Abstract:
Autoregressive language models achieve remarkable performance, yet a unified theory explaining their internal mechanisms--how training shapes their representations and enables complex behaviors--remains elusive. We introduce a new analytical framework that models the single-step generation process as a composition of information-processing stages using the language of Markov categories. This compositional perspective provides a unified mathematical language to connect three critical aspects of language modeling that are typically studied in isolation: the training objective, the geometry of the learned representation space, and practical model capabilities. First, our framework provides a precise information-theoretic rationale for the success of multi-token prediction methods like speculative decoding, quantifying the "information surplus" a model's hidden state contains about tokens beyond the immediate next one. Second, we clarify how the standard negative log-likelihood (NLL) objective compels the model to learn not just the next word, but also the data's intrinsic conditional uncertainty, a process we formalize using categorical entropy. Our central result reveals that NLL training functions as an implicit form of spectral contrastive learning. We prove that, for common model architectures, this simple predictive objective forces the model to sculpt a geometrically structured representation space, implicitly aligning representations with the eigenspectrum of a "predictive similarity" operator. This work offers a powerful new lens to understand how information flows through a model and how the training objective shapes its internal geometry, thereby bridging the gap between learning theory and the practical success of large language models.
中文摘要:本文提出了一种基于马尔可夫范畴的组合分析框架,将自回归语言模型的训练目标、表示空间几何与实践能力相统一,通过信息论阐释了多标记预测等现象,并揭示了负对数似然目标诱导表示空间与预测算子特征谱对齐的机制。
English Summary: This paper introduces a compositional framework using Markov categories to unify the training objective, representation geometry, and capabilities of autoregressive language models, explaining phenomena like multi-token prediction and spectral alignment through information theory.
Authors:Yifan Zhang
Abstract:
Autoregressive language models achieve remarkable performance, yet a unified theory explaining their internal mechanisms, how training shapes their representations, and enables complex behaviors, remains elusive. We introduce a new analytical framework that models the single-step generation process as a composition of information-processing stages using the language of Markov categories. This compositional perspective provides a unified mathematical language to connect three critical aspects of language modeling that are typically studied in isolation: the training objective, the geometry of the learned representation space, and practical model capabilities. First, our framework provides a precise information-theoretic rationale for the success of multi-token prediction methods like speculative decoding, quantifying the information surplus a model's hidden state contains about tokens beyond the immediate next one. Second, we clarify how the standard negative log-likelihood (NLL) objective compels the model to learn not just the next word, but also the data's intrinsic conditional uncertainty, a process we formalize using categorical entropy. Our central result shows that, under a linear-softmax head with bounded features, minimizing NLL induces spectral alignment: the learned representation space aligns with the eigenspectrum of a predictive similarity operator. This work presents a powerful new lens for understanding how information flows through a model and how the training objective shapes its internal geometry.
中文摘要:本文提出了一种基于马尔可夫范畴的组合分析框架,将自回归语言模型的训练目标、表示空间几何与实践能力相统一,通过信息论阐释了多标记预测等现象,并揭示了负对数似然目标诱导表示空间与预测算子特征谱对齐的机制。
English Summary: This paper introduces a compositional framework using Markov categories to unify the training objective, representation geometry, and capabilities of autoregressive language models, explaining phenomena like multi-token prediction and spectral alignment through information theory.
Authors:Jiaru Zhong, Jiahao Wang, Jiahui Xu, Xiaofan Li, Zaiqing Nie, Haibao Yu
Abstract:
Cooperative perception aims to address the inherent limitations of single-vehicle autonomous driving systems through information exchange among multiple agents. Previous research has primarily focused on single-frame perception tasks. However, the more challenging cooperative sequential perception tasks, such as cooperative 3D multi-object tracking, have not been thoroughly investigated. Therefore, we propose CoopTrack, a fully instance-level end-to-end framework for cooperative tracking, featuring learnable instance association, which fundamentally differs from existing approaches. CoopTrack transmits sparse instance-level features that significantly enhance perception capabilities while maintaining low transmission costs. Furthermore, the framework comprises two key components: Multi-Dimensional Feature Extraction, and Cross-Agent Association and Aggregation, which collectively enable comprehensive instance representation with semantic and motion features, and adaptive cross-agent association and fusion based on a feature graph. Experiments on both the V2X-Seq and Griffin datasets demonstrate that CoopTrack achieves excellent performance. Specifically, it attains state-of-the-art results on V2X-Seq, with 39.0\% mAP and 32.8\% AMOTA. The project is available at https://github.com/zhongjiaru/CoopTrack.
协作感知通过多智能体信息共享提升自动驾驶能力,CoopTrack提出创新的实例级框架实现高效追踪,在基准测试中达到顶尖性能。
Cooperative perception enhances autonomous driving by enabling multi-agent information sharing, with CoopTrack introducing an innovative instance-level framework for efficient tracking that achieves top performance on benchmark datasets.
Authors:Tianfu Wang, Liwei Deng, Xi Chen, Junyang Wang, Huiguo He, Leilei Ding, Wei Wu, Qilin Fan, Hui Xiong
Abstract:
Resource allocation (RA) is critical to efficient service deployment in Network Function Virtualization (NFV), a transformative networking paradigm. Recently, deep Reinforcement Learning (RL)-based methods have been showing promising potential to address this complexity. However, the lack of a systematic benchmarking framework and thorough analysis hinders the exploration of emerging networks and the development of more robust algorithms while causing inconsistent evaluation. In this paper, we introduce Virne, a comprehensive benchmarking framework for the NFV-RA problem, with a focus on supporting deep RL-based methods. Virne provides customizable simulations for diverse network scenarios, including cloud, edge, and 5G environments. It also features a modular and extensible implementation pipeline that supports over 30 methods of various types, and includes practical evaluation perspectives beyond effectiveness, such as scalability, generalization, and scalability. Furthermore, we conduct in-depth analysis through extensive experiments to provide valuable insights into performance trade-offs for efficient implementation and offer actionable guidance for future research directions. Overall, with its diverse simulations, rich implementations, and extensive evaluation capabilities, Virne could serve as a comprehensive benchmark for advancing NFV-RA methods and deep RL applications. The code is publicly available at https://github.com/GeminiLight/virne.
中文: 本文介绍了Virne,一个全面的NFV资源分配基准框架,支持深度强化学习方法,提供可定制模拟和广泛评估功能,以推动该领域的研究进展。
English: This paper introduces Virne, a comprehensive benchmarking framework for NFV resource allocation that supports deep reinforcement learning methods with customizable simulations and extensive evaluation capabilities to advance research in the field.
Authors:Zi Liang, Liantong Yu, Shiyu Zhang, Qingqing Ye, Haibo Hu
Abstract:
Overestimation in evaluating large language models (LLMs) has become an increasing concern. Due to the contamination of public benchmarks or imbalanced model training, LLMs may achieve unreal evaluation results on public benchmarks, either intentionally or unintentionally, which leads to unfair comparisons among LLMs and undermines their realistic capability assessments. Existing benchmarks attempt to address these issues by keeping test cases permanently secret, mitigating contamination through human evaluation, or repeatedly collecting and constructing new samples. However, these approaches fail to ensure reproducibility, transparency, and high efficiency simultaneously. Moreover, the extent of overestimation in current LLMs remains unquantified. To address these issues, we propose ArxivRoll, a dynamic evaluation framework inspired by one-time pad encryption in cryptography. ArxivRoll comprises two key components: \emph{i) SCP (Sequencing, Cloze, and Prediction)}, an automated generator for private test cases, and \emph{ii) Rugged Scores (RS)}, metrics that measure the proportion of public benchmark contamination and training bias. Leveraging SCP, ArxivRoll constructs a new benchmark every six months using recent articles from ArXiv and employs them for one-time evaluations of LLM performance. Extensive experiments demonstrate the high quality of our benchmark, and we provide a systematic evaluation of current LLMs. The source code is available at https://github.com/liangzid/ArxivRoll/.
中文: 摘要提出了ArxivRoll动态评估框架,通过生成私有测试用例并量化污染与偏差,旨在解决大型语言模型评估中的高估问题,确保公平且真实的性能衡量。
English: The abstract introduces ArxivRoll, a dynamic evaluation framework designed to address overestimation in large language models by generating private test cases and measuring contamination and bias, ensuring fair and realistic assessments.
Authors:Hanbing Wu, Ping Jiang, Anyang Su, Chenxu Zhao, Tianyu Fu, Minghui Wu, Beiping Tan, Huiying Li
Abstract:
Visual selective attention, driven by individual preferences, regulates human prioritization of visual stimuli by bridging subjective cognitive mechanisms with objective visual elements, thereby steering the semantic interpretation and hierarchical processing of dynamic visual scenes. However, existing models and datasets predominantly neglect the influence of subjective cognitive diversity on fixation behavior. Conventional saliency prediction models, typically employing segmentation approaches, rely on low-resolution imagery to generate saliency heatmaps, subsequently upscaled to native resolutions, which limiting their capacity to capture personalized attention patterns. Furthermore, MLLMs are constrained by factors such as hallucinations, making it very costly to strictly adhere to the expected format in tasks involving multiple point predictions, and achieving precise point positioning is challenging. To address these limitations, we present Subjective Personalized Attention for Advertisement Videos, namely SPA-ADV, a large-scale multimodal dataset capturing gaze behaviors from over 4,500 participants varying in age and gender with 486 videos. Furthermore, we propose PRE-MAP, a novel eye-tracking saliency model that characterizes Personalized visual disparities through Reinforcement learning-optimized Eye-tracking, built upon MLLMs and guided by Multi-Attribute user profiles to predict Points. To ensure MLLMs produce prediction points that are both format-correct and spatially accurate, we introduce Consistency Group Relative Policy Optimization (C-GRPO), inspired by the variability in eye movement points and Multi-Attribute profiles. Extensive experiments on SPA-ADV and other benchmarks demonstrate the effectiveness of our approach. The code and dataset are available at \href{https://github.com/mininglamp-MLLM/PRE-MAP}{this URL}.
中文: 本研究提出了SPA-ADV大规模多模态数据集以捕捉多样化注视行为,并开发了PRE-MAP模型,该模型通过强化学习优化,利用多属性用户画像预测个性化视觉注意点,同时采用创新优化方法解决多模态大模型在格式规范与定位精度方面的挑战。
English: This research introduces SPA-ADV, a large-scale multimodal dataset capturing diverse gaze behaviors, and proposes PRE-MAP, a reinforcement learning-optimized model that leverages multi-attribute profiles to predict personalized visual attention points while addressing format and accuracy challenges in MLLMs through innovative optimization techniques.
Authors:Etienne Buehrle, Ãmer Åahin TaÅ, Christoph Stiller
Abstract:
We propose an approach to trajectory optimization for piecewise polynomial systems based on the recently proposed graphs of convex sets framework. We instantiate the framework with a convex relaxation of optimal control based on occupation measures, resulting in a convex optimization problem resembling the discrete shortest-paths linear program that can be solved efficiently to global optimality. While this approach inherits the limitations of semidefinite programming, scalability to large numbers of discrete modes improves compared to the NP-hard mixed-integer formulation. We use this to plan trajectories under temporal logic specifications, comparing the computed cost lower bound to a nonconvex optimization approach with fixed mode sequence. In our numerical experiments, we find that this bound is typically in the vicinity of the nonconvex solution, while the runtime speedup is significant compared to the often intractable mixed-integer formulation. Our implementation is available at https://github.com/ebuehrle/hpoc.
中文: 本文提出了一种基于凸集图框架的轨迹优化方法,通过占用度量的凸松弛高效求解最优控制问题至全局最优,相比混合整数规划显著提升计算速度,且所得成本下界与非凸解接近。
English: This paper introduces a trajectory optimization method using the graphs of convex sets framework, which efficiently solves convex relaxations of optimal control problems to global optimality and provides tight cost bounds with significant runtime improvements over mixed-integer approaches.
Authors:Simon Malan, Benjamin van Niekerk, Herman Kamper
Abstract:
We investigate the problem of segmenting unlabeled speech into word-like units and clustering these to create a lexicon. Prior work can be categorized into two frameworks. Bottom-up methods first determine boundaries and then cluster the fixed segmented words into a lexicon. In contrast, top-down methods incorporate information from the clustered words to inform boundary selection. However, it is unclear whether top-down information is necessary to improve segmentation. To explore this, we look at two similar approaches that differ in whether top-down clustering informs boundary selection. Our simple bottom-up strategy predicts word boundaries using the dissimilarity between adjacent self-supervised features, then clusters the resulting segments to construct a lexicon. Our top-down system is an updated version of the ES-KMeans dynamic programming method that iteratively uses K-means to update its boundaries. On the five-language ZeroSpeech benchmarks, both approaches achieve comparable state-of-the-art results, with the bottom-up system being nearly five times faster. Through detailed analyses, we show that the top-down influence of ES-KMeans can be beneficial (depending on factors like the candidate boundaries), but in many cases the simple bottom-up method performs just as well. For both methods, we show that the clustering step is a limiting factor. Therefore, we recommend that future work focus on improved clustering techniques and learning more discriminative word-like representations. Project code repository: https://github.com/s-malan/prom-seg-clus.
中文摘要:本研究对比了自下而上和自上而下两种将无标签语音分割为类词单元并聚合成词典的方法,发现两种方法均取得相近的先进性能,其中自下而上法速度提升近五倍,同时指出聚类技术是制约性能提升的关键瓶颈。
English Summary: This study compares bottom-up and top-down approaches for segmenting unlabeled speech into word-like units and clustering them into a lexicon, finding both achieve comparable state-of-the-art results with the bottom-up method being significantly faster, while identifying clustering as the key limiting factor for improvement.
Authors:Nao Tokui, Tom Baker
Abstract:
We introduce a novel technique for creative audio resynthesis that operates by reworking the concept of granular synthesis at the latent vector level. Our approach creates a "granular codebook" by encoding a source audio corpus into latent vector segments, then matches each latent grain of a target audio signal to its closest counterpart in the codebook. The resulting hybrid sequence is decoded to produce audio that preserves the target's temporal structure while adopting the source's timbral characteristics. This technique requires no model training, works with diverse audio materials, and naturally avoids the discontinuities typical of traditional concatenative synthesis through the codec's implicit interpolation during decoding. We include supplementary material at https://github.com/naotokui/latentgranular/ , as well as a proof-of-concept implementation to allow users to experiment with their own sounds at https://huggingface.co/spaces/naotokui/latentgranular .
中文: 本文提出一种潜在空间粒状合成技术,通过将目标音频片段与源音频构建的粒状码本匹配,无需模型训练即可生成保留时间结构并融合音色特征的混合音频。
English: This paper presents a latent-level granular synthesis technique that creates hybrid audio by matching target audio grains to a source-based granular codebook, preserving temporal structure while adopting timbral characteristics without model training.
Authors:Xin Li, Kaixiang Yang, Qiang Li, Zhiwei Wang
Abstract:
Mammography is the most commonly used imaging modality for breast cancer screening, driving an increasing demand for deep-learning techniques to support large-scale analysis. However, the development of accurate and robust methods is often limited by insufficient data availability and a lack of diversity in lesion characteristics. While generative models offer a promising solution for data synthesis, current approaches often fail to adequately emphasize lesion-specific features and their relationships with surrounding tissues. In this paper, we propose Gated Conditional Diffusion Model (GCDM), a novel framework designed to jointly synthesize holistic mammogram images and localized lesions. GCDM is built upon a latent denoising diffusion framework, where the noised latent image is concatenated with a soft mask embedding that represents breast, lesion, and their transitional regions, ensuring anatomical coherence between them during the denoising process. To further emphasize lesion-specific features, GCDM incorporates a gated conditioning branch that guides the denoising process by dynamically selecting and fusing the most relevant radiomic and geometric properties of lesions, effectively capturing their interplay. Experimental results demonstrate that GCDM achieves precise control over small lesion areas while enhancing the realism and diversity of synthesized mammograms. These advancements position GCDM as a promising tool for clinical applications in mammogram synthesis. Our code is available at https://github.com/lixinHUST/Gated-Conditional-Diffusion-Model/
Chinese Summary: 门控条件扩散模型(GCDM)是一种创新框架,通过潜在去噪过程结合软掩码嵌入和门控调节分支,能够同时合成完整的乳腺X光图像和局部病灶,有效增强病灶特征与周围组织的解剖一致性。
English Summary: The Gated Conditional Diffusion Model (GCDM) is a novel framework that synthesizes both complete mammogram images and localized lesions by using a latent denoising process with soft mask embeddings and a gated conditioning branch to enhance lesion-specific features and anatomical coherence.
Authors:Kang Wang, Chen Qin, Zhang Shi, Haoran Wang, Xiwen Zhang, Chen Chen, Cheng Ouyang, Chengliang Dai, Yuanhan Mo, Chenchen Dai, Xutong Kuang, Ruizhe Li, Xin Chen, Xiuzheng Yue, Song Tian, Alejandro Mora-Rubio, Kumaradevan Punithakumar, Shizhan Gong, Qi Dou, Sina Amirrajab, Yasmina Al Khalil, Cian M. Scannell, Lexiaozi Fan, Huili Yang, Xiaowu Sun, Rob van der Geest, Tewodros Weldebirhan Arega, Fabrice Meriaudeau, Caner Ãzer, Amin Ranem, John Kalkhof, İlkay Ãksüz, Anirban Mukhopadhyay, Abdul Qayyum, Moona Mazher, Steven A Niederer, Carles Garcia-Cabrera, Eric Arazo, Michal K. Grzeszczyk, Szymon PÅotka, Wanqin Ma, Xiaomeng Li, Rongjun Ge, Yongqing Kou, Xinrong Chen, He Wang, Chengyan Wang, Wenjia Bai, Shuo Wang
Abstract:
Deep learning models have achieved state-of-the-art performance in automated Cardiac Magnetic Resonance (CMR) analysis. However, the efficacy of these models is highly dependent on the availability of high-quality, artifact-free images. In clinical practice, CMR acquisitions are frequently degraded by respiratory motion, yet the robustness of deep learning models against such artifacts remains an underexplored problem. To promote research in this domain, we organized the MICCAI CMRxMotion challenge. We curated and publicly released a dataset of 320 CMR cine series from 40 healthy volunteers who performed specific breathing protocols to induce a controlled spectrum of motion artifacts. The challenge comprised two tasks: 1) automated image quality assessment to classify images based on motion severity, and 2) robust myocardial segmentation in the presence of motion artifacts. A total of 22 algorithms were submitted and evaluated on the two designated tasks. This paper presents a comprehensive overview of the challenge design and dataset, reports the evaluation results for the top-performing methods, and further investigates the impact of motion artifacts on five clinically relevant biomarkers. All resources and code are publicly available at: https://github.com/CMRxMotion
中文:MICCAI CMRxMotion挑战赛通过发布包含受控运动伪影的心脏MRI数据集,评估了22种算法在图像质量分级与分割任务中的表现,旨在推动深度学习模型对呼吸运动伪影的鲁棒性研究,所有资源均已开源。
English: The MICCAI CMRxMotion challenge addressed the underexplored issue of deep learning robustness against respiratory motion artifacts in cardiac MRI by releasing a curated dataset and evaluating 22 algorithms on quality assessment and segmentation tasks, with all resources made publicly available.
Authors:Jie Chen, Zhangchi Hu, Peixi Wu, Huyue Zhu, Hebei Li, Xiaoyan Sun
Abstract:
Dynamic scene reconstruction is a long-term challenge in 3D vision. Existing plane-based methods in dynamic Gaussian splatting suffer from an unsuitable low-rank assumption, causing feature overlap and poor rendering quality. Although 4D hash encoding provides an explicit representation without low-rank constraints, directly applying it to the entire dynamic scene leads to substantial hash collisions and redundancy. To address these challenges, we present DASH, a real-time dynamic scene rendering framework that employs 4D hash encoding coupled with self-supervised decomposition. Our approach begins with a self-supervised decomposition mechanism that separates dynamic and static components without manual annotations or precomputed masks. Next, we introduce a multiresolution 4D hash encoder for dynamic elements, providing an explicit representation that avoids the low-rank assumption. Finally, we present a spatio-temporal smoothness regularization strategy to mitigate unstable deformation artifacts. Experiments on real-world datasets demonstrate that DASH achieves state-of-the-art dynamic rendering performance, exhibiting enhanced visual quality at real-time speeds of 264 FPS on a single 4090 GPU. Code: https://github.com/chenj02/DASH.
中文: DASH提出了一种结合4D哈希编码与自监督分解的实时动态场景渲染框架,有效解决了特征重叠和哈希冲突问题,在264 FPS下实现了最优的视觉质量。
English: DASH introduces a real-time dynamic scene rendering framework using 4D hash encoding with self-supervised decomposition to overcome feature overlap and hash collisions, achieving state-of-the-art visual quality at 264 FPS.
Authors:Tianyu Zou, Shengwu Xiong, Ruilin Yao, Yi Rong
Abstract:
This paper studies the few-shot segmentation (FSS) task, which aims to segment objects belonging to unseen categories in a query image by learning a model on a small number of well-annotated support samples. Our analysis of two mainstream FSS paradigms reveals that the predictions made by prototype learning methods are usually conservative, while those of affinity learning methods tend to be more aggressive. This observation motivates us to balance the conservative and aggressive information captured by these two types of FSS frameworks so as to improve the segmentation performance. To achieve this, we propose a **P**rototype-**A**ffinity **H**ybrid **Net**work (PAHNet), which introduces a Prototype-guided Feature Enhancement (PFE) module and an Attention Score Calibration (ASC) module in each attention block of an affinity learning model (called affinity learner). These two modules utilize the predictions generated by a pre-trained prototype learning model (called prototype predictor) to enhance the foreground information in support and query image representations and suppress the mismatched foreground-background (FG-BG) relationships between them, respectively. In this way, the aggressiveness of the affinity learner can be effectively mitigated, thereby eventually increasing the segmentation accuracy of our PAHNet method. Experimental results show that PAHNet outperforms most recently proposed methods across 1-shot and 5-shot settings on both PASCAL-5$^i$ and COCO-20$^i$ datasets, suggesting its effectiveness. The code is available at: [GitHub - tianyu-zou/PAHNet: Balancing Conservatism and Aggressiveness: Prototype-Affinity Hybrid Network for Few-Shot Segmentation (ICCV'25)](https://github.com/tianyu-zou/PAHNet)
本文提出PAHNet混合网络,通过特征增强和注意力校准模块平衡保守的原型学习与激进的亲和力学习,从而提升小样本分割的准确性。
This paper introduces PAHNet, a hybrid few-shot segmentation model that balances conservative prototype learning and aggressive affinity learning through feature enhancement and attention calibration modules to improve segmentation accuracy.
Authors:Xuetian Chen, Yinghao Chen, Xinfeng Yuan, Zhuo Peng, Lu Chen, Yuekeng Li, Zhoujia Zhang, Yingqian Huang, Leyan Huang, Jiaqing Liang, Tianbao Xie, Zhiyong Wu, Qiushi Sun, Biqing Qi, Bowen Zhou
Abstract:
Computer-using agents have shown strong potential to boost human productivity and enable new application forms across platforms. While recent advances have led to usable applications, existing benchmarks fail to account for the internal task heterogeneity and the corresponding agent capabilities, as well as their alignment with actual user demands-hindering both targeted capability development and the reliable transition of research progress into practical deployment. To bridge the gap, we present OS-MAP, a benchmark for daily computer-using automation that organizes its 416 realistic tasks across 15 applications along two key dimensions: a five-level taxonomy of automation and a generalization scope derived from a real-world user demand hierarchy. To enable fine-grained analysis of required capabilities and alignment with real-world scenarios, OS-MAP evaluates agents along two dimensions: automation level across a five-level taxonomy, and generalization scope across a demand hierarchy. This design captures varying levels of required agent autonomy and generalization, forming a performance-generalization evaluation matrix for structured and comprehensive assessment. Experiments show that even State-of-the-Art agents with VLM backbones struggle with higher-level tasks involving perception, reasoning, and coordination-highlighting the need for a deeper understanding of current strengths and limitations to drive the future progress in computer-using agents research and deployment. All code, environments, baselines, and data are publicly available at https://github.com/OS-Copilot/OS-Map.
中文: 计算机使用代理虽能提升生产力,但现有基准未能匹配实际需求,因此我们推出OS-MAP基准,通过自动化分级和泛化范围评估代理能力,揭示其在高阶任务中的不足,以推动研究与应用发展。
English: Computer-using agents show promise for productivity but face challenges in aligning capabilities with real-world tasks, prompting the introduction of OS-MAP, a benchmark that evaluates agents across automation levels and generalization scopes to address these gaps and guide future development.
Authors:Yuqi Li, Haotian Zhang, Li Li, Dong Liu
Abstract:
Context modeling is essential in learned image compression for accurately estimating the distribution of latents. While recent advanced methods have expanded context modeling capacity, they still struggle to efficiently exploit long-range dependency and diverse context information across different coding steps. In this paper, we introduce a novel Hierarchical Progressive Context Model (HPCM) for more efficient context information acquisition. Specifically, HPCM employs a hierarchical coding schedule to sequentially model the contextual dependencies among latents at multiple scales, which enables more efficient long-range context modeling. Furthermore, we propose a progressive context fusion mechanism that incorporates contextual information from previous coding steps into the current step, effectively exploiting diverse contextual information. Experimental results demonstrate that our method achieves state-of-the-art rate-distortion performance and strikes a better balance between compression performance and computational complexity. The code is available at https://github.com/lyq133/LIC-HPCM.
Chinese: 本文提出的分层渐进上下文模型(HPCM)通过多尺度顺序建模和渐进融合机制,有效利用长程依赖和多样化上下文信息,在图像压缩中实现了领先的率失真性能和更优的计算复杂度平衡。
English: The paper introduces a Hierarchical Progressive Context Model (HPCM) that enhances image compression by efficiently capturing long-range dependencies and integrating contextual information across coding steps, achieving superior rate-distortion performance and computational balance.
Authors:Binxiong Li, Xu Xiang, Xue Li, Quanzhou Lou, Binyu Zhao, Yujie Liu, Huijie Tang, Benhan Yang
Abstract:
Attributed graph clustering holds significant importance in modern data analysis. However, due to the complexity of graph data and the heterogeneity of node attributes, leveraging graph information for clustering remains challenging. To address this, we propose a novel deep graph clustering model, GCL-GCN, specifically designed to address the limitations of existing models in capturing local dependencies and complex structures when dealing with sparse and heterogeneous graph data. GCL-GCN introduces an innovative Graphormer module that combines centrality encoding and spatial relationships, effectively capturing both global and local information between nodes, thereby enhancing the quality of node representations. Additionally, we propose a novel contrastive learning module that significantly enhances the discriminative power of feature representations. In the pre-training phase, this module increases feature distinction through contrastive learning on the original feature matrix, ensuring more identifiable initial representations for subsequent graph convolution and clustering tasks. Extensive experimental results on six datasets demonstrate that GCL-GCN outperforms 14 advanced methods in terms of clustering quality and robustness. Specifically, on the Cora dataset, it improves ACC, NMI, and ARI by 4.94%, 13.01%, and 10.97%, respectively, compared to the primary comparison method MBN.
中文: 提出的GCL-GCN模型通过结合捕捉节点局部与全局依赖的Graphormer模块和增强特征区分度的对比学习模块,改进了属性图聚类效果,在多个数据集上超越了现有方法的性能表现。
English: The proposed GCL-GCN model enhances attributed graph clustering by integrating a Graphormer module for capturing local and global node dependencies and a contrastive learning module to improve feature discrimination, achieving superior performance over existing methods across multiple datasets.
Authors:Antonio Tudisco, Deborah Volpe, Giacomo Orlandi, Giovanna Turvani
Abstract:
The growing variety of quantum hardware technologies, each with unique peculiarities such as connectivity and native gate sets, creates challenges when selecting the best platform for executing a specific quantum circuit. This selection process usually involves a brute-force approach: compiling the circuit on various devices and evaluating performance based on factors such as circuit depth and gate fidelity. However, this method is computationally expensive and does not scale well as the number of available quantum processors increases. In this work, we propose a Graph Neural Network (GNN)-based predictor that automates hardware selection by analyzing the Directed Acyclic Graph (DAG) representation of a quantum circuit. Our study evaluates 498 quantum circuits (up to 27 qubits) from the MQT Bench dataset, compiled using Qiskit on four devices: three superconducting quantum processors (IBM-Kyiv, IBM-Brisbane, IBM-Sherbrooke) and one trapped-ion processor (IONQ-Forte). Performance is estimated using a metric that integrates circuit depth and gate fidelity, resulting in a dataset where 93 circuits are optimally compiled on the trapped-ion device, while the remaining circuits prefer superconducting platforms. By exploiting graph-based machine learning, our approach avoids extracting the circuit features for the model evaluation but directly embeds it as a graph, significantly accelerating the optimal target decision-making process and maintaining all the information. Experimental results prove 94.4% accuracy and an 85.5% F1 score for the minority class, effectively predicting the best compilation target. The developed code is publicly available on GitHub (https://github.com/antotu/GNN-Model-Quantum-Predictor).
中文: 本研究提出了一种基于图神经网络的预测器,通过分析量子电路的图结构来自动选择最优硬件,相比传统暴力编译方法显著加速决策过程,并以94.4%的准确率成功预测最佳编译目标。
English: This study introduces a Graph Neural Network (GNN)-based predictor that automates quantum hardware selection by analyzing circuit graphs, achieving 94.4% accuracy in identifying optimal devices and significantly speeding up decision-making compared to traditional brute-force compilation methods.
Authors:Shuhao Li, Weidong Yang, Yue Cui, Xiaoxing Liu, Lingkai Meng, Lipeng Ma, Fan Zhang
Abstract:
Fine-grained traffic management and prediction are fundamental to key applications such as autonomous driving, lane change guidance, and traffic signal control. However, obtaining lane-level traffic data has become a critical bottleneck for data-driven models due to limitations in the types and number of sensors and issues with the accuracy of tracking algorithms. To address this, we propose the Fine-grained Road Traffic Inference (FRTI) task, which aims to generate more detailed lane-level traffic information using limited road data, providing a more energy-efficient and cost-effective solution for precise traffic management. This task is abstracted as the first scene of the spatio-temporal graph node generation problem. We designed a two-stage framework--RoadDiff--to solve the FRTI task. solve the FRTI task. This framework leverages the Road-Lane Correlation Autoencoder-Decoder and the Lane Diffusion Module to fully utilize the limited spatio-temporal dependencies and distribution relationships of road data to accurately infer fine-grained lane traffic states. Based on existing research, we designed several baseline models with the potential to solve the FRTI task and conducted extensive experiments on six datasets representing different road conditions to validate the effectiveness of the RoadDiff model in addressing the FRTI task. The relevant datasets and code are available at https://github.com/ShuhaoLii/RoadDiff.
Chinese: 本研究提出了细粒度道路交通推断(FRTI)任务,并设计了两阶段RoadDiff框架,通过利用有限道路数据高效生成详细车道级交通信息,在六个数据集上的大量实验验证了其有效性。
English: The study introduces the Fine-grained Road Traffic Inference (FRTI) task and proposes a two-stage RoadDiff framework to generate detailed lane-level traffic data efficiently using limited road information, validated through extensive experiments on six datasets.
Authors:Rui Pan, Ruiying Lu
Abstract:
Radiography imaging protocols target on specific anatomical regions, resulting in highly consistent images with recurrent structural patterns across patients. Recent advances in medical anomaly detection have demonstrated the effectiveness of CNN- and transformer-based approaches. However, CNNs exhibit limitations in capturing long-range dependencies, while transformers suffer from quadratic computational complexity. In contrast, Mamba-based models, leveraging superior long-range modeling, structural feature extraction, and linear computational efficiency, have emerged as a promising alternative. To capitalize on the inherent structural regularity of medical images, this study introduces SP-Mamba, a spatial-perception Mamba framework for unsupervised medical anomaly detection. The window-sliding prototype learning and Circular-Hilbert scanning-based Mamba are introduced to better exploit consistent anatomical patterns and leverage spatial information for medical anomaly detection. Furthermore, we excavate the concentration and contrast characteristics of anomaly maps for improving anomaly detection. Extensive experiments on three diverse medical anomaly detection benchmarks confirm the proposed method's state-of-the-art performance, validating its efficacy and robustness. The code is available at https://github.com/Ray-RuiPan/SP-Mamba.
中文摘要:本研究提出SP-Mamba空间感知框架,通过利用医学图像的结构规律性和空间信息进行无监督异常检测,在多个基准测试中实现了最先进的性能。
English Summary: This study introduces SP-Mamba, a spatial-perception Mamba framework that leverages structural regularity and spatial information for unsupervised medical anomaly detection, achieving state-of-the-art performance across multiple benchmarks.
Authors:Shuiqing Zhao, Meihuan Wang, Jiaxuan Xu, Jie Feng, Wei Qian, Rongchang Chen, Zhenyu Liang, Shouliang Qi, Yanan Wu
Abstract:
Background: It is fundamental for accurate segmentation and quantification of the pulmonary vessel, particularly smaller vessels, from computed tomography (CT) images in chronic obstructive pulmonary disease (COPD) patients. Objective: The aim of this study was to segment the pulmonary vasculature using a semi-supervised method. Methods: In this study, a self-training framework is proposed by leveraging a teacher-student model for the segmentation of pulmonary vessels. First, the high-quality annotations are acquired in the in-house data by an interactive way. Then, the model is trained in the semi-supervised way. A fully supervised model is trained on a small set of labeled CT images, yielding the teacher model. Following this, the teacher model is used to generate pseudo-labels for the unlabeled CT images, from which reliable ones are selected based on a certain strategy. The training of the student model involves these reliable pseudo-labels. This training process is iteratively repeated until an optimal performance is achieved. Results: Extensive experiments are performed on non-enhanced CT scans of 125 COPD patients. Quantitative and qualitative analyses demonstrate that the proposed method, Semi2, significantly improves the precision of vessel segmentation by 2.3%, achieving a precision of 90.3%. Further, quantitative analysis is conducted in the pulmonary vessel of COPD, providing insights into the differences in the pulmonary vessel across different severity of the disease. Conclusion: The proposed method can not only improve the performance of pulmonary vascular segmentation, but can also be applied in COPD analysis. The code will be made available at https://github.com/wuyanan513/semi-supervised-learning-for-vessel-segmentation.
中文: 本研究提出一种名为Semi2的半监督师生模型,将COPD患者肺血管分割精度提升2.3%至90.3%,并实现了疾病严重程度的血管差异分析。
English: This study introduces a semi-supervised teacher-student model called Semi2, which improves pulmonary vessel segmentation precision by 2.3% to 90.3% in COPD patients and enables disease severity analysis.
Authors:Haochen Han, Alex Jinpeng Wang, Fangming Liu, Jun Zhu
Abstract:
In this paper, we study a practical but less-touched problem in Vision-Language Models (VLMs), \ie, negation understanding. Specifically, many real-world applications require models to explicitly identify what is false or non-existent, \eg, radiologists may search for images that exclude specific conditions. Despite the impressive transferability of VLMs through large-scale training, they suffer from a critical limitation that fails to handle negation. To address this challenge, existing methods attribute its root cause to the scarcity of negation training data and propose to fine-tune VLMs on massive data containing explicit negation. Undoubtedly, such data-centric solutions demand substantial data and computational resources, limiting their sustainable widespread adoption. To tackle negation in a low-carbon manner, we empirically observe that the key obstacle lies in the dual-concept shifts between the affirmation and negation distributions. Therefore, we propose a Negation-Aware Test-Time Adaptation (NEAT) method to efficiently adjust distribution-related parameters during inference. In brief, NEAT can reduce distribution shift in consistent semantics while eliminating false distributional consistency in unrelated semantics. Extensive experiments on the various negation understanding tasks verify the effectiveness of the proposed method. Remarkably, with less than 0.01\% of trainable parameters, NEAT achieves comparable or superior performance to state-of-the-art post-training approaches. Our code is available at https://github.com/hhc1997/NEAT.
Chinese: 本文提出NEAT方法,通过推理时调整分布相关参数,以低碳方式解决视觉语言模型中的否定理解问题,仅用极少可训练参数即达到最优性能。
English: This paper introduces NEAT, a low-carbon test-time adaptation method that efficiently addresses negation understanding in Vision-Language Models by adjusting distribution-related parameters during inference, achieving state-of-the-art performance with minimal computational resources.
Authors:Yufei Ma, Hanwen Zhang, Qiya Yang, Guibo Luo, Yuesheng Zhu
Abstract:
In multi-center scenarios, One-Shot Federated Learning (OSFL) has attracted increasing attention due to its low communication overhead, requiring only a single round of transmission. However, existing generative model-based OSFL methods suffer from low training efficiency and potential privacy leakage in the healthcare domain. Additionally, achieving convergence within a single round of model aggregation is challenging under non-Independent and Identically Distributed (non-IID) data. To address these challenges, in this paper a modified OSFL framework is proposed, in which a new Feature-Guided Rectified Flow Model (FG-RF) and Dual-Layer Knowledge Distillation (DLKD) aggregation method are developed. FG-RF on the client side accelerates generative modeling in medical imaging scenarios while preserving privacy by synthesizing feature-level images rather than pixel-level images. To handle non-IID distributions, DLKD enables the global student model to simultaneously mimic the output logits and align the intermediate-layer features of client-side teacher models during aggregation. Experimental results on three non-IID medical imaging datasets show that our new framework and method outperform multi-round federated learning approaches, achieving up to 21.73% improvement, and exceeds the baseline FedISCA by an average of 21.75%. Furthermore, our experiments demonstrate that feature-level synthetic images significantly reduce privacy leakage risks compared to pixel-level synthetic images. The code is available at https://github.com/LMIAPC/one-shot-fl-medical.
Chinese: 本文提出了一种改进的单次联邦学习框架,采用特征引导整流流模型实现高效且保护隐私的医学图像生成,并通过双层知识蒸馏方法处理非独立同分布数据,在三个医学影像数据集上相比现有方法性能提升最高达21.73%。
English: This paper introduces a modified one-shot federated learning framework featuring a Feature-Guided Rectified Flow Model for efficient, privacy-preserving medical image generation and a Dual-Layer Knowledge Distillation method to handle non-IID data, achieving superior performance with up to 21.73% improvement over existing approaches.
Authors:Ying Ba, Tianyu Zhang, Yalong Bai, Wenyi Mo, Tao Liang, Bing Su, Ji-Rong Wen
Abstract:
Contemporary image generation systems have achieved high fidelity and superior aesthetic quality beyond basic text-image alignment. However, existing evaluation frameworks have failed to evolve in parallel. This study reveals that human preference reward models fine-tuned based on CLIP and BLIP architectures have inherent flaws: they inappropriately assign low scores to images with rich details and high aesthetic value, creating a significant discrepancy with actual human aesthetic preferences. To address this issue, we design a novel evaluation score, ICT (Image-Contained-Text) score, that achieves and surpasses the objectives of text-image alignment by assessing the degree to which images represent textual content. Building upon this foundation, we further train an HP (High-Preference) score model using solely the image modality to enhance image aesthetics and detail quality while maintaining text-image alignment. Experiments demonstrate that the proposed evaluation model improves scoring accuracy by over 10\% compared to existing methods, and achieves significant results in optimizing state-of-the-art text-to-image models. This research provides theoretical and empirical support for evolving image generation technology toward higher-order human aesthetic preferences. Code is available at https://github.com/BarretBa/ICTHP.
中文: 针对现有图像生成系统评估方法的不足,本研究提出了ICT和HP评分模型,通过优化图文对齐和美学质量评估,显著提升了评分准确性并推动图像生成技术向更高审美层次发展。
English: Current image generation systems produce high-quality images but are poorly evaluated by outdated metrics, leading this study to introduce the ICT and HP scores for better alignment with human aesthetic preferences and improved evaluation accuracy.
Authors:Yongsong Huang, Tomo Miyazaki, Xiaofeng Liu, Shinichiro Omachi
Abstract:
Infrared Image Super-Resolution (IRSR) is challenged by the low contrast and sparse textures of infrared data, requiring robust long-range modeling to maintain global coherence. While State-Space Models like Mamba offer proficiency in modeling long-range dependencies for this task, their inherent 1D causal scanning mechanism fragments the global context of 2D images, hindering fine-detail restoration. To address this, we propose Global Phase and Spectral Prompt-guided Mamba (GPSMamba), a framework that synergizes architectural guidance with non-causal supervision. First, our Adaptive Semantic-Frequency State Space Module (ASF-SSM) injects a fused semantic-frequency prompt directly into the Mamba block, integrating non-local context to guide reconstruction. Then, a novel Thermal-Spectral Attention and Phase Consistency Loss provides explicit, non-causal supervision to enforce global structural and spectral fidelity. By combining these two innovations, our work presents a systematic strategy to mitigate the limitations of causal modeling. Extensive experiments demonstrate that GPSMamba achieves state-of-the-art performance, validating our approach as a powerful new paradigm for infrared image restoration. Code is available at https://github.com/yongsongH/GPSMamba.
中文: 提出的GPSMamba框架通过融合自适应语义频率提示和热光谱监督,克服了红外图像超分辨率中一维因果建模的局限,实现了通过增强全局一致性和结构保真度的最先进性能。
English: The proposed GPSMamba framework overcomes the limitations of 1D causal modeling in infrared image super-resolution by integrating adaptive semantic-frequency prompts and thermal-spectral supervision, achieving state-of-the-art performance through enhanced global coherence and structural fidelity.
Authors:Zixiang Ai, Zhenyu Cui, Yuxin Peng, Jiahuan Zhou
Abstract:
Pre-trained point cloud analysis models have shown promising advancements in various downstream tasks, yet their effectiveness is typically suffering from low-quality point cloud (i.e., noise and incompleteness), which is a common issue in real scenarios due to casual object occlusions and unsatisfactory data collected by 3D sensors. To this end, existing methods focus on enhancing point cloud quality by developing dedicated denoising and completion models. However, due to the isolation between the point cloud enhancement and downstream tasks, these methods fail to work in various real-world domains. In addition, the conflicting objectives between denoising and completing tasks further limit the ensemble paradigm to preserve critical geometric features. To tackle the above challenges, we propose a unified point-level prompting method that reformulates point cloud denoising and completion as a prompting mechanism, enabling robust analysis in a parameter-efficient manner. We start by introducing a Rectification Prompter to adapt to noisy points through the predicted rectification vector prompts, effectively filtering noise while preserving intricate geometric features essential for accurate analysis. Sequentially, we further incorporate a Completion Prompter to generate auxiliary point prompts based on the rectified point clouds, facilitating their robustness and adaptability. Finally, a Shape-Aware Unit module is exploited to efficiently unify and capture the filtered geometric features for the downstream point cloud analysis.Extensive experiments on four datasets demonstrate the superiority and robustness of our method when handling noisy and incomplete point cloud data against existing state-of-the-art methods. Our code is released at https://github.com/zhoujiahuan1991/ICCV2025-UPP.
中文: 本文提出了一种统一的点级提示方法,通过整合校正和补全提示器来高效处理噪声和不完整数据,从而增强预训练点云分析的鲁棒性,并在多个数据集上展现出卓越性能。
English: This paper introduces a unified point-level prompting method that enhances pre-trained point cloud analysis by integrating rectification and completion prompters to handle noisy and incomplete data efficiently, demonstrating superior performance across multiple datasets.
Authors:Zeyi Lu, Xiaoxiao Ma, Yujun Huang, Minxiao Chen, Bin Chen, Baoyi An, Shu-Tao Xia
Abstract:
The explosive growth of multi-source multimedia data has significantly increased the demands for transmission and storage, placing substantial pressure on bandwidth and storage infrastructures. While Autoregressive Compression Models (ACMs) have markedly improved compression efficiency through probabilistic prediction, current approaches remain constrained by two critical limitations: suboptimal compression ratios due to insufficient fine-grained feature extraction during probability modeling, and real-time processing bottlenecks caused by high resource consumption and low compression speeds. To address these challenges, we propose Efficient Dual-path Parallel Compression (EDPC), a hierarchically optimized compression framework that synergistically enhances modeling capability and execution efficiency via coordinated dual-path operations. At the modeling level, we introduce the Information Flow Refinement (IFR) metric grounded in mutual information theory, and design a Multi-path Byte Refinement Block (MBRB) to strengthen cross-byte dependency modeling via heterogeneous feature propagation. At the system level, we develop a Latent Transformation Engine (LTE) for compact high-dimensional feature representation and a Decoupled Pipeline Compression Architecture (DPCA) to eliminate encoding-decoding latency through pipelined parallelization. Experimental results demonstrate that EDPC achieves comprehensive improvements over state-of-the-art methods, including a 2.7x faster compression speed, and a 3.2% higher compression ratio. These advancements establish EDPC as an efficient solution for real-time processing of large-scale multimedia data in bandwidth-constrained scenarios. Our code is available at https://github.com/Magie0/EDPC.
中文摘要:提出的高效双路并行压缩(EDPC)框架通过改进特征提取和处理效率,解决了自回归压缩模型在细粒度特征建模和实时处理方面的局限,实现了压缩速度与压缩率的显著提升。
English Summary: The proposed Efficient Dual-path Parallel Compression (EDPC) framework overcomes limitations in autoregressive compression models by enhancing feature extraction and processing efficiency, achieving significantly faster compression speeds and higher compression ratios.
Authors:Jionghao Wang, Cheng Lin, Yuan Liu, Rui Xu, Zhiyang Dou, Xiao-Xiao Long, Hao-Xiang Guo, Taku Komura, Wenping Wang, Xin Li
Abstract:
Point-based representations have consistently played a vital role in geometric data structures. Most point cloud learning and processing methods typically leverage the unordered and unconstrained nature to represent the underlying geometry of 3D shapes. However, how to extract meaningful structural information from unstructured point cloud distributions and transform them into semantically meaningful point distributions remains an under-explored problem. We present PDT, a novel framework for point distribution transformation with diffusion models. Given a set of input points, PDT learns to transform the point set from its original geometric distribution into a target distribution that is semantically meaningful. Our method utilizes diffusion models with novel architecture and learning strategy, which effectively correlates the source and the target distribution through a denoising process. Through extensive experiments, we show that our method successfully transforms input point clouds into various forms of structured outputs - ranging from surface-aligned keypoints, and inner sparse joints to continuous feature lines. The results showcase our framework's ability to capture both geometric and semantic features, offering a powerful tool for various 3D geometry processing tasks where structured point distributions are desired. Code will be available at this link: https://github.com/shanemankiw/PDT.
中文: PDT框架利用扩散模型将无结构的点云转换为具有语义意义的分布,有效捕捉几何和结构特征,以提升三维几何处理能力。
English: The PDT framework employs diffusion models to transform unstructured point clouds into semantically meaningful distributions, effectively capturing both geometric and structural features for enhanced 3D geometry processing.
Authors:Lei Zhang, Xin Zhou, Chaoyue He, Di Wang, Yi Wu, Hong Xu, Wei Liu, Chunyan Miao
Abstract:
Environmental, Social, and Governance (ESG) reports are essential for evaluating sustainability practices, ensuring regulatory compliance, and promoting financial transparency. However, these documents are often lengthy, structurally diverse, and multimodal, comprising dense text, structured tables, complex figures, and layout-dependent semantics. Existing AI systems often struggle to perform reliable document-level reasoning in such settings, and no dedicated benchmark currently exists in ESG domain. To fill the gap, we introduce \textbf{MMESGBench}, a first-of-its-kind benchmark dataset targeted to evaluate multimodal understanding and complex reasoning across structurally diverse and multi-source ESG documents. This dataset is constructed via a human-AI collaborative, multi-stage pipeline. First, a multimodal LLM generates candidate question-answer (QA) pairs by jointly interpreting rich textual, tabular, and visual information from layout-aware document pages. Second, an LLM verifies the semantic accuracy, completeness, and reasoning complexity of each QA pair. This automated process is followed by an expert-in-the-loop validation, where domain specialists validate and calibrate QA pairs to ensure quality, relevance, and diversity. MMESGBench comprises 933 validated QA pairs derived from 45 ESG documents, spanning across seven distinct document types and three major ESG source categories. Questions are categorized as single-page, cross-page, or unanswerable, with each accompanied by fine-grained multimodal evidence. Initial experiments validate that multimodal and retrieval-augmented models substantially outperform text-only baselines, particularly on visually grounded and cross-page tasks. MMESGBench is publicly available as an open-source dataset at https://github.com/Zhanglei1103/MMESGBench.
中文: MMESGBench是首个针对多样化ESG文件的多模态基准数据集,通过人机协作构建流程评估AI系统的跨模态理解和复杂推理能力,实验证明多模态模型在视觉依赖和跨页任务上显著优于纯文本方法。
English: MMESGBench is a pioneering multimodal benchmark dataset designed to evaluate AI systems' understanding and reasoning across diverse ESG documents, addressing current limitations through a human-AI collaborative creation process and demonstrating superior performance of multimodal models over text-only approaches.
Authors:Jian Chen, Yuxuan Hu, Haifeng Lu, Wei Wang, Min Yang, Chengming Li, Xiping Hu
Abstract:
Although pre-trained visual models with text have demonstrated strong capabilities in visual feature extraction, sticker emotion understanding remains challenging due to its reliance on multi-view information, such as background knowledge and stylistic cues. To address this, we propose a novel multi-granularity hierarchical fusion transformer (MGHFT), with a multi-view sticker interpreter based on Multimodal Large Language Models. Specifically, inspired by the human ability to interpret sticker emotions from multiple views, we first use Multimodal Large Language Models to interpret stickers by providing rich textual context via multi-view descriptions. Then, we design a hierarchical fusion strategy to fuse the textual context into visual understanding, which builds upon a pyramid visual transformer to extract both global and local sticker features at multiple stages. Through contrastive learning and attention mechanisms, textual features are injected at different stages of the visual backbone, enhancing the fusion of global- and local-granularity visual semantics with textual guidance. Finally, we introduce a text-guided fusion attention mechanism to effectively integrate the overall multimodal features, enhancing semantic understanding. Extensive experiments on 2 public sticker emotion datasets demonstrate that MGHFT significantly outperforms existing sticker emotion recognition approaches, achieving higher accuracy and more fine-grained emotion recognition. Compared to the best pre-trained visual models, our MGHFT also obtains an obvious improvement, 5.4% on F1 and 4.0% on accuracy. The code is released at https://github.com/cccccj-03/MGHFT_ACMMM2025.
中文摘要:提出的多粒度分层融合Transformer(MGHFT)通过多模态大语言模型实现表情包的多视角解读,并采用分层策略将文本信息与多阶段视觉特征融合,在表情情感识别任务中显著优于现有方法。
English Summary: The proposed Multi-Granularity Hierarchical Fusion Transformer (MGHFT) leverages multimodal large language models to interpret stickers through multi-view descriptions and hierarchically fuses textual context with visual features, achieving superior performance in sticker emotion recognition compared to existing methods.
Authors:Zhihao Luo, Luojun Lin, Zheng Lin
Abstract:
Due to the high cost of collection and labeling, there are relatively few datasets for camouflaged object detection (COD). In particular, for certain specialized categories, the available image dataset is insufficiently populated. Synthetic datasets can be utilized to alleviate the problem of limited data to some extent. However, directly training with synthetic datasets compared to real datasets can lead to a degradation in model performance. To tackle this problem, in this work, we investigate a new task, namely Syn-to-Real Camouflaged Object Detection (S2R-COD). In order to improve the model performance in real world scenarios, a set of annotated synthetic camouflaged images and a limited number of unannotated real images must be utilized. We propose the Cycling Syn-to-Real Domain Adaptation Framework (CSRDA), a method based on the student-teacher model. Specially, CSRDA propagates class information from the labeled source domain to the unlabeled target domain through pseudo labeling combined with consistency regularization. Considering that narrowing the intra-domain gap can improve the quality of pseudo labeling, CSRDA utilizes a recurrent learning framework to build an evolving real domain for bridging the source and target domain. Extensive experiments demonstrate the effectiveness of our framework, mitigating the problem of limited data and handcraft annotations in COD. Our code is publicly available at: https://github.com/Muscape/S2R-COD.
中文: 本研究提出合成到真实伪装目标检测(S2R-COD)新任务,并开发了CSRDA框架,通过域适应方法结合合成数据与未标注真实图像,显著提升了模型在真实场景中的性能,有效缓解了伪装检测领域数据匮乏的问题。
English: This study introduces the Syn-to-Real Camouflaged Object Detection (S2R-COD) task and proposes the CSRDA framework, which uses synthetic data and unlabeled real images with domain adaptation to enhance model performance in real-world scenarios, effectively addressing data scarcity in COD.
Authors:Zhihao Luo, Luojun Lin, Zheng Lin
Abstract:
Due to the high cost of collection and labeling, there are relatively few datasets for camouflaged object detection (COD). In particular, for certain specialized categories, the available image dataset is insufficiently populated. Synthetic datasets can be utilized to alleviate the problem of limited data to some extent. However, directly training with synthetic datasets compared to real datasets can lead to a degradation in model performance. To tackle this problem, in this work, we investigate a new task, namely Syn-to-Real Camouflaged Object Detection (S2R-COD). In order to improve the model performance in real world scenarios, a set of annotated synthetic camouflaged images and a limited number of unannotated real images must be utilized. We propose the Cycling Syn-to-Real Domain Adaptation Framework (CSRDA), a method based on the student-teacher model. Specially, CSRDA propagates class information from the labeled source domain to the unlabeled target domain through pseudo labeling combined with consistency regularization. Considering that narrowing the intra-domain gap can improve the quality of pseudo labeling, CSRDA utilizes a recurrent learning framework to build an evolving real domain for bridging the source and target domain. Extensive experiments demonstrate the effectiveness of our framework, mitigating the problem of limited data and handcraft annotations in COD. Our code is publicly available at: https://github.com/Muscape/S2R-COD.
中文: 本研究提出合成到真实伪装目标检测(S2R-COD)新任务,并开发了CSRDA框架,通过域适应方法结合合成数据与未标注真实图像,显著提升了模型在真实场景中的性能,有效缓解了伪装检测领域数据匮乏的问题。
English: This study introduces the Syn-to-Real Camouflaged Object Detection (S2R-COD) task and proposes the CSRDA framework, which uses synthetic data and unlabeled real images with domain adaptation to enhance model performance in real-world scenarios, effectively addressing data scarcity in COD.
Authors:Chuxuan Hu, Liyun Zhang, Yeji Lim, Aum Wadhwani, Austin Peters, Daniel Kang
Abstract:
Assessing the reproducibility of social science papers is essential for promoting rigor in research processes, but manual assessment is costly. With recent advances in agentic AI systems (i.e., AI agents), we seek to evaluate their capability to automate this process. However, existing benchmarks for reproducing research papers (1) focus solely on reproducing results using provided code and data without assessing their consistency with the paper, (2) oversimplify real-world scenarios, and (3) lack necessary diversity in data formats and programming languages. To address these issues, we introduce REPRO-Bench, a collection of 112 task instances, each representing a social science paper with a publicly available reproduction report. The agents are tasked with assessing the reproducibility of the paper based on the original paper PDF and the corresponding reproduction package. REPRO-Bench features end-to-end evaluation tasks on the reproducibility of social science papers with complexity comparable to real-world assessments. We evaluate three representative AI agents on REPRO-Bench, with the best-performing agent achieving an accuracy of only 21.4%. Building on our empirical analysis, we develop REPRO-Agent, which improves the highest accuracy achieved by existing agents by 71%. We conclude that more advanced AI agents should be developed to automate real-world reproducibility assessment. REPRO-Bench is publicly available at https://github.com/uiuc-kang-lab/REPRO-Bench.
中文: 本研究提出REPRO-Bench评估AI代理在社会科学研究中自动化可复现性评估的能力,发现现有最佳代理仅达21.4%准确率,并通过REPRO-Agent将性能提升71%,表明需开发更先进的AI系统。
English: This study introduces REPRO-Bench to evaluate AI agents' ability to automate reproducibility assessments in social science research, revealing current limitations with top agents achieving only 21.4% accuracy while proposing REPRO-Agent that improves performance by 71%.
Authors:Rongkun Xue, Yazhe Niu, Shuai Hu, Zixin Yin, Yongqiang Yao, Jing Yang
Abstract:
Discrete speech tokenization is a fundamental component in speech codecs. However, in large-scale speech-to-speech systems, the complexity of parallel streams from multiple quantizers and the computational cost of high-time-dimensional codecs pose significant challenges. In this paper, we introduce HH-Codec, a neural codec that achieves extreme compression at 24 tokens per second for 24 kHz audio while relying on single-quantizer inference. Our approach involves a carefully designed Vector Quantization space for Spoken Language Modeling, optimizing compression efficiency while minimizing information loss. Building on this, we propose an asymmetric encoder-decoder architecture (Audio-VQ-Mel-Audio) that leverages dual supervision and progressive training to enhance reconstruction stability and fidelity. HH-Codec achieves state-of-the-art performance in speech reconstruction with an ultra-low bandwidth of 0.3 kbps. We further evaluate its effectiveness in codebook utilization and generative model adaptation, with extensive ablations validating the necessity of each module. HH-Codec is available at https://github.com/opendilab/HH-Codec.
Chinese: HH-Codec 提出了一种采用单量化器设计和非对称编码器-解码器架构的神经语音编解码器,以每秒24个令牌和0.3 kbps的超低带宽实现了最先进的语音压缩与高保真重建。
English: HH-Codec introduces a neural speech codec using a single-quantizer design and an asymmetric encoder-decoder architecture, achieving state-of-the-art compression at 24 tokens per second with 0.3 kbps bandwidth for high-fidelity speech reconstruction.
Authors:Beidi Zhao, SangMook Kim, Hao Chen, Chen Zhou, Zu-hua Gao, Gang Wang, Xiaoxiao Li
Abstract:
Multiple Instance Learning (MIL) has advanced WSI analysis but struggles with the complexity and heterogeneity of WSIs. Existing MIL methods face challenges in aggregating diverse patch information into robust WSI representations. While ViTs and clustering-based approaches show promise, they are computationally intensive and fail to capture task-specific and slide-specific variability. To address these limitations, we propose PTCMIL, a novel Prompt Token Clustering-based ViT for MIL aggregation. By introducing learnable prompt tokens into the ViT backbone, PTCMIL unifies clustering and prediction tasks in an end-to-end manner. It dynamically aligns clustering with downstream tasks, using projection-based clustering tailored to each WSI, reducing complexity while preserving patch heterogeneity. Through token merging and prototype-based pooling, PTCMIL efficiently captures task-relevant patterns. Extensive experiments on eight datasets demonstrate its superior performance in classification and survival analysis tasks, outperforming state-of-the-art methods. Systematic ablation studies confirm its robustness and strong interpretability. The code is released at https://github.com/ubc-tea/PTCMIL.
中文: PTCMIL提出了一种基于提示令牌聚类的视觉变换器,通过动态对齐聚类与下游任务,在降低计算复杂度的同时有效捕获任务相关模式,在WSI分析中展现出卓越性能。
English: PTCMIL introduces a prompt token clustering-based Vision Transformer that dynamically aligns clustering with downstream tasks, achieving superior performance in WSI analysis through efficient task-relevant pattern capture and reduced computational complexity.
Authors:Fabio De Sousa Ribeiro, Omar Todd, Charles Jones, Avinash Kori, Raghav Mehta, Ben Glocker
Abstract:
We introduce the Flow Stochastic Segmentation Network (Flow-SSN), a generative segmentation model family featuring discrete-time autoregressive and modern continuous-time flow variants. We prove fundamental limitations of the low-rank parameterisation of previous methods and show that Flow-SSNs can estimate arbitrarily high-rank pixel-wise covariances without assuming the rank or storing the distributional parameters. Flow-SSNs are also more efficient to sample from than standard diffusion-based segmentation models, thanks to most of the model capacity being allocated to learning the base distribution of the flow, constituting an expressive prior. We apply Flow-SSNs to challenging medical imaging benchmarks and achieve state-of-the-art results. Code available: https://github.com/biomedia-mira/flow-ssn.
中文摘要:Flow-SSN模型系列通过离散和连续时间变体克服了先前方法的秩限制,同时借助表达能力强的先验分布实现高效采样,在医学影像基准测试中取得了最先进的结果。
English Summary: The Flow-SSN model family introduces generative segmentation with discrete and continuous-time variants that overcome previous methods' rank limitations while enabling efficient sampling through an expressive prior, achieving state-of-the-art results on medical imaging benchmarks.
Authors:Maksymilian Wojnar
Abstract:
Recent advances in generative neural networks, particularly flow matching (FM), have enabled the generation of high-fidelity samples while significantly reducing computational costs. A promising application of these models is accelerating simulations in high-energy physics (HEP), helping research institutions meet their increasing computational demands. In this work, we leverage FM to develop surrogate models for fast simulations of zero degree calorimeters in the ALICE experiment. We present an effective training strategy that enables the training of fast generative models with an exceptionally low number of parameters. This approach achieves state-of-the-art simulation fidelity for both neutron (ZN) and proton (ZP) detectors, while offering substantial reductions in computational costs compared to existing methods. Our FM model achieves a Wasserstein distance of 1.27 for the ZN simulation with an inference time of 0.46 ms per sample, compared to the current best of 1.20 with an inference time of approximately 109 ms. The latent FM model further improves the inference speed, reducing the sampling time to 0.026 ms per sample, with a minimal trade-off in accuracy. Similarly, our approach achieves a Wasserstein distance of 1.30 for the ZP simulation, outperforming the current best of 2.08. The source code is available at https://github.com/m-wojnar/faster_zdc.
Chinese: 本研究利用流匹配技术为ALICE实验中的零度量能器开发了高效的替代模型,在显著降低计算成本和加快推理速度的同时,实现了最先进的模拟保真度。
English: This study utilizes flow matching to create efficient surrogate models for simulating zero degree calorimeters in the ALICE experiment, achieving state-of-the-art fidelity with significantly reduced computational costs and faster inference times.
Authors:Miguel Saavedra-Ruiz, Samer B. Nashed, Charlie Gauthier, Liam Paull
Abstract:
Many robotic systems require extended deployments in complex, dynamic environments. In such deployments, parts of the environment may change between subsequent robot observations. Most robotic mapping or environment modeling algorithms are incapable of representing dynamic features in a way that enables predicting their future state. Instead, they opt to filter certain state observations, either by removing them or some form of weighted averaging. This paper introduces Perpetua, a method for modeling the dynamics of semi-static features. Perpetua is able to: incorporate prior knowledge about the dynamics of the feature if it exists, track multiple hypotheses, and adapt over time to enable predicting of future feature states. Specifically, we chain together mixtures of "persistence" and "emergence" filters to model the probability that features will disappear or reappear in a formal Bayesian framework. The approach is an efficient, scalable, general, and robust method for estimating the states of features in an environment, both in the present as well as at arbitrary future times. Through experiments on simulated and real-world data, we find that Perpetua yields better accuracy than similar approaches while also being online adaptable and robust to missing observations.
中文: 本文提出Perpetua方法,通过串联持续性与出现性滤波器对半静态环境特征进行贝叶斯动态建模,实验证明该方法在预测未来状态时具有更高准确性、在线适应性和鲁棒性。
English: This paper introduces Perpetua, a Bayesian method for modeling semi-static environmental features by chaining persistence and emergence filters to predict their future states, demonstrating superior accuracy and adaptability in experiments.
Authors:Yilun Yang, Yekun Chai
Abstract:
Code-mixing, the practice of switching between languages within a conversation, poses unique challenges for traditional NLP. Existing benchmarks are limited by their narrow language pairs and tasks, failing to adequately assess large language models' (LLMs) code-mixing abilities. Despite the recognized importance of code-mixing for multilingual users, research on LLMs in this context remains sparse. Additionally, current techniques for synthesizing code-mixed data are underdeveloped to generate code-mixing. In response, we introduce CodeMixBench, a comprehensive benchmark covering eight tasks, including three specific to LLMs and five traditional NLP tasks, and 18 languages across seven language families. We also propose a new method for generating large-scale synthetic code-mixed texts by combining word substitution with GPT-4 prompting. Our evaluation reveals consistent underperformance of LLMs on code-mixed datasets involving different language families. Enhancements in training data size, model scale, and few-shot learning could improve their performance. The code and dataset are available at https://github.com/Jeromeyluck/CodeMixBench.
Chinese: CodeMixBench通过涵盖多种语言对和任务的综合基准评估大语言模型的语码混合能力,揭示了其表现不足的问题,并提出了改进的合成数据生成方法。
English: CodeMixBench addresses the limitations of existing benchmarks by evaluating LLMs' code-mixing abilities across diverse language pairs and tasks, revealing their consistent underperformance and proposing improved synthetic data generation methods.
Authors:Yiguo He, Xinjun Cheng, Junjie Zhu, Chunping Qiu, Jun Wang, Xichuan Zhang, Qiangjuan Huang, Ke Yang
Abstract:
Vision Language Models (VLMs) have achieved remarkable breakthroughs in the field of remote sensing in recent years. Synthetic Aperture Radar (SAR) imagery, with its all-weather capability, is essential in remote sensing, yet the lack of large-scale, high-quality SAR image-text datasets hinders its semantic understanding. In this paper, we construct SAR-TEXT, a large-scale and high-quality dataset consisting of over 130,000 SAR image-text pairs. To construct the SAR-TEXT dataset, we design the SAR-Narrator framework, which generates textual descriptions for SAR images through a multi-stage strategy. To verify the effectiveness of the SAR-TEXT dataset, we conduct experiments on three typical vision-language tasks: image-text retrieval, image captioning, and visual question answering (VQA). Specifically, we construct three representative models on SAR-TEXT: SAR-RS-CLIP, SAR-RS-CoCa, and SAR-GPT. SAR-RS-CLIP achieves notable improvements in retrieval performance, boosting average recall by 12.97% and 10.0% on the OSdataset_512 and HRSID test sets, respectively. In the captioning task, SAR-RS-CoCa achieves significant improvements over the original CoCa models in terms of BLEU-4, SPICE, and CIDEr scores. In the VQA task, SAR-GPT outperforms baseline and single-stage models on multiple SAR-VQA datasets, demonstrating stronger semantic understanding and reasoning ability, as further confirmed by qualitative results. It is worth noting that, as a flexible captioning tool, SAR-Narrator can be readily adopted by the community to construct larger-scale SAR image-text datasets. All code, pretrained models, and the SAR-Text dataset are publicly available at: https://github.com/YiguoHe/SAR-TEXT.
中文: 本文提出了SAR-TEXT,一个包含超过13万对SAR图像-文本的大规模高质量数据集,通过SAR-Narrator框架生成,显著提升了遥感任务中的检索、描述和视觉问答性能。
English: This paper introduces SAR-TEXT, a large-scale dataset of over 130,000 SAR image-text pairs generated via the SAR-Narrator framework, which significantly enhances performance in remote sensing tasks like retrieval, captioning, and VQA.
Authors:Yiguo He, Xinjun Cheng, Junjie Zhu, Chunping Qiu, Jun Wang, Xichuan Zhang, Qiangjuan Huang, Ke Yang
Abstract:
Vision Language Models (VLMs) have achieved remarkable breakthroughs in the field of remote sensing in recent years. Synthetic Aperture Radar (SAR) imagery, with its all-weather capability, is essential in remote sensing, yet the lack of large-scale, high-quality SAR image-text datasets hinders its semantic understanding. In this paper, we construct SAR-TEXT, a large-scale and high-quality dataset consisting of over 130,000 SAR image-text pairs. To construct the SAR-TEXT dataset, we design the SAR-Narrator framework, which generates textual descriptions for SAR images through a multi-stage strategy. To verify the effectiveness of the SAR-TEXT dataset, we conduct experiments on three typical vision-language tasks: image-text retrieval, image captioning, and visual question answering (VQA). Specifically, we construct three representative models on SAR-TEXT: SAR-RS-CLIP, SAR-RS-CoCa, and SAR-GPT. SAR-RS-CLIP achieves notable improvements in retrieval performance, boosting average recall by 12.97% and 10.0% on the OSdataset_512 and HRSID test sets, respectively. In the captioning task, SAR-RS-CoCa achieves significant improvements over the original CoCa models in terms of BLEU-4, SPICE, and CIDEr scores. In the VQA task, SAR-GPT outperforms baseline and single-stage models on multiple SAR-VQA datasets, demonstrating stronger semantic understanding and reasoning ability, as further confirmed by qualitative results. It is worth noting that, as a flexible captioning tool, SAR-Narrator can be readily adopted by the community to construct larger-scale SAR image-text datasets. All code, pretrained models, and the SAR-Text dataset are publicly available at: https://github.com/YiguoHe/SAR-TEXT.
中文: 本文提出了SAR-TEXT,一个包含超过13万对SAR图像-文本的大规模高质量数据集,通过SAR-Narrator框架生成,显著提升了遥感任务中的检索、描述和视觉问答性能。
English: This paper introduces SAR-TEXT, a large-scale dataset of over 130,000 SAR image-text pairs generated via the SAR-Narrator framework, which significantly enhances performance in remote sensing tasks like retrieval, captioning, and VQA.
Authors:VÃctor Gallego
Abstract:
Language models (LMs) are susceptible to in-context reward hacking, where they exploit flaws in tainted or faulty written specifications or rubrics to achieve high scores without fulfilling the user's true intent. We introduce Specification Self-Correction (SSC), a novel, test-time framework that enables an LM to identify and correct flaws within its own guiding specification. SSC employs a multi-step inference process where the model first generates a response based on a potentially tainted specification, critiques its output, and then revises the specification itself to remove the exploitable loophole. A final, more robust response is then generated using this self-corrected specification. Across experiments spanning creative writing and agentic coding tasks with several LMs, we demonstrate that while models initially game tainted specifications in 50-70\% of cases, the SSC process reduces this vulnerability by over 90\%. This dynamic repair occurs at inference time, requires no weight modification, and leads to more robustly aligned model behavior. Code at https://github.com/vicgalle/specification-self-correction .
中文: 语言模型会利用有缺陷的规范来获取高分却未满足用户真实意图,而提出的规范自校正框架能让模型在推理时识别并修正这些缺陷,无需调整权重即可将漏洞减少90%以上。
English: Language models can exploit flawed specifications to achieve high scores without meeting user intent, but the proposed Specification Self-Correction framework enables them to identify and correct these flaws at inference time, reducing vulnerability by over 90% without weight modifications.
Authors:Jake McNaughton, Mohamed Hibat-Allah
Abstract:
Neural-network quantum states (NQS) are powerful neural-network ansätzes that have emerged as promising tools for studying quantum many-body physics through the lens of the variational principle. These architectures are known to be systematically improvable by increasing the number of parameters. Here we demonstrate an Adaptive scheme to optimize NQSs, through the example of recurrent neural networks (RNN), using a fraction of the computation cost while reducing training fluctuations and improving the quality of variational calculations targeting ground states of prototypical models in one- and two-spatial dimensions. This Adaptive technique reduces the computational cost through training small RNNs and reusing them to initialize larger RNNs. This work opens up the possibility for optimizing graphical processing unit (GPU) resources deployed in large-scale NQS simulations.
中文: 自适应方案通过训练小型循环神经网络并复用其参数来初始化更大的网络,从而降低计算成本并提升量子多体模型基态变分计算的质量。
English: The Adaptive scheme optimizes neural-network quantum states by training small recurrent neural networks and reusing them to initialize larger ones, reducing computational costs and improving variational ground state calculations in quantum many-body models.
Authors:Siyu Mu, Wei Xuan Chan, Choon Hwai Yap
Abstract:
The unloaded cardiac geometry (i.e., the state of the heart devoid of luminal pressure) serves as a valuable zero-stress and zero-strain reference and is critical for personalized biomechanical modeling of cardiac function, to understand both healthy and diseased physiology and to predict the effects of cardiac interventions. However, estimating the unloaded geometry from clinical images remains a challenging task. Traditional approaches rely on inverse finite element (FE) solvers that require iterative optimization and are computationally expensive. In this work, we introduce HeartUnloadNet, a deep learning framework that predicts the unloaded left ventricular (LV) shape directly from the end diastolic (ED) mesh while explicitly incorporating biophysical priors. The network accepts a mesh of arbitrary size along with physiological parameters such as ED pressure, myocardial stiffness scale, and fiber helix orientation, and outputs the corresponding unloaded mesh. It adopts a graph attention architecture and employs a cycle-consistency strategy to enable bidirectional (loading and unloading) prediction, allowing for partial self-supervision that improves accuracy and reduces the need for large training datasets. Trained and tested on 20,700 FE simulations across diverse LV geometries and physiological conditions, HeartUnloadNet achieves sub-millimeter accuracy, with an average DSC of 0.986 and HD of 0.083 cm, while reducing inference time to just 0.02 seconds per case, over 10^5 times faster and significantly more accurate than traditional inverse FE solvers. Ablation studies confirm the effectiveness of the architecture. Notably, the cycle-consistent design enables the model to maintain a DSC of 97% even with as few as 200 training samples. This work thus presents a scalable and accurate surrogate for inverse FE solvers, supporting real-time clinical applications in the future.
中文: HeartUnloadNet是一种深度学习框架,通过结合生物物理先验和循环一致性策略,能直接从临床图像中精确快速地预测左心室无负载几何形状,其精度达亚毫米级,计算效率比传统方法提高超过10万倍。
English: HeartUnloadNet is a deep learning framework that accurately and rapidly predicts the unloaded left ventricular geometry from clinical images using biophysical priors and cycle-consistency, achieving sub-millimeter precision and reducing computation time by over 100,000 times compared to traditional methods.
Authors:James Dickens, Kamyar Hamad
Abstract:
Recent advances in point cloud deep learning have led to models that achieve high per-part labeling accuracy on large-scale point clouds, using only the raw geometry of unordered point sets. In parallel, the field of human parsing focuses on predicting body part and clothing/accessory labels from images. This work aims to bridge these two domains by enabling per-vertex semantic segmentation of large-scale human meshes. To achieve this, a pseudo-ground truth labeling pipeline is developed for the Thuman2.1 dataset: meshes are first aligned to a canonical pose, segmented from multiple viewpoints, and the resulting point-level labels are then backprojected onto the original mesh to produce per-point pseudo ground truth annotations. Subsequently, a novel, memory-efficient sampling strategy is introduced, a windowed iterative farthest point sampling (FPS) with space-filling curve-based serialization to effectively downsample the point clouds. This is followed by a purely geometric segmentation using PointTransformer, enabling semantic parsing of human meshes without relying on texture information. Experimental results confirm the effectiveness and accuracy of the proposed approach. Project code and pre-processed data is available at https://github.com/JamesMcCullochDickens/Human3DParsing/tree/master.
中文摘要:本研究通过开发伪标注生成流程和结合内存高效采样策略与PointTransformer的方法,实现了仅利用几何数据对人体网格进行语义分割,从而连接了点云深度学习与人体解析两个领域。
English Summary: This study bridges point cloud deep learning and human parsing by enabling semantic segmentation of human meshes using only geometric data, achieved through a pseudo-ground truth pipeline and a memory-efficient sampling strategy combined with PointTransformer.
Authors:Hao Li, Lijun Li, Zhenghao Lu, Xianyi Wei, Rui Li, Jing Shao, Lei Sha
Abstract:
With rapid advancement and increasing accessibility of LLMs, fine-tuning aligned models has become a critical step for adapting them to real-world applications, which makes the safety of this fine-tuning process more important than ever. However, recent studies have highlighted a critical challenge: even when fine-tuning with seemingly benign downstream datasets, the safety of aligned LLMs can be compromised, making them more susceptible to malicious instructions.
In this paper, we show that fine-tuning datasets often contain samples with safety-degrading features that are not easily identifiable on the surface. These samples can significantly degrade the safety alignment of LLMs during fine-tuning. To address this issue, we propose LARF, a Layer-Aware Representation Filtering method. This method identifies safety-sensitive layers within the LLM and leverages their representations to detect which data samples in the post-training dataset contain safety-degrading features.
Experimental results demonstrate that LARF can effectively identify benign data with safety-degrading features. After removing such data, the safety alignment degradation caused by fine-tuning is mitigated. Please see our code at https://github.com/LLLeoLi/LARF.
中文: 本文提出LARF方法,通过识别并移除微调数据中损害安全性的样本,有效缓解大语言模型在适配应用时的安全对齐退化问题。
English: The paper introduces LARF, a layer-aware filtering method that identifies and removes safety-degrading data in fine-tuning datasets to preserve LLM safety alignment during adaptation.
Authors:Xiaopeng Ke, Hexuan Deng, Xuebo Liu, Jun Rao, Zhenxi Song, Jun Yu, Min Zhang
Abstract:
Despite the impressive performance of large language models (LLMs) in general domains, they often underperform in specialized domains. Existing approaches typically rely on data synthesis methods and yield promising results by using unlabeled data to capture domain-specific features. However, these methods either incur high computational costs or suffer from performance limitations, while also demonstrating insufficient generalization across different tasks. To address these challenges, we propose AQuilt, a framework for constructing instruction-tuning data for any specialized domains from corresponding unlabeled data, including Answer, Question, Unlabeled data, Inspection, Logic, and Task type. By incorporating logic and inspection, we encourage reasoning processes and self-inspection to enhance model performance. Moreover, customizable task instructions enable high-quality data generation for any task. As a result, we construct a dataset of 703k examples to train a powerful data synthesis model. Experiments show that AQuilt is comparable to DeepSeek-V3 while utilizing just 17% of the production cost. Further analysis demonstrates that our generated data exhibits higher relevance to downstream tasks. Source code, models, and scripts are available at https://github.com/Krueske/AQuilt.
Chinese Summary: AQuilt框架通过从未标注数据中构建指令调优数据,以仅17%的成本实现了与DeepSeek-V3相当的性能,同时在下游任务中展现出更高的数据相关性。
English Summary: The AQuilt framework efficiently generates high-quality instruction-tuning data from unlabeled domain-specific data, achieving performance comparable to DeepSeek-V3 at just 17% of the cost while demonstrating superior task relevance.
Authors:Jiahao Wang, Ramen Liu, Longhui Zhang, Jing Li
Abstract:
This paper presents our system for CCL25-Eval Task 10, addressing Fine-Grained Chinese Hate Speech Recognition (FGCHSR). We propose a novel SRAG-MAV framework that synergistically integrates task reformulation(TR), Self-Retrieval-Augmented Generation (SRAG), and Multi-Round Accumulative Voting (MAV). Our method reformulates the quadruplet extraction task into triplet extraction, uses dynamic retrieval from the training set to create contextual prompts, and applies multi-round inference with voting to improve output stability and performance. Our system, based on the Qwen2.5-7B model, achieves a Hard Score of 26.66, a Soft Score of 48.35, and an Average Score of 37.505 on the STATE ToxiCN dataset, significantly outperforming baselines such as GPT-4o (Average Score 15.63) and fine-tuned Qwen2.5-7B (Average Score 35.365). The code is available at https://github.com/king-wang123/CCL25-SRAG-MAV.
中文:本文提出SRAG-MAV框架用于细粒度中文仇恨言论识别,通过任务重构、动态检索增强生成与多轮累积投票机制,基于Qwen2.5-7B模型在ToxiCN数据集上取得了超越基线模型的优异性能。
English: This paper introduces the SRAG-MAV framework for Fine-Grained Chinese Hate Speech Recognition, which reformulates quadruplet extraction into triplet tasks, employs dynamic retrieval and multi-round voting to enhance stability, and achieves state-of-the-art performance on the ToxiCN dataset using the Qwen2.5-7B model.
Authors:Liyuan Chen, Shuoling Liu, Jiangpeng Yan, Xiaoyu Wang, Henglin Liu, Chuang Li, Kecheng Jiao, Jixuan Ying, Yang Veronica Liu, Qiang Yang, Xiu Li
Abstract:
The advent of foundation models (FMs) - large-scale pre-trained models with strong generalization capabilities - has opened new frontiers for financial engineering. While general-purpose FMs such as GPT-4 and Gemini have demonstrated promising performance in tasks ranging from financial report summarization to sentiment-aware forecasting, many financial applications remain constrained by unique domain requirements such as multimodal reasoning, regulatory compliance, and data privacy. These challenges have spurred the emergence of Financial Foundation Models (FFMs) - a new class of models explicitly designed for finance. This survey presents a comprehensive overview of FFMs, with a taxonomy spanning three key modalities: Financial Language Foundation Models (FinLFMs), Financial Time-Series Foundation Models (FinTSFMs), and Financial Visual-Language Foundation Models (FinVLFMs). We review their architectures, training methodologies, datasets, and real-world applications. Furthermore, we identify critical challenges in data availability, algorithmic scalability, and infrastructure constraints, and offer insights into future research opportunities. We hope this survey serves as both a comprehensive reference for understanding FFMs and a practical roadmap for future innovation. An updated collection of FFM-related publications and resources will be maintained on our website https://github.com/FinFM/Awesome-FinFMs.
Chinese: 基础模型正在革新金融工程,催生了专门应对多模态推理和监管合规等金融领域挑战的金融基础模型,本综述系统梳理了其架构分类、应用场景及未来研究方向,为领域发展提供路线图。
English: Foundation models are revolutionizing financial engineering by enabling specialized Financial Foundation Models that address domain-specific challenges like multimodal reasoning and regulatory compliance, with this survey providing a comprehensive taxonomy and analysis of their architectures, applications, and future research directions.
Authors:Xinyu Wang, Jinghua Hou, Zhe Liu, Yingying Zhu
Abstract:
Transformer-based methods have demonstrated remarkable capabilities in 3D semantic segmentation through their powerful attention mechanisms, but the quadratic complexity limits their modeling of long-range dependencies in large-scale point clouds. While recent Mamba-based approaches offer efficient processing with linear complexity, they struggle with feature representation when extracting 3D features. However, effectively combining these complementary strengths remains an open challenge in this field. In this paper, we propose HybridTM, the first hybrid architecture that integrates Transformer and Mamba for 3D semantic segmentation. In addition, we propose the Inner Layer Hybrid Strategy, which combines attention and Mamba at a finer granularity, enabling simultaneous capture of long-range dependencies and fine-grained local features. Extensive experiments demonstrate the effectiveness and generalization of our HybridTM on diverse indoor and outdoor datasets. Furthermore, our HybridTM achieves state-of-the-art performance on ScanNet, ScanNet200, and nuScenes benchmarks. The code will be made available at https://github.com/deepinact/HybridTM.
中文: HybridTM是一种创新的混合架构,将Transformer和Mamba相结合用于3D语义分割,有效整合两者的互补优势以捕捉长程依赖关系和细粒度局部特征,在多个基准测试中实现了最先进的性能。
English: HybridTM is a novel hybrid architecture that integrates Transformer and Mamba for 3D semantic segmentation, effectively combining their complementary strengths to capture long-range dependencies and fine-grained local features, achieving state-of-the-art performance on multiple benchmarks.
Authors:Baoyao Yang, Wanyun Li, Dixin Chen, Junxiang Chen, Wenbin Yao, Haifeng Lin
Abstract:
This paper introduces VideoMind, a video-centric omni-modal dataset designed for deep video content cognition and enhanced multi-modal feature representation. The dataset comprises 103K video samples (3K reserved for testing), each paired with audio and systematically detailed textual descriptions. Specifically, every video and its audio is described across three hierarchical layers (factual, abstract, and intent), progressing from surface to depth. It contains over 22 million words, averaging ~225 words per sample. VideoMind's key distinction from existing datasets is its provision of intent expressions, which require contextual integration across the entire video and are not directly observable. These deep-cognitive expressions are generated using a Chain-of-Thought (COT) approach, prompting the mLLM through step-by-step reasoning. Each description includes annotations for subject, place, time, event, action, and intent, supporting downstream recognition tasks. Crucially, we establish a gold-standard benchmark with 3,000 manually validated samples for evaluating deep-cognitive video understanding. We design hybrid-cognitive retrieval experiments, scored by multi-level retrieval metrics, to appropriately assess deep video comprehension. Evaluation results for models (e.g., InternVideo, VAST, UMT-L) are released. VideoMind serves as a powerful benchmark for fine-grained cross-modal alignment and advances fields requiring in-depth video understanding, such as emotion and intent recognition. The data is publicly available on GitHub, HuggingFace, and OpenDataLab, https://github.com/cdx-cindy/VideoMind.
中文: 本文介绍了VideoMind视频数据集,该数据集通过链式思维方法生成包含意图表达的多层次文本描述,为深度视频理解和跨模态对齐任务提供了重要基准。
English: This paper presents VideoMind, a comprehensive video dataset with audio and multi-layered textual descriptions that uniquely includes intent expressions generated via Chain-of-Thought reasoning, serving as a benchmark for deep video understanding and cross-modal alignment tasks.
Authors:Daniil Morozov, Reuben Dorent, Nazim Haouchine
Abstract:
Intraoperative registration of real-time ultrasound (iUS) to preoperative Magnetic Resonance Imaging (MRI) remains an unsolved problem due to severe modality-specific differences in appearance, resolution, and field-of-view. To address this, we propose a novel 3D cross-modal keypoint descriptor for MRI-iUS matching and registration. Our approach employs a patient-specific matching-by-synthesis approach, generating synthetic iUS volumes from preoperative MRI. This enables supervised contrastive training to learn a shared descriptor space.
A probabilistic keypoint detection strategy is then employed to identify anatomically salient and modality-consistent locations. During training, a curriculum-based triplet loss with dynamic hard negative mining is used to learn descriptors that are i) robust to iUS artifacts such as speckle noise and limited coverage, and ii) rotation-invariant . At inference, the method detects keypoints in MR and real iUS images and identifies sparse matches, which are then used to perform rigid registration. Our approach is evaluated using 3D MRI-iUS pairs from the ReMIND dataset. Experiments show that our approach outperforms state-of-the-art keypoint matching methods across 11 patients, with an average precision of $69.8\%$. For image registration, our method achieves a competitive mean Target Registration Error of 2.39 mm on the ReMIND2Reg benchmark.
Compared to existing iUS-MR registration approach, our framework is interpretable, requires no manual initialization, and shows robustness to iUS field-of-view variation. Code is available at https://github.com/morozovdd/CrossKEY.
中文: 本研究提出一种新型三维跨模态关键点描述符用于MRI-超声配准,通过合成超声生成和对比学习实现无需人工初始化的鲁棒配准,在ReMIND数据集上以69.8%的匹配精度和2.39毫米配准误差超越现有方法。
English: This study introduces a novel 3D cross-modal keypoint descriptor for MRI-ultrasound registration, using synthetic ultrasound generation and contrastive learning to achieve robust, interpretable alignment without manual initialization, outperforming existing methods with 69.8% precision and 2.39 mm registration error.
Authors:Urchade Zaratiana, Gil Pasternak, Oliver Boyd, George Hurn-Maloney, Ash Lewis
Abstract:
Information extraction (IE) is fundamental to numerous NLP applications, yet existing solutions often require specialized models for different tasks or rely on computationally expensive large language models. We present GLiNER2, a unified framework that enhances the original GLiNER architecture to support named entity recognition, text classification, and hierarchical structured data extraction within a single efficient model. Built pretrained transformer encoder architecture, GLiNER2 maintains CPU efficiency and compact size while introducing multi-task composition through an intuitive schema-based interface. Our experiments demonstrate competitive performance across extraction and classification tasks with substantial improvements in deployment accessibility compared to LLM-based alternatives. We release GLiNER2 as an open-source pip-installable library with pre-trained models and documentation at https://github.com/fastino-ai/GLiNER2.
Chinese: GLiNER2 是一个统一且高效的框架,能在单一模型中支持命名实体识别、文本分类等多种自然语言处理任务,相比专用模型或基于大语言模型的方案,它性能优异且部署便捷。
English: GLiNER2 is a unified and efficient framework that supports multiple NLP tasks like named entity recognition and text classification within a single model, offering competitive performance and easy deployment compared to specialized or LLM-based solutions.
Authors:Zihang Li, Hao Xie, Xinyang Dong, Lei Wang
Abstract:
We develop a deep variational free energy framework to compute the equation of state of hydrogen in the warm dense matter region. This method parameterizes the variational density matrix of hydrogen nuclei and electrons at finite temperature using three deep generative models: a normalizing flow model that represents the Boltzmann distribution of the classical nuclei, an autoregressive transformer that models the distribution of electrons in excited states, and a permutational equivariant flow model that constructs backflow coordinates for electrons in Hartree-Fock orbitals. By jointly optimizing the three neural networks to minimize the variational free energy, we obtain the equation of state and related thermodynamic properties of dense hydrogen. We compare our results with other theoretical and experimental results on the deuterium Hugoniot curve, aiming to resolve existing discrepancies. The calculated results provide a valuable benchmark for deuterium in the warm dense matter region.
中文: 本研究开发了一种深度变分自由能方法,通过三种神经网络模型计算氢在温稠密物质区的状态方程,为氘提供了基准数据,旨在解决现有研究中的差异。
English: This study introduces a deep variational free energy approach using three neural networks to compute hydrogen's equation of state in warm dense matter, offering a benchmark for deuterium by resolving discrepancies in existing data.
Authors:Zhekai Chen, Ruihang Chu, Yukang Chen, Shiwei Zhang, Yujie Wei, Yingya Zhang, Xihui Liu
Abstract:
Scaling visual generation models is essential for real-world content creation, yet requires substantial training and computational expenses. Alternatively, test-time scaling has garnered growing attention due to resource efficiency and promising performance. In this work, we present TTS-VAR, the first general test-time scaling framework for visual auto-regressive (VAR) models, modeling the generation process as a path searching problem. To dynamically balance computational efficiency with exploration capacity, we first introduce an adaptive descending batch size schedule throughout the causal generation process. Besides, inspired by VAR's hierarchical coarse-to-fine multi-scale generation, our framework integrates two key components: (i) At coarse scales, we observe that generated tokens are hard for evaluation, possibly leading to erroneous acceptance of inferior samples or rejection of superior samples. Noticing that the coarse scales contain sufficient structural information, we propose clustering-based diversity search. It preserves structural variety through semantic feature clustering, enabling later selection on samples with higher potential. (ii) In fine scales, resampling-based potential selection prioritizes promising candidates using potential scores, which are defined as reward functions incorporating multi-scale generation history. Experiments on the powerful VAR model Infinity show a notable 8.7% GenEval score improvement (from 0.69 to 0.75). Key insights reveal that early-stage structural features effectively influence final quality, and resampling efficacy varies across generation scales. Code is available at https://github.com/ali-vilab/TTS-VAR.
Chinese: TTS-VAR提出了一种视觉自回归模型的测试时扩展框架,通过自适应批次调度、基于聚类的多样性搜索保留结构信息,以及基于重采样的潜力选择来提升生成质量,使GenEval得分显著提高了8.7%。
English: TTS-VAR introduces a test-time scaling framework for visual auto-regressive models that employs adaptive batch scheduling, clustering-based diversity search for structural preservation, and resampling-based potential selection to enhance generation quality, achieving an 8.7% improvement in GenEval scores.
Authors:Tianheng Qiu, Jingchun Gao, Jingyu Li, Huiyi Leong, Xuan Huang, Xi Wang, Xiaocheng Zhang, Kele Xu, Lan Zhang
Abstract:
Intent-oriented controlled video captioning aims to generate targeted descriptions for specific targets in a video based on customized user intent. Current Large Visual Language Models (LVLMs) have gained strong instruction following and visual comprehension capabilities. Although the LVLMs demonstrated proficiency in spatial and temporal understanding respectively, it was not able to perform fine-grained spatial control in time sequences in direct response to instructions. This substantial spatio-temporal gap complicates efforts to achieve fine-grained intention-oriented control in video. Towards this end, we propose a novel IntentVCNet that unifies the temporal and spatial understanding knowledge inherent in LVLMs to bridge the spatio-temporal gap from both prompting and model perspectives. Specifically, we first propose a prompt combination strategy designed to enable LLM to model the implicit relationship between prompts that characterize user intent and video sequences. We then propose a parameter efficient box adapter that augments the object semantic information in the global visual context so that the visual token has a priori information about the user intent. The final experiment proves that the combination of the two strategies can further enhance the LVLM's ability to model spatial details in video sequences, and facilitate the LVLMs to accurately generate controlled intent-oriented captions. Our proposed method achieved state-of-the-art results in several open source LVLMs and was the runner-up in the IntentVC challenge. Our code is available on https://github.com/thqiu0419/IntentVCNet.
中文: 提出的IntentVCNet通过结合提示策略和框适配器,弥合了大型视觉语言模型中的时空差距,实现了细粒度的意图导向视频描述,并取得了最先进的成果。
English: The proposed IntentVCNet bridges the spatio-temporal gap in large visual language models by combining prompt strategies and a box adapter, enabling fine-grained intent-oriented video captioning and achieving state-of-the-art results.
Authors:Clément Cornet, Romaric Besançon, Hervé Le Borgne
Abstract:
Sparse autoencoders (SAEs) have emerged as a powerful technique for extracting human-interpretable features from neural networks activations. Previous works compared different models based on SAE-derived features but those comparisons have been restricted to models within the same modality. We propose a novel indicator allowing quantitative comparison of models across SAE features, and use it to conduct a comparative study of visual, textual and multimodal encoders. We also propose to quantify the Comparative Sharedness of individual features between different classes of models. With these two new tools, we conduct several studies on 21 encoders of the three types, with two significantly different sizes, and considering generalist and domain specific datasets. The results allow to revisit previous studies at the light of encoders trained in a multimodal context and to quantify to which extent all these models share some representations or features. They also suggest that visual features that are specific to VLMs among vision encoders are shared with text encoders, highlighting the impact of text pretraining. The code is available at https://github.com/CEA-LIST/SAEshareConcepts
中文摘要:稀疏自编码器通过新型量化指标和特征共享度评估,实现了跨模态模型的比较研究,发现多模态视觉模型中的特定视觉特征与文本编码器共享,体现了文本预训练的影响。
English Summary: Sparse autoencoders enable cross-modal model comparisons through a novel quantitative indicator and comparative sharedness metric, revealing that visual features in multimodal models overlap with text encoders due to pretraining influences.
Authors:Zongzheng Zhang, Xuchong Qiu, Boran Zhang, Guantian Zheng, Xunjiang Gu, Guoxuan Chi, Huan-ang Gao, Leichen Wang, Ziming Liu, Xinrun Li, Igor Gilitschenski, Hongyang Li, Hang Zhao, Hao Zhao
Abstract:
Recent advances in autonomous driving are moving towards mapless approaches, where High-Definition (HD) maps are generated online directly from sensor data, reducing the need for expensive labeling and maintenance. However, the reliability of these online-generated maps remains uncertain. While incorporating map uncertainty into downstream trajectory prediction tasks has shown potential for performance improvements, current strategies provide limited insights into the specific scenarios where this uncertainty is beneficial. In this work, we first analyze the driving scenarios in which mapping uncertainty has the greatest positive impact on trajectory prediction and identify a critical, previously overlooked factor: the agent's kinematic state. Building on these insights, we propose a novel Proprioceptive Scenario Gating that adaptively integrates map uncertainty into trajectory prediction based on forecasts of the ego vehicle's future kinematics. This lightweight, self-supervised approach enhances the synergy between online mapping and trajectory prediction, providing interpretability around where uncertainty is advantageous and outperforming previous integration methods. Additionally, we introduce a Covariance-based Map Uncertainty approach that better aligns with map geometry, further improving trajectory prediction. Extensive ablation studies confirm the effectiveness of our approach, achieving up to 23.6% improvement in mapless trajectory prediction performance over the state-of-the-art method using the real-world nuScenes driving dataset. Our code, data, and models are publicly available at https://github.com/Ethan-Zheng136/Map-Uncertainty-for-Trajectory-Prediction.
Chinese: 本研究提出了一种本体感知场景门控机制,通过基于自车运动状态预测自适应融合地图不确定性到轨迹预测中,在无地图自动驾驶系统中实现了高达23.6%的性能提升。
English: This study introduces a proprioceptive scenario gating mechanism that adaptively integrates map uncertainty into trajectory prediction based on ego-vehicle kinematic forecasts, achieving up to 23.6% performance improvement in mapless autonomous driving systems.
Authors:Frauke Wilm, Luis Carlos Rivera Monroy, Mathias Ãttl, Lukas Mürdter, Leonid Mill, Andreas Maier
Abstract:
Accurate detection of Plasmodium falciparum in Giemsa-stained blood smears is an essential component of reliable malaria diagnosis, especially in developing countries. Deep learning-based object detection methods have demonstrated strong potential for automated Malaria diagnosis, but their adoption is limited by the scarcity of datasets with detailed instance-level annotations. In this work, we present an enhanced version of the publicly available NIH malaria dataset, with detailed bounding box annotations in COCO format to support object detection training. We validated the revised annotations by training a Faster R-CNN model to detect infected and non-infected red blood cells, as well as white blood cells. Cross-validation on the original dataset yielded F1 scores of up to 0.88 for infected cell detection. These results underscore the importance of annotation volume and consistency, and demonstrate that automated annotation refinement combined with targeted manual correction can produce training data of sufficient quality for robust detection performance. The updated annotations set is publicly available via GitHub: https://github.com/MIRA-Vision-Microscopy/malaria-thin-smear-coco.
中文: 本研究通过完善疟疾数据集的精细化标注,并采用Faster R-CNN模型验证,证明提升标注质量可实现可靠的感染细胞自动检测,最高F1分数达0.88。
English: This study enhances a malaria dataset with detailed object detection annotations and demonstrates through Faster R-CNN validation that improved annotation quality enables robust automated detection of infected cells, achieving an F1 score up to 0.88.
Authors:Francesco Dalmonte, Emirhan Bayar, Emre Akbas, Mariana-Iuliana Georgescu
Abstract:
Anomaly detection in medical images is an important yet challenging task due to the diversity of possible anomalies and the practical impossibility of collecting comprehensively annotated data sets. In this work, we tackle unsupervised medical anomaly detection proposing a modernized autoencoder-based framework, the Q-Former Autoencoder, that leverages state-of-the-art pretrained vision foundation models, such as DINO, DINOv2 and Masked Autoencoder. Instead of training encoders from scratch, we directly utilize frozen vision foundation models as feature extractors, enabling rich, multi-stage, high-level representations without domain-specific fine-tuning. We propose the usage of the Q-Former architecture as the bottleneck, which enables the control of the length of the reconstruction sequence, while efficiently aggregating multiscale features. Additionally, we incorporate a perceptual loss computed using features from a pretrained Masked Autoencoder, guiding the reconstruction towards semantically meaningful structures. Our framework is evaluated on four diverse medical anomaly detection benchmarks, achieving state-of-the-art results on BraTS2021, RESC, and RSNA. Our results highlight the potential of vision foundation model encoders, pretrained on natural images, to generalize effectively to medical image analysis tasks without further fine-tuning. We release the code and models at https://github.com/emirhanbayar/QFAE.
中文: 本研究提出Q-Former自编码器,这是一种无监督医学异常检测框架,利用预训练的冻结视觉模型提取丰富特征,无需领域特定微调即在多个基准测试中取得最优性能。
English: This study introduces the Q-Former Autoencoder, an unsupervised medical anomaly detection framework that leverages frozen pretrained vision models to extract rich features and achieves state-of-the-art results on multiple benchmarks without domain-specific fine-tuning.
Authors:Haoran Xu, Saining Zhang, Peishuo Li, Baijun Ye, Xiaoxue Chen, Huan-ang Gao, Jv Zheng, Xiaowei Song, Ziqiao Peng, Run Miao, Jinrang Jia, Yifeng Shi, Guangqi Yi, Hang Zhao, Hao Tang, Hongyang Li, Kaicheng Yu, Hao Zhao
Abstract:
Vehicle-to-everything (V2X) communication plays a crucial role in autonomous driving, enabling cooperation between vehicles and infrastructure. While simulation has significantly contributed to various autonomous driving tasks, its potential for data generation and augmentation in V2X scenarios remains underexplored. In this paper, we introduce CRUISE, a comprehensive reconstruction-and-synthesis framework designed for V2X driving environments. CRUISE employs decomposed Gaussian Splatting to accurately reconstruct real-world scenes while supporting flexible editing. By decomposing dynamic traffic participants into editable Gaussian representations, CRUISE allows for seamless modification and augmentation of driving scenes. Furthermore, the framework renders images from both ego-vehicle and infrastructure views, enabling large-scale V2X dataset augmentation for training and evaluation. Our experimental results demonstrate that: 1) CRUISE reconstructs real-world V2X driving scenes with high fidelity; 2) using CRUISE improves 3D detection across ego-vehicle, infrastructure, and cooperative views, as well as cooperative 3D tracking on the V2X-Seq benchmark; and 3) CRUISE effectively generates challenging corner cases.
中文: 本文提出CRUISE框架,通过分解式高斯溅射重建和编辑车联网驾驶场景,实现高保真数据增强,有效提升自动驾驶在检测与跟踪任务中的性能表现。
English: This paper introduces CRUISE, a reconstruction-and-synthesis framework that uses decomposed Gaussian Splatting to reconstruct and edit V2X driving scenes, enabling high-fidelity data augmentation and improving autonomous driving performance in detection and tracking tasks.
Authors:Miguel Aspis, Sebastián A. Cajas Ordónez, Andrés L. Suárez-Cetrulo, Ricardo Simón Carbajo
Abstract:
Learning from non-stationary data streams subject to concept drift requires models that can adapt on-the-fly while remaining resource-efficient. Existing adaptive ensemble methods often rely on coarse-grained adaptation mechanisms or simple voting schemes that fail to optimally leverage specialized knowledge. This paper introduces DriftMoE, an online Mixture-of-Experts (MoE) architecture that addresses these limitations through a novel co-training framework. DriftMoE features a compact neural router that is co-trained alongside a pool of incremental Hoeffding tree experts. The key innovation lies in a symbiotic learning loop that enables expert specialization: the router selects the most suitable expert for prediction, the relevant experts update incrementally with the true label, and the router refines its parameters using a multi-hot correctness mask that reinforces every accurate expert. This feedback loop provides the router with a clear training signal while accelerating expert specialization. We evaluate DriftMoE's performance across nine state-of-the-art data stream learning benchmarks spanning abrupt, gradual, and real-world drifts testing two distinct configurations: one where experts specialize on data regimes (multi-class variant), and another where they focus on single-class specialization (task-based variant). Our results demonstrate that DriftMoE achieves competitive results with state-of-the-art stream learning adaptive ensembles, offering a principled and efficient approach to concept drift adaptation. All code, data pipelines, and reproducibility scripts are available in our public GitHub repository: https://github.com/miguel-ceadar/drift-moe.
中文: DriftMoE提出了一种在线混合专家架构,通过协同训练框架和共生学习循环实现专家专业化,在多种数据流基准测试中实现了具有竞争力的概念漂移适应性能。
English: DriftMoE introduces an online Mixture-of-Experts architecture with a co-training framework that enables expert specialization through a symbiotic learning loop, achieving competitive performance in concept drift adaptation across multiple data stream benchmarks.
Authors:Jiaming Zhou, Hongjie Chen, Shiwan Zhao, Jian Kang, Jie Li, Enzhi Wang, Yujie Guo, Haoqin Sun, Hui Wang, Aobo Kong, Yong Qin, Xuelong Li
Abstract:
Recent advances in large language models (LLMs) have shown remarkable capabilities across textual and multimodal domains. In parallel, diffusion-based language models have emerged as a promising alternative to the autoregressive paradigm, offering improved controllability, bidirectional context modeling, and robust generation. However, their application to the audio modality remains underexplored. In this work, we introduce \textbf{DIFFA}, the first diffusion-based large audio-language model designed to perform spoken language understanding. DIFFA integrates a frozen diffusion language model with a lightweight dual-adapter architecture that bridges speech understanding and natural language reasoning. We employ a two-stage training pipeline: first, aligning semantic representations via an ASR objective; then, learning instruction-following abilities through synthetic audio-caption pairs automatically generated by prompting LLMs. Despite being trained on only 960 hours of ASR and 127 hours of synthetic instruction data, DIFFA demonstrates competitive performance on major benchmarks, including MMSU, MMAU, and VoiceBench, outperforming several autoregressive open-source baselines. Our results reveal the potential of diffusion-based language models for efficient and scalable audio understanding, opening a new direction for speech-driven AI. Our code will be available at https://github.com/NKU-HLT/DIFFA.git.
中文:DIFFA是首个基于扩散的大型音频语言模型,通过双适配器架构和两阶段训练,在音频理解基准测试中展现出竞争力,为语音驱动AI开辟了新方向。
English: DIFFA is the first diffusion-based large audio-language model that integrates a frozen diffusion model with a dual-adapter architecture, achieving competitive performance on audio understanding benchmarks through a two-stage training process.
Authors:Simin Huo, Ning Li
Abstract:
We introduce Iwin Transformer, a novel position-embedding-free hierarchical vision transformer, which can be fine-tuned directly from low to high resolution, through the collaboration of innovative interleaved window attention and depthwise separable convolution. This approach uses attention to connect distant tokens and applies convolution to link neighboring tokens, enabling global information exchange within a single module, overcoming Swin Transformer's limitation of requiring two consecutive blocks to approximate global attention. Extensive experiments on visual benchmarks demonstrate that Iwin Transformer exhibits strong competitiveness in tasks such as image classification (87.4 top-1 accuracy on ImageNet-1K), semantic segmentation and video action recognition. We also validate the effectiveness of the core component in Iwin as a standalone module that can seamlessly replace the self-attention module in class-conditional image generation. The concepts and methods introduced by the Iwin Transformer have the potential to inspire future research, like Iwin 3D Attention in video generation. The code and models are available at https://github.com/cominder/Iwin-Transformer.
中文摘要:Iwin Transformer 是一种无需位置嵌入的分层视觉模型,通过交错窗口注意力与深度卷积的协同设计,在单一模块中实现全局信息交互,在图像分类、语义分割和视频识别等任务中展现出卓越性能。
English Summary: The Iwin Transformer is a hierarchical vision model that integrates interleaved window attention and depthwise convolution to achieve global information exchange in a single module, demonstrating strong performance across image classification, segmentation, and video tasks.
Authors:Yonghao Fu, Cheng Hu, Haokun Xiong, Zhanpeng Bao, Wenyuan Du, Edoardo Ghignone, Michele Magno, Lei Xie, Hongye Su
Abstract:
In vehicle trajectory tracking tasks, the simplest approach is the Pure Pursuit (PP) Control. However, this single-point preview tracking strategy fails to consider vehicle model constraints, compromising driving safety. Model Predictive Control (MPC) as a widely adopted control method, optimizes control actions by incorporating mechanistic models and physical constraints. While its control performance critically depends on the accuracy of vehicle modeling. Traditional vehicle modeling approaches face inherent trade-offs between capturing nonlinear dynamics and maintaining computational efficiency, often resulting in reduced control performance. To address these challenges, this paper proposes Residual Koopman Model Predictive Control (RKMPC) framework. This method uses two linear MPC architecture to calculate control inputs: a Linear Model Predictive Control (LMPC) computes the baseline control input based on the vehicle kinematic model, and a neural network-based RKMPC calculates the compensation input. The final control command is obtained by adding these two components. This design preserves the reliability and interpretability of traditional mechanistic model while achieving performance optimization through residual modeling. This method has been validated on the Carsim-Matlab joint simulation platform and a physical 1:10 scale F1TENTH racing car. Experimental results show that RKMPC requires only 20% of the training data needed by traditional Koopman Model Predictive Control (KMPC) while delivering superior tracking performance. Compared to traditional LMPC, RKMPC reduces lateral error by 11.7%-22.1%, decreases heading error by 8.9%-15.8%, and improves front-wheel steering stability by up to 27.6%. The implementation code is available at: https://github.com/ZJU-DDRX/Residual Koopman.
中文: 本文提出残差库普曼模型预测控制(RKMPC)框架,通过结合线性MPC和基于神经网络的补偿器,有效提升了车辆轨迹跟踪的精度和稳定性,并在仿真与实车测试中验证了其优越性能。
English: This paper introduces the Residual Koopman Model Predictive Control (RKMPC) framework, which combines a linear MPC with a neural network-based compensator to enhance vehicle trajectory tracking by reducing errors and improving stability, validated through simulations and physical tests.
Authors:Yilong Hu, Shijie Chang, Lihe Zhang, Feng Tian, Weibing Sun, Huchuan Lu
Abstract:
The Diffusion Probabilistic Model (DPM) has demonstrated remarkable performance across a variety of generative tasks. The inherent randomness in diffusion models helps address issues such as blurring at the edges of medical images and labels, positioning Diffusion Probabilistic Models (DPMs) as a promising approach for lesion segmentation. However, we find that the current training and inference strategies of diffusion models result in an uneven distribution of attention across different timesteps, leading to longer training times and suboptimal solutions. To this end, we propose UniSegDiff, a novel diffusion model framework designed to address lesion segmentation in a unified manner across multiple modalities and organs. This framework introduces a staged training and inference approach, dynamically adjusting the prediction targets at different stages, forcing the model to maintain high attention across all timesteps, and achieves unified lesion segmentation through pre-training the feature extraction network for segmentation. We evaluate performance on six different organs across various imaging modalities. Comprehensive experimental results demonstrate that UniSegDiff significantly outperforms previous state-of-the-art (SOTA) approaches. The code is available at https://github.com/HUYILONG-Z/UniSegDiff.
Chinese: 提出的UniSegDiff框架通过引入分阶段训练和推理方法,在所有时间步保持一致的注意力,显著提升了多模态和多器官的病灶分割性能,明显优于以往的最先进方法。
English: The proposed UniSegDiff framework enhances lesion segmentation across multiple modalities and organs by introducing a staged training and inference approach that maintains consistent attention across all timesteps, significantly outperforming previous state-of-the-art methods.
Authors:Yifu Chen, Bingchen Huang, Zhiling Wang, Yuanchao Du, Junfeng Luo, Lei Shen, Zhineng chen
Abstract:
In-context learning (ICL) has become a classic approach for enabling LLMs to handle various tasks based on a few input-output examples. The effectiveness of ICL heavily relies on the quality of these examples, and previous works which focused on enhancing example retrieval capabilities have achieved impressive performances. However, two challenges remain in retrieving high-quality examples: (1) Difficulty in distinguishing cross-task data distributions, (2) Difficulty in making the fine-grained connection between retriever output and feedback from LLMs. In this paper, we propose a novel framework called TDR. TDR decouples the ICL examples from different tasks, which enables the retrieval module to retrieve examples specific to the target task within a multi-task dataset. Furthermore, TDR models fine-grained feedback from LLMs to supervise and guide the training of the retrieval module, which helps to retrieve high-quality examples. We conducted extensive experiments on a suite of 30 NLP tasks, the results demonstrate that TDR consistently improved results across all datasets and achieves state-of-the-art performance. Meanwhile, our approach is a plug-and-play method, which can be easily combined with various LLMs to improve example retrieval abilities for ICL. The code is available at https://github.com/Nnn-s/TDR.
中文:TDR框架通过解耦跨任务示例并利用大语言模型的细粒度反馈来提升高质量示例的检索能力,在30项自然语言处理任务中实现了最优性能。
English: The TDR framework enhances in-context learning by decoupling examples across tasks and using fine-grained LLM feedback to improve retrieval of high-quality examples, achieving state-of-the-art performance on 30 NLP tasks.
Authors:Runmin Zhang, Zhu Yu, Si-Yuan Cao, Lingyu Zhu, Guangyi Zhang, Xiaokai Bai, Hui-Liang Shen
Abstract:
This work presents SGCDet, a novel multi-view indoor 3D object detection framework based on adaptive 3D volume construction. Unlike previous approaches that restrict the receptive field of voxels to fixed locations on images, we introduce a geometry and context aware aggregation module to integrate geometric and contextual information within adaptive regions in each image and dynamically adjust the contributions from different views, enhancing the representation capability of voxel features. Furthermore, we propose a sparse volume construction strategy that adaptively identifies and selects voxels with high occupancy probabilities for feature refinement, minimizing redundant computation in free space. Benefiting from the above designs, our framework achieves effective and efficient volume construction in an adaptive way. Better still, our network can be supervised using only 3D bounding boxes, eliminating the dependence on ground-truth scene geometry. Experimental results demonstrate that SGCDet achieves state-of-the-art performance on the ScanNet, ScanNet200 and ARKitScenes datasets. The source code is available at https://github.com/RM-Zhang/SGCDet.
中文: SGCDet提出了一种自适应三维体积构建框架,通过几何感知聚合和稀疏体素选择,仅需三维边界框监督即可实现最先进的室内三维物体检测。
English: SGCDet introduces an adaptive 3D volume construction framework with geometry-aware aggregation and sparse voxel selection, achieving state-of-the-art indoor 3D object detection using only 3D bounding box supervision.
Authors:Jiangjun Peng, Yisi Luo, Xiangyong Cao, Shuang Xu, Deyu Meng
Abstract:
The nuclear norm (NN) has been widely explored in matrix recovery problems, such as Robust PCA and matrix completion, leveraging the inherent global low-rank structure of the data. In this study, we introduce a new modified nuclear norm (MNN) framework, where the MNN family norms are defined by adopting suitable transformations and performing the NN on the transformed matrix. The MNN framework offers two main advantages: (1) it jointly captures both local information and global low-rankness without requiring trade-off parameter tuning; (2) Under mild assumptions on the transformation, we provided exact theoretical recovery guarantees for both Robust PCA and MC tasks-an achievement not shared by existing methods that combine local and global information. Thanks to its general and flexible design, MNN can accommodate various proven transformations, enabling a unified and effective approach to structured low-rank recovery. Extensive experiments demonstrate the effectiveness of our method. Code and supplementary material are available at https://github.com/andrew-pengjj/modified_nuclear_norm.
中文: 本研究提出了一种改进的核范数框架,无需参数调优即可同时捕捉局部信息和全局低秩结构,为鲁棒主成分分析和矩阵补全任务提供了理论保证,并通过实验验证了其有效性。
English: This study introduces a modified nuclear norm framework that captures both local and global low-rank structures without parameter tuning, providing theoretical guarantees and demonstrating effectiveness across various transformations and applications.
Authors:Minje Park, Jeonghwa Lim, Taehyung Yu, Sunghoon Joo
Abstract:
Electrocardiogram (ECG) delineation, the segmentation of meaningful waveform features, is critical for clinical diagnosis. Despite recent advances using deep learning, progress has been limited by the scarcity of publicly available annotated datasets. Semi-supervised learning presents a promising solution by leveraging abundant unlabeled ECG data. In this study, we present SemiSegECG, the first systematic benchmark for semi-supervised semantic segmentation (SemiSeg) in ECG delineation. We curated and unified multiple public datasets, including previously underused sources, to support robust and diverse evaluation. We adopted five representative SemiSeg algorithms from computer vision, implemented them on two different architectures: the convolutional network and the transformer, and evaluated them in two different settings: in-domain and cross-domain. Additionally, we propose ECG-specific training configurations and augmentation strategies and introduce a standardized evaluation framework. Our results show that the transformer outperforms the convolutional network in semi-supervised ECG delineation. We anticipate that SemiSegECG will serve as a foundation for advancing semi-supervised ECG delineation methods and will facilitate further research in this domain.
中文:SemiSegECG首次建立了心电描记半监督语义分割的系统基准,通过整合多源数据和标准化评估框架,证明了基于Transformer的模型在半监督心电波形分割中优于卷积网络。
English: SemiSegECG introduces the first systematic benchmark for semi-supervised semantic segmentation in ECG delineation, demonstrating transformer-based models' superiority over convolutional networks while providing unified datasets and evaluation frameworks.
Authors:Biao Yi, Zekun Fei, Jianing Geng, Tong Li, Lihai Nie, Zheli Liu, Yiming Li
Abstract:
Large reasoning models (LRMs) have emerged as a significant advancement in artificial intelligence, representing a specialized class of large language models (LLMs) designed to tackle complex reasoning tasks. The defining characteristic of LRMs lies in their extensive chain-of-thought (CoT) reasoning capabilities. In this paper, we identify a previously unexplored attack vector against LRMs, which we term "overthinking backdoors". We advance this concept by proposing a novel tunable backdoor, which moves beyond simple on/off attacks to one where an attacker can precisely control the extent of the model's reasoning verbosity. Our attack is implemented through a novel data poisoning methodology. It pairs a tunable trigger-where the number of repetitions signals the desired intensity-with a correspondingly verbose CoT response. These responses are programmatically generated by instructing a teacher LLM to inject a controlled number of redundant refinement steps into a correct reasoning process. The approach preserves output correctness, which ensures stealth and establishes the attack as a pure resource-consumption vector. Extensive empirical results on various LRMs demonstrate that our method can reliably trigger a controllable, multi-fold increase in the length of the reasoning process, without degrading the final answer's correctness. Our source code is available at https://github.com/FZaKK/BadReasoner.
中文摘要:本文提出一种新型“过度思考后门”攻击,通过数据投毒使大型推理模型在保持正确输出的同时产生过度冗长的思维链推理,攻击者可通过可调触发器精确控制推理长度。
English Summary: This paper introduces a novel "overthinking backdoor" attack that uses data poisoning to make large reasoning models produce excessively verbose chain-of-thought responses while maintaining correct outputs, allowing attackers to precisely control reasoning length through a tunable trigger mechanism.
Authors:Jincheng Li, Chunyu Xie, Ji Ao, Dawei Leng, Yuhui Yin
Abstract:
Large multimodal models (LMMs) have garnered wide-spread attention and interest within the artificial intelligence research and industrial communities, owing to their remarkable capability in multimodal understanding, reasoning, and in-context learning, among others. While LMMs have demonstrated promising results in tackling multimodal tasks like image captioning, visual question answering, and visual grounding, the object detection capabilities of LMMs exhibit a significant gap compared to specialist detectors. To bridge the gap, we depart from the conventional methods of integrating heavy detectors with LMMs and propose LMM-Det, a simple yet effective approach that leverages a Large Multimodal Model for vanilla object Detection without relying on specialized detection modules. Specifically, we conduct a comprehensive exploratory analysis when a large multimodal model meets with object detection, revealing that the recall rate degrades significantly compared with specialist detection models. To mitigate this, we propose to increase the recall rate by introducing data distribution adjustment and inference optimization tailored for object detection. We re-organize the instruction conversations to enhance the object detection capabilities of large multimodal models. We claim that a large multimodal model possesses detection capability without any extra detection modules. Extensive experiments support our claim and show the effectiveness of the versatile LMM-Det. The datasets, models, and codes are available at https://github.com/360CVGroup/LMM-Det.
中文: 本文提出LMM-Det方法,通过数据分布调整和推理优化提升召回率,使大型多模态模型无需专用检测模块即可实现物体检测功能。
English: The paper introduces LMM-Det, a novel approach that enables large multimodal models to perform object detection without specialized modules by improving recall through data distribution adjustments and inference optimization.
Authors:Chenyu Su, Weiwei Shang, Chen Qian, Fei Zhang, Shuang Cong
Abstract:
Semantics-driven 3D spatial constraints align highlevel semantic representations with low-level action spaces, facilitating the unification of task understanding and execution in robotic manipulation. The synergistic reasoning of Multimodal Large Language Models (MLLMs) and Vision Foundation Models (VFMs) enables cross-modal 3D spatial constraint construction. Nevertheless, existing methods have three key limitations: (1) coarse semantic granularity in constraint modeling, (2) lack of real-time closed-loop planning, (3) compromised robustness in semantically diverse environments. To address these challenges, we propose ReSem3D, a unified manipulation framework for semantically diverse environments, leveraging the synergy between VFMs and MLLMs to achieve fine-grained visual grounding and dynamically constructs hierarchical 3D spatial constraints for real-time manipulation. Specifically, the framework is driven by hierarchical recursive reasoning in MLLMs, which interact with VFMs to automatically construct 3D spatial constraints from natural language instructions and RGB-D observations in two stages: part-level extraction and region-level refinement. Subsequently, these constraints are encoded as real-time optimization objectives in joint space, enabling reactive behavior to dynamic disturbances. Extensive simulation and real-world experiments are conducted in semantically rich household and sparse chemical lab environments. The results demonstrate that ReSem3D performs diverse manipulation tasks under zero-shot conditions, exhibiting strong adaptability and generalization. Code and videos are available at https://github.com/scy-v/ReSem3D and https://resem3d.github.io.
中文: ReSem3D框架通过多模态AI模型的协同作用,从自然语言指令构建精细化的3D空间约束,实现在多样化环境中的实时自适应机器人操作。
English: ReSem3D is a robotic manipulation framework that leverages multimodal AI models to create fine-grained 3D spatial constraints from natural language, enabling real-time adaptive task execution in diverse environments.
Authors:Chengchang Tian, Jianwei Ma, Yan Huang, Zhanye Chen, Honghao Wei, Hui Zhang, Wei Hong
Abstract:
Feature-level fusion shows promise in collaborative perception (CP) through balanced performance and communication bandwidth trade-off. However, its effectiveness critically relies on input feature quality. The acquisition of high-quality features faces domain gaps from hardware diversity and deployment conditions, alongside temporal misalignment from transmission delays. These challenges degrade feature quality with cumulative effects throughout the collaborative network. In this paper, we present the Domain-And-Time Alignment (DATA) network, designed to systematically align features while maximizing their semantic representations for fusion. Specifically, we propose a Consistency-preserving Domain Alignment Module (CDAM) that reduces domain gaps through proximal-region hierarchical downsampling and observability-constrained discriminator. We further propose a Progressive Temporal Alignment Module (PTAM) to handle transmission delays via multi-scale motion modeling and two-stage compensation. Building upon the aligned features, an Instance-focused Feature Aggregation Module (IFAM) is developed to enhance semantic representations. Extensive experiments demonstrate that DATA achieves state-of-the-art performance on three typical datasets, maintaining robustness with severe communication delays and pose errors. The code will be released at https://github.com/ChengchangTian/DATA.
Chinese: 领域与时间对齐(DATA)网络通过专用模块解决协作感知中的领域差异和时间错位问题,在保持对通信延迟和位姿误差鲁棒性的同时,实现了最先进的性能表现。
English: The Domain-And-Time Alignment (DATA) network addresses domain gaps and temporal misalignments in collaborative perception through specialized modules, achieving state-of-the-art performance with robust handling of communication delays and pose errors.
Authors:Shiyuan Li, Yixin Liu, Qingsong Wen, Chengqi Zhang, Shirui Pan
Abstract:
Multi-agent systems (MAS) based on large language models (LLMs) have emerged as a powerful solution for dealing with complex problems across diverse domains. The effectiveness of MAS is critically dependent on its collaboration topology, which has become a focal point for automated design research. However, existing approaches are fundamentally constrained by their reliance on a template graph modification paradigm with a predefined set of agents and hard-coded interaction structures, significantly limiting their adaptability to task-specific requirements. To address these limitations, we reframe MAS design as a conditional autoregressive graph generation task, where both the system composition and structure are designed jointly. We propose ARG-Designer, a novel autoregressive model that operationalizes this paradigm by constructing the collaboration graph from scratch. Conditioned on a natural language task query, ARG-Designer sequentially and dynamically determines the required number of agents, selects their appropriate roles from an extensible pool, and establishes the optimal communication links between them. This generative approach creates a customized topology in a flexible and extensible manner, precisely tailored to the unique demands of different tasks. Extensive experiments across six diverse benchmarks demonstrate that ARG-Designer not only achieves state-of-the-art performance but also enjoys significantly greater token efficiency and enhanced extensibility. The source code of ARG-Designer is available at https://github.com/Shiy-Li/ARG-Designer.
中文: ARG-Designer采用自回归图生成方法,根据任务需求动态构建多智能体系统,在多个基准测试中实现了卓越性能与高效扩展。
English: ARG-Designer introduces an autoregressive graph generation approach that dynamically constructs multi-agent systems from scratch based on task queries, achieving superior performance and efficiency across diverse benchmarks.
Authors:Minghao Fu, Guo-Hua Wang, Xiaohao Chen, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang
Abstract:
Recent advances in text-to-image synthesis largely benefit from sophisticated sampling strategies and classifier-free guidance (CFG) to ensure high-quality generation. However, CFG's reliance on two forward passes, especially when combined with intricate sampling algorithms, results in prohibitively high inference costs. To address this, we introduce TeEFusion (Text Embeddings Fusion), a novel and efficient distillation method that directly incorporates the guidance magnitude into the text embeddings and distills the teacher model's complex sampling strategy. By simply fusing conditional and unconditional text embeddings using linear operations, TeEFusion reconstructs the desired guidance without adding extra parameters, simultaneously enabling the student model to learn from the teacher's output produced via its sophisticated sampling approach. Extensive experiments on state-of-the-art models such as SD3 demonstrate that our method allows the student to closely mimic the teacher's performance with a far simpler and more efficient sampling strategy. Consequently, the student model achieves inference speeds up to 6$\times$ faster than the teacher model, while maintaining image quality at levels comparable to those obtained through the teacher's complex sampling approach. The code is publicly available at https://github.com/AIDC-AI/TeEFusion.
中文摘要:TeEFusion是一种高效的蒸馏方法,通过将引导融入文本嵌入并提炼复杂采样策略,使学生模型在保持与教师模型相当图像质量的同时,推理速度提升高达6倍。
English Summary: TeEFusion is an efficient distillation method that integrates guidance into text embeddings and distills complex sampling strategies, enabling student models to achieve up to 6× faster inference while maintaining image quality comparable to teacher models.
Authors:Shiny Choudhury, Michael Davidson, George Tynan
Abstract:
Nuclear reactors are often modeled as inflexible baseload generators with fixed downtimes and restrictive ramping constraints. In practice, however, a reactor's operational flexibility is closely tied to its fuel cycle and associated reactivity margin. A key physical constraint for power maneuverability is xenon poisoning, caused from the transient buildup of neutron-absorbing xenon following a power reduction. This transient can delay or prevent subsequent power ramp-up due to suppressed core reactivity. Additionally, if a reactor is shutdown during periods of low reactivity, restart times can vary significantly, leading to prolonged downtimes. This work introduces a physics-informed modeling framework that embeds fuel cycle dynamics within a unit commitment (UC) formulation. The framework tracks reactivity margin, dynamically enforces xenon induced constraints, and endogenously schedules refueling outages based on core conditions. By capturing intracycle reactivity evolution, the model enables operation dependent nuclear dispatch that reflects both techno-economic requirements and irreducible nuclear physics limits. Application to a representative reactor fleet shows that flexible operation can slow reactivity degradation and extend fuel cycles. Results further demonstrate that different operational modes substantially affect VRE utilization, curtailment, and nuclear fleet capacity factors. These findings highlight the importance of fuel cycle aware flexibility modeling for accurate reactor scheduling and integration of nuclear power into energy system models.
中文摘要:本研究提出了一种基于物理的核反应堆调度模型,通过将燃料循环动态与运行约束相结合,证明了灵活运行能够缓解氙中毒影响、延长燃料周期并提升可再生能源消纳能力。
English Summary: This study presents a physics-based model integrating fuel cycle dynamics into nuclear reactor scheduling, demonstrating how flexible operations can extend fuel cycles and improve renewable energy integration by addressing xenon poisoning constraints.
Authors:Jinhong He, Minglong Xue, Zhipu Liu, Mingliang Zhou, Aoxiang Ning, Palaiahnakote Shivakumara
Abstract:
Low-light image enhancement aims to improve the visibility of degraded images to better align with human visual perception. While diffusion-based methods have shown promising performance due to their strong generative capabilities. However, their unidirectional modelling of degradation often struggles to capture the complexity of real-world degradation patterns, leading to structural inconsistencies and pixel misalignments. To address these challenges, we propose a bidirectional diffusion optimization mechanism that jointly models the degradation processes of both low-light and normal-light images, enabling more precise degradation parameter matching and enhancing generation quality. Specifically, we perform bidirectional diffusion-from low-to-normal light and from normal-to-low light during training and introduce an adaptive feature interaction block (AFI) to refine feature representation. By leveraging the complementarity between these two paths, our approach imposes an implicit symmetry constraint on illumination attenuation and noise distribution, facilitating consistent degradation learning and improving the models ability to perceive illumination and detail degradation. Additionally, we design a reflection-aware correction module (RACM) to guide color restoration post-denoising and suppress overexposed regions, ensuring content consistency and generating high-quality images that align with human visual perception. Extensive experiments on multiple benchmark datasets demonstrate that our method outperforms state-of-the-art methods in both quantitative and qualitative evaluations while generalizing effectively to diverse degradation scenarios. Code at https://github.com/hejh8/BidDiff
中文摘要:本文提出了一种双向扩散优化方法,通过联合建模低光与正常光图像的退化过程,结合自适应特征交互和反射感知校正模块,有效提升图像增强的结构一致性与视觉质量。
English Summary: This paper introduces a bidirectional diffusion optimization method for low-light image enhancement that jointly models degradation processes in both low-light and normal-light images, incorporating adaptive feature interaction and reflection-aware correction to improve structural consistency and visual quality.
Authors:Binghua Li, Ziqing Chang, Tong Liang, Chao Li, Toshihisa Tanaka, Shigeki Aoki, Qibin Zhao, Zhe Sun
Abstract:
We address the challenge of parameter-efficient fine-tuning (PEFT) for three-dimensional (3D) U-Net-based denoising diffusion probabilistic models (DDPMs) in magnetic resonance imaging (MRI) image generation. Despite its practical significance, research on parameter-efficient representations of 3D convolution operations remains limited. To bridge this gap, we propose Tensor Volumetric Operator (TenVOO), a novel PEFT method specifically designed for fine-tuning DDPMs with 3D convolutional backbones. Leveraging tensor network modeling, TenVOO represents 3D convolution kernels with lower-dimensional tensors, effectively capturing complex spatial dependencies during fine-tuning with few parameters. We evaluate TenVOO on three downstream brain MRI datasets-ADNI, PPMI, and BraTS2021-by fine-tuning a DDPM pretrained on 59,830 T1-weighted brain MRI scans from the UK Biobank. Our results demonstrate that TenVOO achieves state-of-the-art performance in multi-scale structural similarity index measure (MS-SSIM), outperforming existing approaches in capturing spatial dependencies while requiring only 0.3% of the trainable parameters of the original model. Our code is available at: https://github.com/xiaovhua/tenvoo
中文: 我们提出TenVOO方法,针对三维U-Net扩散模型的参数高效微调,通过张量网络以少量参数表示卷积核,在脑部MRI数据集上实现了最优性能。
English: We propose TenVOO, a parameter-efficient fine-tuning method for 3D U-Net-based DDPMs in MRI generation that uses tensor networks to represent convolution kernels with fewer parameters while achieving superior performance on brain MRI datasets.
Authors:Qianyi He, Yuan Chang Leong
Abstract:
The Algonauts 2025 Challenge called on the community to develop encoding models that predict whole-brain fMRI responses to naturalistic multimodal movies. In this submission, we propose a sequence-to-sequence Transformer that autoregressively predicts fMRI activity from visual, auditory, and language inputs. Stimulus features were extracted using pretrained models including VideoMAE, HuBERT, Qwen, and BridgeTower. The decoder integrates information from prior brain states and current stimuli via dual cross-attention mechanisms that attend to both perceptual information extracted from the stimulus as well as narrative information provided by high-level summaries of the content. One core innovation of our approach is the use of sequences of multimodal context to predict sequences of brain activity, enabling the model to capture long-range temporal structure in both stimuli and neural responses. Another is the combination of a shared encoder with partial subject-specific decoder, which leverages common representational structure across subjects while accounting for individual variability. Our model achieves strong performance on both in-distribution and out-of-distribution data, demonstrating the effectiveness of temporally-aware, multimodal sequence modeling for brain activity prediction. The code is available at https://github.com/Angelneer926/Algonauts_challenge.
Chinese: 针对Algonauts 2025挑战赛,本研究提出一种序列到序列的Transformer模型,通过预训练模型提取视听语言特征,采用双重交叉注意力机制整合刺激信息与大脑状态,在分布内外数据上均实现了优异的脑活动预测性能。
English: The Algonauts 2025 Challenge submission introduces a sequence-to-sequence Transformer that autoregressively predicts whole-brain fMRI responses by integrating multimodal inputs—visual, auditory, and language—using pretrained models and dual cross-attention mechanisms, achieving robust performance on both in-distribution and out-of-distribution data.
Authors:Pascal Spiegler, Taha Koleilat, Arash Harirpoush, Corey S. Miller, Hassan Rivaz, Marta Kersten-Oertel, Yiming Xiao
Abstract:
Pancreatic cancer carries a poor prognosis and relies on endoscopic ultrasound (EUS) for targeted biopsy and radiotherapy. However, the speckle noise, low contrast, and unintuitive appearance of EUS make segmentation of pancreatic tumors with fully supervised deep learning (DL) models both error-prone and dependent on large, expert-curated annotation datasets. To address these challenges, we present TextSAM-EUS, a novel, lightweight, text-driven adaptation of the Segment Anything Model (SAM) that requires no manual geometric prompts at inference. Our approach leverages text prompt learning (context optimization) through the BiomedCLIP text encoder in conjunction with a LoRA-based adaptation of SAM's architecture to enable automatic pancreatic tumor segmentation in EUS, tuning only 0.86% of the total parameters. On the public Endoscopic Ultrasound Database of the Pancreas, TextSAM-EUS with automatic prompts attains 82.69% Dice and 85.28% normalized surface distance (NSD), and with manual geometric prompts reaches 83.10% Dice and 85.70% NSD, outperforming both existing state-of-the-art (SOTA) supervised DL models and foundation models (e.g., SAM and its variants). As the first attempt to incorporate prompt learning in SAM-based medical image segmentation, TextSAM-EUS offers a practical option for efficient and robust automatic EUS segmentation. Code is available at https://github.com/HealthX-Lab/TextSAM-EUS .
中文摘要:TextSAM-EUS是一种轻量级的文本驱动改进模型,无需手动几何提示即可实现内镜超声中胰腺肿瘤的自动分割,在仅调整极少量参数的情况下,其性能超越了现有最先进方法。
English Summary: TextSAM-EUS is a lightweight, text-adapted version of the Segment Anything Model that enables automatic pancreatic tumor segmentation in endoscopic ultrasound without manual prompts, achieving superior performance over existing methods while tuning only a minimal fraction of parameters.
Authors:Yuqing Shen, Yuanyuan Shi, Daniel Kirschen, Yize Chen
Abstract:
Power systems decarbonization are at the focal point of the clean energy transition. While system operators and utility companies increasingly publicize system-level carbon emission information, it remains unclear how emissions from individual generators are transported through the grid and how they impact electricity users at specific locations. This paper presents a novel and computationally efficient approach for exact quantification of nodal average and marginal carbon emission rates, applicable to both AC and DC optimal power flow problems. The approach leverages graph-based topological sorting and directed cycle removal techniques, applied to directed graphs formed by generation dispatch and optimal power flow solutions. Our proposed algorithm efficiently identifies each generator's contribution to each node, capturing how emissions are spatially distributed under varying system conditions. To validate its effectiveness and reveal locational and temporal emission patterns in the real world, we simulate the 8,870-bus realistic California grid using actual CAISO data and the CATS model. Based on year long hourly data on nodal loads and renewable generation, obtained or estimated from CAISO public data, our method accurately estimates power flow conditions, generation mixes, and systemwide emissions, and delivers fine grained spatiotemporal emission analysis for every California county. Both our algorithm and the California study are open-sourced, providing a foundation for future research on grid emissions, planning, operations, and energy policy.
中文摘要:本文提出了一种计算高效的新方法,通过基于图的拓扑排序和定向循环消除技术,精确量化电网节点的平均与边际碳排放率,并通过对加州电网的实证模拟揭示了碳排放的时空分布规律。
English Summary: This paper introduces a computationally efficient method to precisely calculate average and marginal carbon emissions at grid nodes, using graph-based techniques to track how generator emissions spread through power networks, validated through a comprehensive California grid simulation.
Authors:Xiaoran Sun, Liyan Wang, Cong Wang, Yeying Jin, Kin-man Lam, Zhixun Su, Yang Yang, Jinshan Pan
Abstract:
Most existing low-light image enhancement (LLIE) methods rely on pre-trained model priors, low-light inputs, or both, while neglecting the semantic guidance available from normal-light images. This limitation hinders their effectiveness in complex lighting conditions. In this paper, we propose VLM-IMI, a novel framework that leverages large vision-language models (VLMs) with iterative and manual instructions (IMIs) for LLIE. VLM-IMI incorporates textual descriptions of the desired normal-light content as enhancement cues, enabling semantically informed restoration. To effectively integrate cross-modal priors, we introduce an instruction prior fusion module, which dynamically aligns and fuses image and text features, promoting the generation of detailed and semantically coherent outputs. During inference, we adopt an iterative and manual instruction strategy to refine textual instructions, progressively improving visual quality. This refinement enhances structural fidelity, semantic alignment, and the recovery of fine details under extremely low-light conditions. Extensive experiments across diverse scenarios demonstrate that VLM-IMI outperforms state-of-the-art methods in both quantitative metrics and perceptual quality. The source code is available at https://github.com/sunxiaoran01/VLM-IMI.
VLM-IMI introduces a novel low-light image enhancement framework that leverages vision-language models with iterative manual instructions, using textual descriptions of normal-light content to achieve semantically informed restoration that outperforms existing methods.
English Summary:
Authors:Yueheng Li, Guangming Xie, Zongqing Lu
Abstract:
Due to practical constraints such as partial observability and limited communication, Centralized Training with Decentralized Execution (CTDE) has become the dominant paradigm in cooperative Multi-Agent Reinforcement Learning (MARL). However, existing CTDE methods often underutilize centralized training or lack theoretical guarantees. We propose Multi-Agent Guided Policy Optimization (MAGPO), a novel framework that better leverages centralized training by integrating centralized guidance with decentralized execution. MAGPO uses an auto-regressive joint policy for scalable, coordinated exploration and explicitly aligns it with decentralized policies to ensure deployability under partial observability. We provide theoretical guarantees of monotonic policy improvement and empirically evaluate MAGPO on 43 tasks across 6 diverse environments. Results show that MAGPO consistently outperforms strong CTDE baselines and matches or surpasses fully centralized approaches, offering a principled and practical solution for decentralized multi-agent learning. Our code and experimental data can be found in https://github.com/liyheng/MAGPO.
中文摘要:本文提出MAGPO这一新型CTDE框架,通过将集中式指导与分散执行相结合,在保证理论性能提升的同时实现了更有效的多智能体协同学习,在多个测试环境中显著优于现有方法。
English Summary: The paper introduces MAGPO, a novel CTDE framework that enhances centralized training with guided policy optimization to ensure scalable coordination and theoretical guarantees, outperforming existing methods across diverse environments.
Authors:Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal
Abstract:
Inference-time steering methods offer a lightweight alternative to fine-tuning large language models (LLMs) and vision-language models (VLMs) by modifying internal activations at test time without updating model weights. However, most existing approaches rely on fixed, global intervention vectors, overlook the causal influence of individual input tokens, and fail to leverage informative gradients from the model's logits, particularly in multimodal settings where visual and textual inputs contribute unevenly. To address these limitations, we introduce GrAInS, an inference-time steering approach that operates across both language-only and vision-language models and tasks. GrAInS uses contrastive, gradient-based attribution via Integrated Gradients to identify the top-k most influential tokens, both positively and negatively attributed based on their contribution to preferred versus dispreferred outputs. These tokens are then used to construct directional steering vectors that capture semantic shifts from undesirable to desirable behavior. During inference, GrAInS adjusts hidden activations at transformer layers guided by token-level attribution signals, and normalizes activations to preserve representational scale. This enables fine-grained, interpretable, and modular control over model behavior, without retraining or auxiliary supervision. Empirically, GrAInS consistently outperforms both fine-tuning and existing steering baselines: it achieves a 13.22% accuracy gain on TruthfulQA using Llama-3.1-8B, reduces hallucination rates on MMHal-Bench from 0.624 to 0.514 with LLaVA-1.6-7B, and improves alignment win rates on SPA-VL by 8.11%, all while preserving the model's fluency and general capabilities.
中文: GrAInS提出了一种新颖的推理时引导方法,通过基于梯度的归因分析动态调整模型激活,无需重新训练即可在提升真实性、降低幻觉方面实现显著性能突破。
English: GrAInS introduces a novel inference-time steering method that uses gradient-based token attribution to dynamically adjust model activations, achieving significant performance improvements in truthfulness and reduced hallucinations without retraining.
Authors:Yuezun Li, Delong Zhu, Xinjie Cui, Siwei Lyu
Abstract:
The rapid advancement of AI technologies has significantly increased the diversity of DeepFake videos circulating online, posing a pressing challenge for \textit{generalizable forensics}, \ie, detecting a wide range of unseen DeepFake types using a single model. Addressing this challenge requires datasets that are not only large-scale but also rich in forgery diversity. However, most existing datasets, despite their scale, include only a limited variety of forgery types, making them insufficient for developing generalizable detection methods. Therefore, we build upon our earlier Celeb-DF dataset and introduce {Celeb-DF++}, a new large-scale and challenging video DeepFake benchmark dedicated to the generalizable forensics challenge. Celeb-DF++ covers three commonly encountered forgery scenarios: Face-swap (FS), Face-reenactment (FR), and Talking-face (TF). Each scenario contains a substantial number of high-quality forged videos, generated using a total of 22 various recent DeepFake methods. These methods differ in terms of architectures, generation pipelines, and targeted facial regions, covering the most prevalent DeepFake cases witnessed in the wild. We also introduce evaluation protocols for measuring the generalizability of 24 recent detection methods, highlighting the limitations of existing detection methods and the difficulty of our new dataset.
中文: Celeb-DF++数据集的推出通过涵盖三种常见伪造场景下的22种不同方法,为提升DeepFake检测的泛化能力提供了大规模多样化基准,同时揭示了现有检测方法的局限性。
English: The introduction of the Celeb-DF++ dataset addresses the need for a large-scale and diverse benchmark to improve generalizable DeepFake detection by including 22 different forgery methods across three common scenarios, revealing the limitations of current detection techniques.
Authors:Jaeho Shin, Hyeonjae Gil, Junwoo Jang, Maani Ghaffari, Ayoung Kim
Abstract:
Affine Grassmannian has been favored for expressing proximity between lines and planes due to its theoretical exactness in measuring distances among features. Despite this advantage, the existing method can only measure the proximity without yielding the distance as an explicit function of rigid body transformation. Thus, an optimizable distance function on the manifold has remained underdeveloped, stifling its application in registration problems. This paper is the first to explicitly derive an optimizable cost function between two Grassmannian features with respect to rigid body transformation ($\mathbf{R}$ and $\mathbf{t}$). Specifically, we present a rigorous mathematical proof demonstrating that the bases of high-dimensional linear subspaces can serve as an explicit representation of the cost. Finally, we propose an optimizable cost function based on the transformed bases that can be applied to the registration problem of any affine subspace. Compared to vector parameter-based approaches, our method is able to find a globally optimal solution by directly minimizing the geodesic distance which is agnostic to representation ambiguity. The resulting cost function and its extension to the inlier-set maximizing Branch-and-Bound (BnB) solver have been demonstrated to improve the convergence of existing solutions or outperform them in various computer vision tasks. The code is available on https://github.com/joomeok/GrassmannRegistration.
Chinese: 本文首次在仿射格拉斯曼流形上推导出关于刚体变换的显式可优化代价函数,通过直接最小化测地距离来解决配准问题,并在多种计算机视觉任务中展现出优越性能。
English: This paper introduces the first explicit and optimizable cost function on the affine Grassmannian manifold that directly measures geodesic distance with respect to rigid body transformations, enabling globally optimal solutions for registration problems in computer vision.
Authors:Huy Nguyen, Kien Nguyen, Akila Pemasiri, Akmal Jahan, Clinton Fookes, Sridha Sridharan
Abstract:
Person re-identification (Re-ID) across visible and infrared modalities is crucial for 24-hour surveillance systems, but existing datasets primarily focus on ground-level perspectives. While ground-based IR systems offer nighttime capabilities, they suffer from occlusions, limited coverage, and vulnerability to obstructions--problems that aerial perspectives uniquely solve. To address these limitations, we introduce AG-VPReID.VIR, the first aerial-ground cross-modality video-based person Re-ID dataset. This dataset captures 1,837 identities across 4,861 tracklets (124,855 frames) using both UAV-mounted and fixed CCTV cameras in RGB and infrared modalities. AG-VPReID.VIR presents unique challenges including cross-viewpoint variations, modality discrepancies, and temporal dynamics. Additionally, we propose TCC-VPReID, a novel three-stream architecture designed to address the joint challenges of cross-platform and cross-modality person Re-ID. Our approach bridges the domain gaps between aerial-ground perspectives and RGB-IR modalities, through style-robust feature learning, memory-based cross-view adaptation, and intermediary-guided temporal modeling. Experiments show that AG-VPReID.VIR presents distinctive challenges compared to existing datasets, with our TCC-VPReID framework achieving significant performance gains across multiple evaluation protocols. Dataset and code are available at https://github.com/agvpreid25/AG-VPReID.VIR.
中文: 本文提出了首个空地跨模态视频行人重识别数据集AG-VPReID.VIR,并开发了新型三流TCC-VPReID框架,通过鲁棒特征学习有效解决了跨平台和跨模态的挑战,实现了卓越性能。
English: The authors introduce AG-VPReID.VIR, the first aerial-ground cross-modality video dataset for person re-identification, along with a novel three-stream TCC-VPReID framework that effectively addresses cross-platform and cross-modality challenges through robust feature learning and achieves superior performance.
Authors:Peter Eckmann, Adrian Barnett, Alexandra Bannach-Brown, Elisa Pilar Bascunan Atria, Guillaume Cabanac, Louise Delwen Owen Franzen, MaÅgorzata Anna Gazda, Kaitlyn Hair, James Howison, Halil Kilicoglu, Cyril Labbe, Sarah McCann, Vladislav Nachev, Martijn Roelandse, Maia Salholz-Hillel, Robert Schulz, Gerben ter Riet, Colby Vorland, Anita Bandrowski, Tracey Weissgerber
Abstract:
The causes of the reproducibility crisis include lack of standardization and transparency in scientific reporting. Checklists such as ARRIVE and CONSORT seek to improve transparency, but they are not always followed by authors and peer review often fails to identify missing items. To address these issues, there are several automated tools that have been designed to check different rigor criteria. We have conducted a broad comparison of 11 automated tools across 9 different rigor criteria from the ScreenIT group. We found some criteria, including detecting open data, where the combination of tools showed a clear winner, a tool which performed much better than other tools. In other cases, including detection of inclusion and exclusion criteria, the combination of tools exceeded the performance of any one tool. We also identified key areas where tool developers should focus their effort to make their tool maximally useful. We conclude with a set of insights and recommendations for stakeholders in the development of rigor and transparency detection tools. The code and data for the study is available at https://github.com/PeterEckmann1/tool-comparison.
Chinese: 可重复性危机源于科学报告缺乏标准化和透明度,通过比较11种自动化工具评估严谨性标准发现,工具组合在检测纳入排除标准和开放数据等关键要素时表现更优,并提出了改进工具开发的建议。
English: The reproducibility crisis stems from insufficient standardization and transparency in scientific reporting, prompting the development of automated tools to assess rigor criteria, with a comparison of 11 tools revealing that combining them enhances performance in detecting key elements like inclusion criteria and open data.
Authors:Rui Deng, Ziqi Li, Mingshu Wang
Abstract:
Accurate modeling and explaining geospatial tabular data (GTD) are critical for understanding geospatial phenomena and their underlying processes. Recent work has proposed a novel transformer-based deep learning model named GeoAggregator (GA) for this purpose, and has demonstrated that it outperforms other statistical and machine learning approaches. In this short paper, we further improve GA by 1) developing an optimized pipeline that accelerates the dataloading process and streamlines the forward pass of GA to achieve better computational efficiency; and 2) incorporating a model ensembling strategy and a post-hoc model explanation function based on the GeoShapley framework to enhance model explainability. We validate the functionality and efficiency of the proposed strategies by applying the improved GA model to synthetic datasets. Experimental results show that our implementation improves the prediction accuracy and inference speed of GA compared to the original implementation. Moreover, explanation experiments indicate that GA can effectively captures the inherent spatial effects in the designed synthetic dataset. The complete pipeline has been made publicly available for community use (https://github.com/ruid7181/GA-sklearn).
Chinese: 本研究通过优化数据管道并引入集成策略和基于GeoShapley的解释功能,提升了GeoAggregator模型的预测精度、推理速度和可解释性,在合成数据集上验证了其有效性。
English: The study enhances the GeoAggregator model by optimizing its data pipeline and incorporating ensemble strategies with GeoShapley-based explanations, resulting in improved accuracy, speed, and interpretability on synthetic datasets.
Authors:Deepa Krishnaswamy, Cosmin Ciausu, Steve Pieper, Ron Kikinis, Benjamin Billot, Andrey Fedorov
Abstract:
Recent advances in deep learning have led to robust automated tools for segmentation of abdominal computed tomography (CT). Meanwhile, segmentation of magnetic resonance imaging (MRI) is substantially more challenging due to the inherent signal variability and the increased effort required for annotating training datasets. Hence, existing approaches are trained on limited sets of MRI sequences, which might limit their generalizability. To characterize the landscape of MRI abdominal segmentation tools, we present here a comprehensive benchmarking of the three state-of-the-art and open-source models: MRSegmentator, MRISegmentator-Abdomen, and TotalSegmentator MRI. Since these models are trained using labor-intensive manual annotation cycles, we also introduce and evaluate ABDSynth, a SynthSeg-based model purely trained on widely available CT segmentations (no real images). More generally, we assess accuracy and generalizability by leveraging three public datasets (not seen by any of the evaluated methods during their training), which span all major manufacturers, five MRI sequences, as well as a variety of subject conditions, voxel resolutions, and fields-of-view. Our results reveal that MRSegmentator achieves the best performance and is most generalizable. In contrast, ABDSynth yields slightly less accurate results, but its relaxed requirements in training data make it an alternative when the annotation budget is limited. The evaluation code and datasets are given for future benchmarking at https://github.com/deepakri201/AbdoBench, along with inference code and weights for ABDSynth.
中文: 深度学习进展已实现稳健的腹部CT分割,但MRI分割因信号变异性和标注数据有限而更具挑战性;基准测试显示MRSegmentator性能最优且泛化能力最强,而ABDSynth虽精度稍逊,却因降低标注需求成为实用替代方案。
English: Recent deep learning advances have enabled robust abdominal CT segmentation, but MRI segmentation remains challenging due to signal variability and limited annotated data, leading to a benchmarking study that found MRSegmentator as the most accurate and generalizable model while ABDSynth offers a practical alternative with reduced annotation needs.
Authors:Md. Al-Masrur Khan, Durgakant Pushp, Lantao Liu
Abstract:
In Unsupervised Domain Adaptive Semantic Segmentation (UDA-SS), a model is trained on labeled source domain data (e.g., synthetic images) and adapted to an unlabeled target domain (e.g., real-world images) without access to target annotations. Existing UDA-SS methods often struggle to balance fine-grained local details with global contextual information, leading to segmentation errors in complex regions. To address this, we introduce the Adaptive Feature Refinement (AFR) module, which enhances segmentation accuracy by refining highresolution features using semantic priors from low-resolution logits. AFR also integrates high-frequency components, which capture fine-grained structures and provide crucial boundary information, improving object delineation. Additionally, AFR adaptively balances local and global information through uncertaintydriven attention, reducing misclassifications. Its lightweight design allows seamless integration into HRDA-based UDA methods, leading to state-of-the-art segmentation performance. Our approach improves existing UDA-SS methods by 1.05% mIoU on GTA V --> Cityscapes and 1.04% mIoU on Synthia-->Cityscapes. The implementation of our framework is available at: https://github.com/Masrur02/AFRDA
中文: 自适应特征优化(AFR)模块通过语义先验和高频分量优化高分辨率特征,以不确定性驱动注意力平衡局部与全局信息,从而在无监督领域自适应语义分割中实现最先进的性能。
English: The Adaptive Feature Refinement (AFR) module enhances unsupervised domain adaptive semantic segmentation by refining high-resolution features with semantic priors and high-frequency components, achieving state-of-the-art performance through improved local-global information balance.
Authors:Russell O'Connor, Andrew Poelstra
Abstract:
The modular inverse is an essential piece of computation required for elliptic curve operations used for digital signatures in Bitcoin and other applications. A novel approach to the extended Euclidean algorithm has been developed by Bernstein and Yang within the last few years and incorporated into the libsecp256k1 cryptographic library used by Bitcoin. However, novel algorithms introduce new risks of errors. To address this we have completed a computer verified proof of the correctness of (one of) libsecp256k1's modular inverse implementations with the Coq proof assistant using the Verifiable C's implementation of separation logic.
中文: 比特币libsecp256k1密码库中的新型模逆算法已通过Coq证明助手完成形式化验证,确保了该创新算法的正确性并规避了潜在风险。
English: A novel modular inverse algorithm in Bitcoin's libsecp256k1 library has been formally verified for correctness using the Coq proof assistant, addressing potential risks from its innovative approach.
Authors:Charles H Martin, Christopher Hinrichs
Abstract:
We present a SemiEmpirical Theory of Learning (SETOL) that explains the remarkable performance of State-Of-The-Art (SOTA) Neural Networks (NNs). We provide a formal explanation of the origin of the fundamental quantities in the phenomenological theory of Heavy-Tailed Self-Regularization (HTSR): the heavy-tailed power-law layer quality metrics, alpha and alpha-hat. In prior work, these metrics have been shown to predict trends in the test accuracies of pretrained SOTA NN models, importantly, without needing access to either testing or training data. Our SETOL uses techniques from statistical mechanics as well as advanced methods from random matrix theory and quantum chemistry. The derivation suggests new mathematical preconditions for ideal learning, including a new metric, ERG, which is equivalent to applying a single step of the Wilson Exact Renormalization Group. We test the assumptions and predictions of SETOL on a simple 3-layer multilayer perceptron (MLP), demonstrating excellent agreement with the key theoretical assumptions. For SOTA NN models, we show how to estimate the individual layer qualities of a trained NN by simply computing the empirical spectral density (ESD) of the layer weight matrices and plugging this ESD into our SETOL formulas. Notably, we examine the performance of the HTSR alpha and the SETOL ERG layer quality metrics, and find that they align remarkably well, both on our MLP and on SOTA NNs.
中文摘要:本文提出的半经验学习理论(SETOL)解释了先进神经网络的卓越性能,并在简单和复杂模型上验证了其理论假设,与现有指标高度一致。
English Summary: The paper introduces a SemiEmpirical Theory of Learning (SETOL) that explains the performance of state-of-the-art neural networks and validates its assumptions on both simple and advanced models, showing strong alignment with existing metrics.
Authors:Dou Hoon Kwark, Shirui Luo, Xiyue Zhu, Yudu Li, Zhi-Pei Liang, Volodymyr Kindratenko
Abstract:
Pseudo-healthy image inpainting is an essential preprocessing step for analyzing pathological brain MRI scans. Most current inpainting methods favor slice-wise 2D models for their high in-plane fidelity, but their independence across slices produces discontinuities in the volume. Fully 3D models alleviate this issue, but their high model capacity demands extensive training data for reliable, high-fidelity synthesis -- often impractical in medical settings. We address these limitations with a hierarchical diffusion framework by replacing direct 3D modeling with two perpendicular coarse-to-fine 2D stages. An axial diffusion model first yields a coarse, globally consistent inpainting; a coronal diffusion model then refines anatomical details. By combining perpendicular spatial views with adaptive resampling, our method balances data efficiency and volumetric consistency. Our experiments show our approach outperforms state-of-the-art baselines in both realism and volumetric consistency, making it a promising solution for pseudo-healthy image inpainting. Code is available at https://github.com/dou0000/3dMRI-Consistent-Inpaint.
中文: 本文提出一种分层扩散框架,通过轴向和冠状扩散模型的组合,在保持数据效率的同时实现了优异的体积一致性和解剖细节还原,为伪健康脑部MRI修复提供了创新解决方案。
English: This paper introduces a hierarchical diffusion framework for pseudo-healthy brain MRI inpainting that combines axial and coronal diffusion models to achieve superior volumetric consistency and anatomical detail while maintaining data efficiency.
Authors:Semih Eren, Deniz Kucukahmetler, Nico Scherf
Abstract:
Accurately predicting distributed cortical responses to naturalistic stimuli requires models that integrate visual, auditory and semantic information over time. We present a hierarchical multimodal recurrent ensemble that maps pretrained video, audio, and language embeddings to fMRI time series recorded while four subjects watched almost 80 hours of movies provided by the Algonauts 2025 challenge. Modality-specific bidirectional RNNs encode temporal dynamics; their hidden states are fused and passed to a second recurrent layer, and lightweight subject-specific heads output responses for 1000 cortical parcels. Training relies on a composite MSE-correlation loss and a curriculum that gradually shifts emphasis from early sensory to late association regions. Averaging 100 model variants further boosts robustness. The resulting system ranked third on the competition leaderboard, achieving an overall Pearson r = 0.2094 and the highest single-parcel peak score (mean r = 0.63) among all participants, with particularly strong gains for the most challenging subject (Subject 5). The approach establishes a simple, extensible baseline for future multimodal brain-encoding benchmarks.
Chinese: 该分层多模态循环集成模型通过融合视频、音频和语言嵌入的时间动态,有效预测了自然刺激下的皮层响应,在Algonauts 2025挑战赛中表现优异,为未来多模态大脑编码基准建立了可扩展的坚实基础。
English: A hierarchical multimodal recurrent ensemble model effectively predicts cortical responses to naturalistic stimuli by integrating temporal video, audio, and language embeddings, achieving strong performance in the Algonauts 2025 challenge and setting a robust baseline for future brain-encoding benchmarks.
Authors:Semih Eren, Deniz Kucukahmetler, Nico Scherf
Abstract:
Accurately predicting distributed cortical responses to naturalistic stimuli requires models that integrate visual, auditory and semantic information over time. We present a hierarchical multimodal recurrent ensemble that maps pretrained video, audio, and language embeddings to fMRI time series recorded while four subjects watched almost 80 hours of movies provided by the Algonauts 2025 challenge. Modality-specific bidirectional RNNs encode temporal dynamics; their hidden states are fused and passed to a second recurrent layer, and lightweight subject-specific heads output responses for 1000 cortical parcels. Training relies on a composite MSE-correlation loss and a curriculum that gradually shifts emphasis from early sensory to late association regions. Averaging 100 model variants further boosts robustness. The resulting system ranked third on the competition leaderboard, achieving an overall Pearson r = 0.2094 and the highest single-parcel peak score (mean r = 0.63) among all participants, with particularly strong gains for the most challenging subject (Subject 5). The approach establishes a simple, extensible baseline for future multimodal brain-encoding benchmarks.
Chinese: 该分层多模态循环集成模型通过融合视频、音频和语言嵌入的时间动态,有效预测了自然刺激下的皮层响应,在Algonauts 2025挑战赛中表现优异,为未来多模态大脑编码基准建立了可扩展的坚实基础。
English: A hierarchical multimodal recurrent ensemble model effectively predicts cortical responses to naturalistic stimuli by integrating temporal video, audio, and language embeddings, achieving strong performance in the Algonauts 2025 challenge and setting a robust baseline for future brain-encoding benchmarks.
Authors:Yi Xin, Juncheng Yan, Qi Qin, Zhen Li, Dongyang Liu, Shicheng Li, Victor Shea-Jay Huang, Yupeng Zhou, Renrui Zhang, Le Zhuo, Tiancheng Han, Xiaoqing Sun, Siqi Luo, Mengmeng Wang, Bin Fu, Yuewen Cao, Hongsheng Li, Guangtao Zhai, Xiaohong Liu, Yu Qiao, Peng Gao
Abstract:
We present Lumina-mGPT 2.0, a stand-alone, decoder-only autoregressive model that revisits and revitalizes the autoregressive paradigm for high-quality image generation and beyond. Unlike existing approaches that rely on pretrained components or hybrid architectures, Lumina-mGPT 2.0 is trained entirely from scratch, enabling unrestricted architectural design and licensing freedom. It achieves generation quality on par with state-of-the-art diffusion models such as DALL-E 3 and SANA, while preserving the inherent flexibility and compositionality of autoregressive modeling. Our unified tokenization scheme allows the model to seamlessly handle a wide spectrum of tasks-including subject-driven generation, image editing, controllable synthesis, and dense prediction-within a single generative framework. To further boost usability, we incorporate efficient decoding strategies like inference-time scaling and speculative Jacobi sampling to improve quality and speed, respectively. Extensive evaluations on standard text-to-image benchmarks (e.g., GenEval, DPG) demonstrate that Lumina-mGPT 2.0 not only matches but in some cases surpasses diffusion-based models. Moreover, we confirm its multi-task capabilities on the Graph200K benchmark, with the native Lumina-mGPT 2.0 performing exceptionally well. These results position Lumina-mGPT 2.0 as a strong, flexible foundation model for unified multimodal generation. We have released our training details, code, and models at https://github.com/Alpha-VLLM/Lumina-mGPT-2.0.
中文: Lumina-mGPT 2.0 是一个独立的自回归模型,在图像生成质量上达到了与DALL-E 3相媲美的先进水平,同时通过统一框架实现了架构自由和多任务处理能力。
English: Lumina-mGPT 2.0 is a standalone autoregressive model that achieves state-of-the-art image generation quality comparable to DALL-E 3 while offering architectural freedom and multi-task capabilities within a unified framework.
Authors:Shiyuan Zhang, Tong Li, Zhu Xiao, Hongyang Du, Kaibin Huang
Abstract:
Service-level mobile traffic prediction for individual users is essential for network efficiency and quality of service enhancement. However, current prediction methods are limited in their adaptability across different urban environments and produce inaccurate results due to the high uncertainty in personal traffic patterns, the lack of detailed environmental context, and the complex dependencies among different network services. These challenges demand advanced modeling techniques that can capture dynamic traffic distributions and rich environmental features. Inspired by the recent success of diffusion models in distribution modeling and Large Language Models (LLMs) in contextual understanding, we propose an LLM-Enhanced Spatio-temporal Diffusion Model (LSDM). LSDM integrates the generative power of diffusion models with the adaptive learning capabilities of transformers, augmented by the ability to capture multimodal environmental information for modeling service-level patterns and dynamics. Extensive evaluations on real-world service-level datasets demonstrate that the model excels in traffic usage predictions, showing outstanding generalization and adaptability. After incorporating contextual information via LLM, the performance improves by at least 2.83% in terms of the coefficient of determination. Compared to models of a similar type, such as CSDI, the root mean squared error can be reduced by at least 8.29%. The code and dataset will be available at: https://github.com/SoftYuaneR/LSDM.
中文摘要:提出的LLM增强时空扩散模型(LSDM)通过结合扩散模型的生成能力与变换器的自适应学习,有效解决了移动流量预测中的适应性不足问题,并利用大语言模型提升上下文理解能力,显著提高了预测性能。
English Summary: The proposed LLM-Enhanced Spatio-temporal Diffusion Model (LSDM) effectively addresses mobile traffic prediction challenges by combining diffusion models' generative capabilities with transformers' adaptive learning, achieving significant performance improvements through enhanced contextual understanding.
Authors:Camille Challier, Xiaowu Sun, Thabo Mahendiran, Ortal Senouf, Bernard De Bruyne, Denise Auberson, Olivier Müller, Stephane Fournier, Pascal Frossard, Emmanuel Abbé, Dorina Thanou
Abstract:
Accurate segmentation of coronary arteries remains a significant challenge in clinical practice, hindering the ability to effectively diagnose and manage coronary artery disease. The lack of large, annotated datasets for model training exacerbates this issue, limiting the development of automated tools that could assist radiologists. To address this, we introduce CM-UNet, which leverages self-supervised pre-training on unannotated datasets and transfer learning on limited annotated data, enabling accurate disease detection while minimizing the need for extensive manual annotations. Fine-tuning CM-UNet with only 18 annotated images instead of 500 resulted in a 15.2% decrease in Dice score, compared to a 46.5% drop in baseline models without pre-training. This demonstrates that self-supervised learning can enhance segmentation performance and reduce dependence on large datasets. This is one of the first studies to highlight the importance of self-supervised learning in improving coronary artery segmentation from X-ray angiography, with potential implications for advancing diagnostic accuracy in clinical practice. By enhancing segmentation accuracy in X-ray angiography images, the proposed approach aims to improve clinical workflows, reduce radiologists' workload, and accelerate disease detection, ultimately contributing to better patient outcomes. The source code is publicly available at https://github.com/CamilleChallier/Contrastive-Masked-UNet.
中文:CM-UNet通过自监督学习,仅需少量标注数据即可显著提升冠状动脉分割精度,有效降低对大型数据集的依赖,并提高临床诊断效率。
English: CM-UNet utilizes self-supervised learning to significantly improve coronary artery segmentation accuracy with minimal annotated data, reducing reliance on large datasets and enhancing clinical diagnostic efficiency.
Authors:Zhongzhen Wen, Yinghui Zhang, Zhong Li, Zhongxin Liu, Linna Xie, Tian Zhang
Abstract:
The automatic generation of deep learning (DL) kernels using large language models (LLMs) has emerged as a promising approach to reduce the manual effort and hardware-specific expertise required for writing high-performance operator implementations. However, existing benchmarks for evaluating LLMs in this domain suffer from limited hardware support, coarse-grained kernel categorization, and imbalanced task coverage. To address these limitations, we introduce MultiKernelBench, the first comprehensive, multi-platform benchmark for LLM-based DL kernel generation. MultiKernelBench spans 285 tasks across 14 well-defined kernel categories and supports three major hardware platforms: Nvidia GPUs, Huawei NPUs, and Google TPUs. To enable future extensibility, we design a modular backend abstraction layer that decouples platform-specific logic from the core benchmarking infrastructure, allowing easy integration of new hardware platforms. We further propose a simple yet effective category-aware one-shot prompting method that improves generation quality by providing in-category exemplars. Through systematic evaluations of seven state-of-the-art LLMs, we reveal significant variation in task difficulty, poor generalization to platforms with less training exposure, and the effectiveness of targeted prompting strategies. MultiKernelBench is publicly available at https://github.com/wzzll123/MultiKernelBench.
中文摘要:MultiKernelBench作为首个全面的多平台基准测试,用于评估基于大语言模型的深度学习内核生成,通过支持三大硬件平台和改进的类别感知提示方法,解决了现有基准测试的局限性并提升了生成质量。
English Summary: MultiKernelBench is introduced as the first comprehensive, multi-platform benchmark for evaluating LLM-based deep learning kernel generation, addressing limitations in existing benchmarks by supporting three major hardware platforms and proposing a category-aware prompting method that improves generation quality.
Authors:Weixin Chen, Yuhan Zhao, Li Chen, Weike Pan
Abstract:
Cross-domain recommendation (CDR) methods predominantly leverage overlapping users to transfer knowledge from a source domain to a target domain. However, through empirical studies, we uncover a critical bias inherent in these approaches: while overlapping users experience significant enhancements in recommendation quality, non-overlapping users benefit minimally and even face performance degradation. This unfairness may erode user trust, and, consequently, negatively impact business engagement and revenue. To address this issue, we propose a novel solution that generates virtual source-domain users for non-overlapping target-domain users. Our method utilizes a dual attention mechanism to discern similarities between overlapping and non-overlapping users, thereby synthesizing realistic virtual user embeddings. We further introduce a limiter component that ensures the generated virtual users align with real-data distributions while preserving each user's unique characteristics. Notably, our method is model-agnostic and can be seamlessly integrated into any CDR model. Comprehensive experiments conducted on three public datasets with five CDR baselines demonstrate that our method effectively mitigates the CDR non-overlapping user bias, without loss of overall accuracy. Our code is publicly available at https://github.com/WeixinChen98/VUG.
中文: 该研究揭示了跨领域推荐系统中非重叠用户获益甚微的偏见,并提出了一种模型无关的虚拟用户生成方法,在保持整体准确性的同时有效缓解了这一偏见问题。
English: The study identifies a bias in cross-domain recommendation systems where non-overlapping users receive minimal benefits and proposes a model-agnostic solution using virtual user generation to address this issue while maintaining overall accuracy.
Authors:Xiaofeng Mao, Shaoheng Lin, Zhen Li, Chuanhao Li, Wenshuo Peng, Tong He, Jiangmiao Pang, Mingmin Chi, Yu Qiao, Kaipeng Zhang
Abstract:
Yume aims to use images, text, or videos to create an interactive, realistic, and dynamic world, which allows exploration and control using peripheral devices or neural signals. In this report, we present a preview version of \method, which creates a dynamic world from an input image and allows exploration of the world using keyboard actions. To achieve this high-fidelity and interactive video world generation, we introduce a well-designed framework, which consists of four main components, including camera motion quantization, video generation architecture, advanced sampler, and model acceleration. First, we quantize camera motions for stable training and user-friendly interaction using keyboard inputs. Then, we introduce the Masked Video Diffusion Transformer~(MVDT) with a memory module for infinite video generation in an autoregressive manner. After that, training-free Anti-Artifact Mechanism (AAM) and Time Travel Sampling based on Stochastic Differential Equations (TTS-SDE) are introduced to the sampler for better visual quality and more precise control. Moreover, we investigate model acceleration by synergistic optimization of adversarial distillation and caching mechanisms. We use the high-quality world exploration dataset \sekai to train \method, and it achieves remarkable results in diverse scenes and applications. All data, codebase, and model weights are available on https://github.com/stdstu12/YUME. Yume will update monthly to achieve its original goal. Project page: https://stdstu12.github.io/YUME-Project/.
中文: Yume是一个通过图像生成动态虚拟世界的交互系统,其框架融合了相机运动量化、视频扩散变换器和抗伪影机制,支持用户通过键盘操作探索虚拟环境。
English: Yume is an interactive system that generates dynamic virtual worlds from images, enabling exploration via keyboard controls through a framework integrating camera motion quantization, video diffusion transformers, and artifact reduction techniques.
Authors:Zihao Li, Zhichen Zeng, Xiao Lin, Feihao Fang, Yanru Qu, Zhe Xu, Zhining Liu, Xuying Ning, Tianxin Wei, Ge Liu, Hanghang Tong, Jingrui He
Abstract:
Over the past decade, advances in generative modeling, such as generative adversarial networks, masked autoencoders, and diffusion models, have significantly transformed biological research and discovery, enabling breakthroughs in molecule design, protein generation, drug discovery, and beyond. At the same time, biological applications have served as valuable testbeds for evaluating the capabilities of generative models. Recently, flow matching has emerged as a powerful and efficient alternative to diffusion-based generative modeling, with growing interest in its application to problems in biology and life sciences. This paper presents the first comprehensive survey of recent developments in flow matching and its applications in biological domains. We begin by systematically reviewing the foundations and variants of flow matching, and then categorize its applications into three major areas: biological sequence modeling, molecule generation and design, and peptide and protein generation. For each, we provide an in-depth review of recent progress. We also summarize commonly used datasets and software tools, and conclude with a discussion of potential future directions. The corresponding curated resources are available at https://github.com/Violet24K/Awesome-Flow-Matching-Meets-Biology.
中文摘要:本文首次系统综述了新兴的流匹配生成模型技术,涵盖其理论基础及其在生物序列建模、分子设计和蛋白质生成三大领域的应用进展。
English Summary: This paper provides the first comprehensive survey of flow matching, an emerging generative modeling technique, detailing its foundations and applications across biological sequence modeling, molecule design, and protein generation.
Authors:Jialiang Wang, Xianming Liu, Xiong Zhou, Gangfeng Hu, Deming Zhai, Junjun Jiang, Xiangyang Ji
Abstract:
Learning with noisy labels is a crucial task for training accurate deep neural networks. To mitigate label noise, prior studies have proposed various robust loss functions, particularly symmetric losses. Nevertheless, symmetric losses usually suffer from the underfitting issue due to the overly strict constraint. To address this problem, the Active Passive Loss (APL) jointly optimizes an active and a passive loss to mutually enhance the overall fitting ability. Within APL, symmetric losses have been successfully extended, yielding advanced robust loss functions. Despite these advancements, emerging theoretical analyses indicate that asymmetric losses, a new class of robust loss functions, possess superior properties compared to symmetric losses. However, existing asymmetric losses are not compatible with advanced optimization frameworks such as APL, limiting their potential and applicability. Motivated by this theoretical gap and the prospect of asymmetric losses, we extend the asymmetric loss to the more complex passive loss scenario and propose the Asymetric Mean Square Error (AMSE), a novel asymmetric loss. We rigorously establish the necessary and sufficient condition under which AMSE satisfies the asymmetric condition. By substituting the traditional symmetric passive loss in APL with our proposed AMSE, we introduce a novel robust loss framework termed Joint Asymmetric Loss (JAL). Extensive experiments demonstrate the effectiveness of our method in mitigating label noise. Code available at: https://github.com/cswjl/joint-asymmetric-loss
中文: 本文提出了一种新颖的非对称均方误差(AMSE)作为稳健损失函数,以解决对称损失在处理噪声标签时的不足,并将其融入主动被动损失框架中形成联合非对称损失(JAL),通过大量实验验证了其有效性。
English: This paper introduces the Asymmetric Mean Square Error (AMSE) as a novel robust loss function to address the limitations of symmetric losses in handling noisy labels, integrating it into the Active Passive Loss framework to form the Joint Asymmetric Loss (JAL), which demonstrates effectiveness through extensive experiments.
Authors:Daiqi Liu, Tomás Arias-Vergara, Jana Hutter, Andreas Maier, Paula Andrea Pérez-Toro
Abstract:
Accurate classification of articulatory-phonological features plays a vital role in understanding human speech production and developing robust speech technologies, particularly in clinical contexts where targeted phonemic analysis and therapy can improve disease diagnosis accuracy and personalized rehabilitation. In this work, we propose a multimodal deep learning framework that combines real-time magnetic resonance imaging (rtMRI) and speech signals to classify three key articulatory dimensions: manner of articulation, place of articulation, and voicing. We perform classification on 15 phonological classes derived from the aforementioned articulatory dimensions and evaluate the system with four audio/vision configurations: unimodal rtMRI, unimodal audio signals, multimodal middle fusion, and contrastive learning-based audio-vision fusion. Experimental results on the USC-TIMIT dataset show that our contrastive learning-based approach achieves state-of-the-art performance, with an average F1-score of 0.81, representing an absolute increase of 0.23 over the unimodal baseline. The results confirm the effectiveness of contrastive representation learning for multimodal articulatory analysis. Our code and processed dataset will be made publicly available at https://github.com/DaE-plz/AC_Contrastive_Phonology to support future research.
中文: 本研究提出了一种结合实时磁共振成像和语音信号的多模态深度学习框架,通过对比学习方法对发音特征进行分类,以0.81的F1分数实现了最先进的性能。
English: This study introduces a multimodal deep learning framework integrating real-time MRI and speech signals to classify articulatory features, achieving state-of-the-art performance with a 0.81 F1-score through contrastive learning.
Authors:Jiahui Yin, Xinxing Cheng, Jinming Duan, Yan Pang, Declan O'Regan, Hadrien Reynaud, Qingjie Meng
Abstract:
Myocardial motion tracking is important for assessing cardiac function and diagnosing cardiovascular diseases, for which cine cardiac magnetic resonance (CMR) has been established as the gold standard imaging modality. Many existing methods learn motion from single image pairs consisting of a reference frame and a randomly selected target frame from the cardiac cycle. However, these methods overlook the continuous nature of cardiac motion and often yield inconsistent and non-smooth motion estimations. In this work, we propose a novel Mamba-based cardiac motion tracking network (MCM) that explicitly incorporates target image sequence from the cardiac cycle to achieve smooth and temporally consistent motion tracking. By developing a bi-directional Mamba block equipped with a bi-directional scanning mechanism, our method facilitates the estimation of plausible deformation fields. With our proposed motion decoder that integrates motion information from frames adjacent to the target frame, our method further enhances temporal coherence. Moreover, by taking advantage of Mamba's structured state-space formulation, the proposed method learns the continuous dynamics of the myocardium from sequential images without increasing computational complexity. We evaluate the proposed method on two public datasets. The experimental results demonstrate that the proposed method quantitatively and qualitatively outperforms both conventional and state-of-the-art learning-based cardiac motion tracking methods. The code is available at https://github.com/yjh-0104/MCM.
中文: 本研究提出的基于Mamba的心脏运动追踪网络(MCM)通过整合连续心脏图像序列和双向扫描机制,实现了平滑且时序一致的运动追踪,在精度和连贯性上均优于现有方法。
English: The proposed Mamba-based cardiac motion tracking network (MCM) utilizes sequential cardiac images and a bi-directional scanning mechanism to achieve smooth, temporally consistent motion estimation, outperforming existing methods in accuracy and coherence.
Authors:Xuzhi Wang, Xinran Wu, Song Wang, Lingdong Kong, Ziping Zhao
Abstract:
Monocular Semantic Scene Completion (MSSC) aims to predict the voxel-wise occupancy and semantic category from a single-view RGB image. Existing methods adopt a single-stage framework that aims to simultaneously achieve visible region segmentation and occluded region hallucination, while also being affected by inaccurate depth estimation. Such methods often achieve suboptimal performance, especially in complex scenes. We propose a novel two-stage framework that decomposes MSSC into coarse MSSC followed by the Masked Recurrent Network. Specifically, we propose the Masked Sparse Gated Recurrent Unit (MS-GRU) which concentrates on the occupied regions by the proposed mask updating mechanism, and a sparse GRU design is proposed to reduce the computation cost. Additionally, we propose the distance attention projection to reduce projection errors by assigning different attention scores according to the distance to the observed surface. Experimental results demonstrate that our proposed unified framework, MonoMRN, effectively supports both indoor and outdoor scenes and achieves state-of-the-art performance on the NYUv2 and SemanticKITTI datasets. Furthermore, we conduct robustness analysis under various disturbances, highlighting the role of the Masked Recurrent Network in enhancing the model's resilience to such challenges. The source code is publicly available.
中文: 提出的两阶段MonoMRN框架通过掩膜循环网络和距离注意力投影技术,在单目语义场景补全任务中实现了最优性能,能有效处理室内外场景并显著提升模型抗干扰能力。
English: The proposed two-stage MonoMRN framework, featuring a Masked Recurrent Network with distance attention projection, achieves state-of-the-art monocular semantic scene completion by efficiently handling both indoor and outdoor scenes while improving robustness against disturbances.
Authors:Yang Li, Zongzheng Zhang, Xuchong Qiu, Xinrun Li, Ziming Liu, Leichen Wang, Ruikai Li, Zhenxin Zhu, Huan-ang Gao, Xiaojian Lin, Zhiyong Cui, Hang Zhao, Hao Zhao
Abstract:
Understanding lane toplogy relationships accurately is critical for safe autonomous driving. However, existing two-stage methods suffer from inefficiencies due to error propagations and increased computational overheads. To address these challenges, we propose a one-stage architecture that simultaneously predicts traffic elements, lane centerlines and topology relationship, improving both the accuracy and inference speed of lane topology understanding for autonomous driving. Our key innovation lies in reusing intermediate attention resources within distinct transformer decoders. This approach effectively leverages the inherent relational knowledge within the element detection module to enable the modeling of topology relationships among traffic elements and lanes without requiring additional computationally expensive graph networks. Furthermore, we are the first to demonstrate that knowledge can be distilled from models that utilize standard definition (SD) maps to those operates without using SD maps, enabling superior performance even in the absence of SD maps. Extensive experiments on the OpenLane-V2 dataset show that our approach outperforms baseline methods in both accuracy and efficiency, achieving superior results in lane detection, traffic element identification, and topology reasoning. Our code is available at https://github.com/Yang-Li-2000/one-stage.git.
中文: 本文提出了一种单阶段架构,通过共享Transformer解码器的注意力资源同时预测交通要素、车道中心线和拓扑关系,不仅提高了准确性和推理速度,还无需昂贵图网络即可建模拓扑关系,并首次实现了从标准地图模型到无地图模型的知识蒸馏。
English: This paper introduces a one-stage architecture that simultaneously predicts traffic elements, lane centerlines, and topology relationships using shared transformer decoder attention, improving accuracy and speed while eliminating the need for costly graph networks and enabling knowledge transfer from SD map models.
Authors:Xinyao Liu, Diping Song
Abstract:
Multimodal large language models (MLLMs) demonstrate significant potential in the field of medical diagnosis. However, they face critical challenges in specialized domains such as ophthalmology, particularly the fragmentation of annotation granularity and inconsistencies in clinical reasoning logic, which hinder precise cross-modal understanding. This paper introduces FundusExpert, an ophthalmology-specific MLLM with integrated positioning-diagnosis reasoning capabilities, along with FundusGen, a dataset constructed through the intelligent Fundus-Engine system. Fundus-Engine automates localization and leverages MLLM-based semantic expansion to integrate global disease classification, local object detection, and fine-grained feature analysis within a single fundus image. Additionally, by constructing a clinically aligned cognitive chain, it guides the model to generate interpretable reasoning paths. FundusExpert, fine-tuned with instruction data from FundusGen, achieves the best performance in ophthalmic question-answering tasks, surpassing the average accuracy of the 40B MedRegA by 26.6%. It also excels in zero-shot report generation tasks, achieving a clinical consistency of 77.0%, significantly outperforming GPT-4o's 47.6%. Furthermore, we reveal a scaling law between data quality and model capability ($L \propto N^{0.068}$), demonstrating that the cognitive alignment annotations in FundusGen enhance data utilization efficiency. By integrating region-level localization with diagnostic reasoning chains, our work develops a scalable, clinically-aligned MLLM and explores a pathway toward bridging the visual-language gap in specific MLLMs. Our project can be found at https://github.com/MeteorElf/FundusExpert.
中文: FundusExpert是一种专为眼科设计的先进多模态大语言模型,通过整合定位与诊断推理能力,有效解决了标注粒度碎片化和临床逻辑不一致的问题,在医学任务中表现出卓越性能。
English: FundusExpert is an advanced multimodal large language model designed for ophthalmology, integrating positioning and diagnostic reasoning to achieve superior performance in medical tasks by addressing annotation fragmentation and clinical reasoning inconsistencies.
Authors:Jiahao Tang, Youjun Li, Xiangting Fan, Yangxuan Zheng, Siyuan Lu, Xueping Li, Peng Fang, Chenxi Li, Zi-Gang Huang
Abstract:
Emotion recognition based on electroencephalography (EEG) holds significant promise for affective brain-computer interfaces (aBCIs). However, its practical deployment faces challenges due to the variability within inter-subject and the scarcity of labeled data in target domains. To overcome these limitations, we propose SDC-Net, a novel Semantic-Dynamic Consistency domain adaptation network for fully label-free cross-subject EEG emotion recognition. First, we introduce a Same-Subject Same-Trial Mixup strategy that generates augmented samples through intra-trial interpolation, enhancing data diversity while explicitly preserving individual identity to mitigate label ambiguity. Second, we construct a dynamic distribution alignment module within the Reproducing Kernel Hilbert Space (RKHS), jointly aligning marginal and conditional distributions through multi-objective kernel mean embedding, and leveraging a confidence-aware pseudo-labeling strategy to ensure stable adaptation. Third, we propose a dual-domain similarity consistency learning mechanism that enforces cross-domain structural constraints based on latent pairwise similarities, facilitating semantic boundary learning without reliance on temporal synchronization or label priors. To validate the effectiveness and robustness of the proposed SDC-Net, extensive experiments are conducted on three widely used EEG benchmark datasets: SEED, SEED-IV, and FACED. Comparative results against existing unsupervised domain adaptation methods demonstrate that SDC-Net achieves state-of-the-art performance in emotion recognition under both cross-subject and cross-session conditions. This advancement significantly improves the accuracy and generalization capability of emotion decoding, laying a solid foundation for real-world applications of personalized aBCIs. The source code is available at: https://github.com/XuanSuTrum/SDC-Net.
Chinese: 提出的SDC-Net通过混合增强策略、动态分布对齐和双域一致性学习,实现了无需标注数据的跨被试脑电情绪识别,在多个基准数据集上取得了最优性能。
English: The proposed SDC-Net introduces a domain adaptation framework for cross-subject EEG emotion recognition that employs mixup strategies, dynamic distribution alignment, and dual-domain consistency learning to achieve state-of-the-art performance without requiring labeled target data.
Authors:Kostas Karakontis, Thanos Petsanis, Athanasios Ch. Kapoutsis, Pavlos Ch. Kapoutsis, Elias B. Kosmatopoulos
Abstract:
Multi-UAV Coverage Path Planning (mCPP) algorithms in popular commercial software typically treat a Region of Interest (RoI) only as a 2D plane, ignoring important3D structure characteristics. This leads to incomplete 3Dreconstructions, especially around occluded or vertical surfaces. In this paper, we propose a modular algorithm that can extend commercial two-dimensional path planners to facilitate terrain-aware planning by adjusting altitude and camera orientations. To demonstrate it, we extend the well-known DARP (Divide Areas for Optimal Multi-Robot Coverage Path Planning) algorithm and produce DARP-3D. We present simulation results in multiple 3D environments and a real-world flight test using DJI hardware. Compared to baseline, our approach consistently captures improved 3D reconstructions, particularly in areas with significant vertical features. An open-source implementation of the algorithm is available here:https://github.com/konskara/TerraPlan
中文: 本文提出一种模块化算法,可将商业二维路径规划器扩展为地形感知规划,通过调整无人机高度和相机方向,显著提升三维重建效果,尤其在垂直特征区域,并通过仿真和实际飞行测试验证了其优越性。
English: The paper introduces a modular algorithm that enhances commercial 2D path planners for multi-UAV coverage by incorporating terrain-aware adjustments to altitude and camera angles, resulting in improved 3D reconstructions, especially on vertical surfaces, as validated through simulations and real-world tests.
Authors:Jiajun Luo, Yicheng Xiao, Jianru Xu, Yangxiu You, Rongwei Lu, Chen Tang, Jingyan Jiang, Zhi Wang
Abstract:
Diffusion models produce realistic images and videos but require substantial computational resources, necessitating multi-accelerator parallelism for real-time deployment. However, parallel inference introduces significant communication overhead from exchanging large activations between devices, limiting efficiency and scalability. We present CompactFusion, a compression framework that significantly reduces communication while preserving generation quality. Our key observation is that diffusion activations exhibit strong temporal redundancy-adjacent steps produce highly similar activations, saturating bandwidth with near-duplicate data carrying little new information. To address this inefficiency, we seek a more compact representation that encodes only the essential information. CompactFusion achieves this via Residual Compression that transmits only compressed residuals (step-wise activation differences). Based on empirical analysis and theoretical justification, we show that it effectively removes redundant data, enabling substantial data reduction while maintaining high fidelity. We also integrate lightweight error feedback to prevent error accumulation. CompactFusion establishes a new paradigm for parallel diffusion inference, delivering lower latency and significantly higher generation quality than prior methods. On 4xL20, it achieves 3.0x speedup while greatly improving fidelity. It also uniquely supports communication-heavy strategies like sequence parallelism on slow networks, achieving 6.7x speedup over prior overlap-based method. CompactFusion applies broadly across diffusion models and parallel settings, and integrates easily without requiring pipeline rework. Portable implementation demonstrated on xDiT is publicly available at https://github.com/Cobalt-27/CompactFusion
中文: CompactFusion是一种压缩框架,通过在设备间仅传输压缩残差来减少并行扩散模型推理中的通信开销,在保持生成质量的同时实现显著加速。
English: CompactFusion is a compression framework that reduces communication overhead in parallel diffusion model inference by transmitting only compressed residuals between devices, achieving significant speedups while maintaining generation quality.
Authors:Jorgen Cani, Christos Diou, Spyridon Evangelatos, Vasileios Argyriou, Panagiotis Radoglou-Grammatikis, Panagiotis Sarigiannidis, Iraklis Varlamis, Georgios Th. Papadopoulos
Abstract:
Automated X-ray inspection is crucial for efficient and unobtrusive security screening in various public settings. However, challenges such as object occlusion, variations in the physical properties of items, diversity in X-ray scanning devices, and limited training data hinder accurate and reliable detection of illicit items. Despite the large body of research in the field, reported experimental evaluations are often incomplete, with frequently conflicting outcomes. To shed light on the research landscape and facilitate further research, a systematic, detailed, and thorough comparative evaluation of recent Deep Learning (DL)-based methods for X-ray object detection is conducted. For this, a comprehensive evaluation framework is developed, composed of: a) Six recent, large-scale, and widely used public datasets for X-ray illicit item detection (OPIXray, CLCXray, SIXray, EDS, HiXray, and PIDray), b) Ten different state-of-the-art object detection schemes covering all main categories in the literature, including generic Convolutional Neural Network (CNN), custom CNN, generic transformer, and hybrid CNN-transformer architectures, and c) Various detection (mAP50 and mAP50:95) and time/computational-complexity (inference time (ms), parameter size (M), and computational load (GFLOPS)) metrics. A thorough analysis of the results leads to critical observations and insights, emphasizing key aspects such as: a) Overall behavior of the object detection schemes, b) Object-level detection performance, c) Dataset-specific observations, and d) Time efficiency and computational complexity analysis. To support reproducibility of the reported experimental results, the evaluation code and model weights are made publicly available at https://github.com/jgenc/xray-comparative-evaluation.
Automated X-ray inspection faces challenges like object occlusion and limited training data, prompting a systematic evaluation of deep learning methods using six datasets and ten detection schemes to provide critical insights and publicly available resources.
English Summary:
Authors:Minglong Xue, Aoxiang Ning, Shivakumara Palaiahnakote, Mingliang Zhou
Abstract:
Strong light sources in nighttime photography frequently produce flares in images, significantly degrading visual quality and impacting the performance of downstream tasks. While some progress has been made, existing methods continue to struggle with removing large-scale flare artifacts and repairing structural damage in regions near the light source. We observe that these challenging flare artifacts exhibit more significant discrepancies from the reference images in the frequency domain compared to the spatial domain. Therefore, this paper presents a novel dynamic frequency-guided deflare network (DFDNet) that decouples content information from flare artifacts in the frequency domain, effectively removing large-scale flare artifacts. Specifically, DFDNet consists mainly of a global dynamic frequency-domain guidance (GDFG) module and a local detail guidance module (LDGM). The GDFG module guides the network to perceive the frequency characteristics of flare artifacts by dynamically optimizing global frequency domain features, effectively separating flare information from content information. Additionally, we design an LDGM via a contrastive learning strategy that aligns the local features of the light source with the reference image, reduces local detail damage from flare removal, and improves fine-grained image restoration. The experimental results demonstrate that the proposed method outperforms existing state-of-the-art methods in terms of performance. The code is available at \href{https://github.com/AXNing/DFDNet}{https://github.com/AXNing/DFDNet}.
中文摘要:本文提出DFDNet,一种动态频率引导的去眩光网络,通过在频域分离内容与眩光并结合对比学习优化局部细节,有效消除夜间图像中的大规模眩光伪影。
English Summary: The paper introduces DFDNet, a dynamic frequency-guided deflare network that effectively removes large-scale flare artifacts in nighttime images by decoupling content from flares in the frequency domain and enhancing local details through contrastive learning.
Authors:Junhua Liu, Roy Ka-Wei Lee, Kwan Hui Lim
Abstract:
Human decision-making in high-stakes domains often relies on expertise and heuristics, but is vulnerable to hard-to-detect cognitive biases that threaten fairness and long-term outcomes. This work presents a novel approach to enhancing complex decision-making workflows through the integration of hierarchical learning alongside various enhancements. Focusing on university admissions as a representative high-stakes domain, we propose BGM-HAN, an enhanced Byte-Pair Encoded, Gated Multi-head Hierarchical Attention Network, designed to effectively model semi-structured applicant data. BGM-HAN captures multi-level representations that are crucial for nuanced assessment, improving both interpretability and predictive performance. Experimental results on real admissions data demonstrate that our proposed model significantly outperforms both state-of-the-art baselines from traditional machine learning to large language models, offering a promising framework for augmenting decision-making in domains where structure, context, and fairness matter. Source code is available at: https://github.com/junhua/bgm-han.
中文: 本研究提出了BGM-HAN分层注意力网络,通过建模多层级数据表征来提升大学招生等高风险领域的决策质量,有效改善公平性和预测准确性。
English: This study introduces BGM-HAN, a hierarchical attention network that enhances decision-making in high-stakes domains like university admissions by modeling multi-level data representations to improve fairness and predictive accuracy.
Authors:Francesco Tonini, Lorenzo Vaquero, Alessandro Conti, Cigdem Beyan, Elisa Ricci
Abstract:
Human-Object Interaction (HOI) detection aims to identify humans and objects within images and interpret their interactions. Existing HOI methods rely heavily on large datasets with manual annotations to learn interactions from visual cues. These annotations are labor-intensive to create, prone to inconsistency, and limit scalability to new domains and rare interactions. We argue that recent advances in Vision-Language Models (VLMs) offer untapped potential, particularly in enhancing interaction representation. While prior work has injected such potential and even proposed training-free methods, there remain key gaps. Consequently, we propose a novel training-free HOI detection framework for Dynamic Scoring with enhanced semantics (DYSCO) that effectively utilizes textual and visual interaction representations within a multimodal registry, enabling robust and nuanced interaction understanding. This registry incorporates a small set of visual cues and uses innovative interaction signatures to improve the semantic alignment of verbs, facilitating effective generalization to rare interactions. Additionally, we propose a unique multi-head attention mechanism that adaptively weights the contributions of the visual and textual features. Experimental results demonstrate that our DYSCO surpasses training-free state-of-the-art models and is competitive with training-based approaches, particularly excelling in rare interactions. Code is available at https://github.com/francescotonini/dysco.
中文: 摘要提出DYSCO,一种无需训练的人物交互检测框架,通过多模态注册表和自适应特征加权机制利用视觉语言模型增强交互表征,在罕见交互场景中表现尤为突出。
English: The abstract introduces DYSCO, a training-free Human-Object Interaction detection framework that leverages Vision-Language Models to enhance interaction representation through a multimodal registry and adaptive feature weighting, achieving superior performance especially in rare interactions.
Authors:Sneha George Gnanakalavathy, Hairil Abdul Razak, Robert Meertens, Jonathan E. Fieldsend, Xujiong Ye, Mohammed M. Abdelsamea
Abstract:
In computed tomography (CT), achieving high image quality while minimizing radiation exposure remains a key clinical challenge. This paper presents CAPRI-CT, a novel causal-aware deep learning framework for Causal Analysis and Predictive Reasoning for Image Quality Optimization in CT imaging. CAPRI-CT integrates image data with acquisition metadata (such as tube voltage, tube current, and contrast agent types) to model the underlying causal relationships that influence image quality. An ensemble of Variational Autoencoders (VAEs) is employed to extract meaningful features and generate causal representations from observational data, including CT images and associated imaging parameters. These input features are fused to predict the Signal-to-Noise Ratio (SNR) and support counterfactual inference, enabling what-if simulations, such as changes in contrast agents (types and concentrations) or scan parameters. CAPRI-CT is trained and validated using an ensemble learning approach, achieving strong predictive performance. By facilitating both prediction and interpretability, CAPRI-CT provides actionable insights that could help radiologists and technicians design more efficient CT protocols without repeated physical scans. The source code and dataset are publicly available at https://github.com/SnehaGeorge22/capri-ct.
中文: 本文提出CAPRI-CT框架,通过整合CT图像与采集参数建立因果模型,利用变分自编码器预测信噪比并支持反事实推理,为优化CT扫描方案提供可操作的见解。
English: This paper introduces CAPRI-CT, a causal-aware deep learning framework that models causal relationships between CT imaging parameters and image quality to predict SNR and enable counterfactual simulations, aiding in protocol optimization without physical scans.
Authors:Jun Li, Jinpeng Wang, Chaolei Tan, Niu Lian, Long Chen, Yaowei Wang, Min Zhang, Shu-Tao Xia, Bin Chen
Abstract:
Partially Relevant Video Retrieval (PRVR) addresses the critical challenge of matching untrimmed videos with text queries describing only partial content. Existing methods suffer from geometric distortion in Euclidean space that sometimes misrepresents the intrinsic hierarchical structure of videos and overlooks certain hierarchical semantics, ultimately leading to suboptimal temporal modeling. To address this issue, we propose the first hyperbolic modeling framework for PRVR, namely HLFormer, which leverages hyperbolic space learning to compensate for the suboptimal hierarchical modeling capabilities of Euclidean space. Specifically, HLFormer integrates the Lorentz Attention Block and Euclidean Attention Block to encode video embeddings in hybrid spaces, using the Mean-Guided Adaptive Interaction Module to dynamically fuse features. Additionally, we introduce a Partial Order Preservation Loss to enforce "text < video" hierarchy through Lorentzian cone constraints. This approach further enhances cross-modal matching by reinforcing partial relevance between video content and text queries. Extensive experiments show that HLFormer outperforms state-of-the-art methods. Code is released at https://github.com/lijun2005/ICCV25-HLFormer.
中文摘要:本文提出HLFormer,一种用于部分相关视频检索的双曲建模框架,通过融合混合空间编码和偏序保持机制,有效克服欧几里得空间在层次结构建模上的不足,显著提升了跨模态匹配性能。
English Summary: The paper introduces HLFormer, a hyperbolic modeling framework for Partially Relevant Video Retrieval that overcomes Euclidean space limitations by integrating hybrid space encoding and partial order preservation to enhance hierarchical representation and cross-modal matching.
Authors:Hao Dai, Jagmohan Chauhan
Abstract:
Continual Generalized Category Discovery (C-GCD) faces a critical challenge: incrementally learning new classes from unlabeled data streams while preserving knowledge of old classes. Existing methods struggle with catastrophic forgetting, especially when unlabeled data mixes known and novel categories. We address this by analyzing C-GCD's forgetting dynamics through a Bayesian lens, revealing that covariance misalignment between old and new classes drives performance degradation. Building on this insight, we propose Variational Bayes C-GCD (VB-CGCD), a novel framework that integrates variational inference with covariance-aware nearest-class-mean classification. VB-CGCD adaptively aligns class distributions while suppressing pseudo-label noise via stochastic variational updates. Experiments show VB-CGCD surpasses prior art by +15.21% with the overall accuracy in the final session on standard benchmarks. We also introduce a new challenging benchmark with only 10% labeled data and extended online phases, VB-CGCD achieves a 67.86% final accuracy, significantly higher than state-of-the-art (38.55%), demonstrating its robust applicability across diverse scenarios. Code is available at: https://github.com/daihao42/VB-CGCD
中文: 提出的变分贝叶斯C-GCD框架通过变分推理对齐类别分布,有效缓解持续学习中的灾难性遗忘问题,在标准基准测试中比现有方法准确率提升15.21%。
English: The proposed Variational Bayes C-GCD framework effectively mitigates catastrophic forgetting in continual learning by aligning class distributions through variational inference, achieving a 15.21% accuracy improvement over prior methods on standard benchmarks.
Authors:Hyeongmin Lee, Kyungjune Baek
Abstract:
Dynamic 4D Gaussian Splatting (4DGS) effectively extends the high-speed rendering capabilities of 3D Gaussian Splatting (3DGS) to represent volumetric videos. However, the large number of Gaussians, substantial temporal redundancies, and especially the absence of an entropy-aware compression framework result in large storage requirements. Consequently, this poses significant challenges for practical deployment, efficient edge-device processing, and data transmission. In this paper, we introduce a novel end-to-end RD-optimized compression framework tailored for 4DGS, aiming to enable flexible, high-fidelity rendering across varied computational platforms. Leveraging Fully Explicit Dynamic Gaussian Splatting (Ex4DGS), one of the state-of-the-art 4DGS methods, as our baseline, we start from the existing 3DGS compression methods for compatibility while effectively addressing additional challenges introduced by the temporal axis. In particular, instead of storing motion trajectories independently per point, we employ a wavelet transform to reflect the real-world smoothness prior, significantly enhancing storage efficiency. This approach yields significantly improved compression ratios and provides a user-controlled balance between compression efficiency and rendering quality. Extensive experiments demonstrate the effectiveness of our method, achieving up to 91$\times$ compression compared to the original Ex4DGS model while maintaining high visual fidelity. These results highlight the applicability of our framework for real-time dynamic scene rendering in diverse scenarios, from resource-constrained edge devices to high-performance environments. The source code is available at https://github.com/HyeongminLEE/RD4DGS.
中文: 本文提出了一种针对4D高斯泼溅的端到端压缩框架,通过小波变换和率失真优化实现了高达91倍的压缩比,同时保持高视觉保真度,适用于实时动态场景渲染。
English: This paper introduces an end-to-end compression framework for 4D Gaussian Splatting that achieves up to 91× compression through wavelet transforms and rate-distortion optimization while maintaining high visual fidelity for real-time dynamic scene rendering.
Authors:Ruijie Yang, Yan Zhu, Peiyao Fu, Yizhe Zhang, Zhihua Wang, Quanlin Li, Pinghong Zhou, Xian Yang, Shuo Wang
Abstract:
Colorectal cancer (CRC) remains a leading cause of cancer-related mortality, underscoring the importance of timely polyp detection and diagnosis. While deep learning models have improved optical-assisted diagnostics, they often demand extensive labeled datasets and yield "black-box" outputs with limited interpretability. In this paper, we propose EndoFinder, an online polyp retrieval framework that leverages multi-view scene representations for explainable and scalable CRC diagnosis. First, we develop a Polyp-aware Image Encoder by combining contrastive learning and a reconstruction task, guided by polyp segmentation masks. This self-supervised approach captures robust features without relying on large-scale annotated data. Next, we treat each polyp as a three-dimensional "scene" and introduce a Scene Representation Transformer, which fuses multiple views of the polyp into a single latent representation. By discretizing this representation through a hashing layer, EndoFinder enables real-time retrieval from a compiled database of historical polyp cases, where diagnostic information serves as interpretable references for new queries. We evaluate EndoFinder on both public and newly collected polyp datasets for re-identification and pathology classification. Results show that EndoFinder outperforms existing methods in accuracy while providing transparent, retrieval-based insights for clinical decision-making. By contributing a novel dataset and a scalable, explainable framework, our work addresses key challenges in polyp diagnosis and offers a promising direction for more efficient AI-driven colonoscopy workflows. The source code is available at https://github.com/ku262/EndoFinder-Scene.
中文: EndoFinder是一个基于多视角场景表征的可解释息肉检索框架,通过自监督学习和三维场景重建实现结直肠癌诊断,在提高准确率的同时为临床决策提供透明依据。
English: EndoFinder is an explainable polyp retrieval framework that uses multi-view scene representations for colorectal cancer diagnosis, achieving higher accuracy and transparency than existing methods while reducing reliance on large annotated datasets.
Authors:Peiqi Chen, Lei Yu, Yi Wan, Yingying Pei, Xinyi Liu, Yongxiang Yao, Yingying Zhang, Lixiang Ru, Liheng Zhong, Jingdong Chen, Ming Yang, Yongjun Zhang
Abstract:
Semi-dense feature matching methods have shown strong performance in challenging scenarios. However, the existing pipeline relies on a global search across the entire feature map to establish coarse matches, limiting further improvements in accuracy and efficiency. Motivated by this limitation, we propose a novel pipeline, CasP, which leverages cascaded correspondence priors for guidance. Specifically, the matching stage is decomposed into two progressive phases, bridged by a region-based selective cross-attention mechanism designed to enhance feature discriminability. In the second phase, one-to-one matches are determined by restricting the search range to the one-to-many prior areas identified in the first phase. Additionally, this pipeline benefits from incorporating high-level features, which helps reduce the computational costs of low-level feature extraction. The acceleration gains of CasP increase with higher resolution, and our lite model achieves a speedup of $\sim2.2\times$ at a resolution of 1152 compared to the most efficient method, ELoFTR. Furthermore, extensive experiments demonstrate its superiority in geometric estimation, particularly with impressive cross-domain generalization. These advantages highlight its potential for latency-sensitive and high-robustness applications, such as SLAM and UAV systems. Code is available at https://github.com/pq-chen/CasP.
中文摘要:提出的CasP流程采用级联对应先验和两阶段匹配机制,通过选择性交叉注意力增强特征区分度,在提升半稠密特征匹配精度的同时大幅加速计算,尤其在跨域泛化方面表现突出,适用于SLAM和无人机等实时高鲁棒性场景。
English Summary: The proposed CasP pipeline introduces cascaded correspondence priors and a two-phase matching process with selective cross-attention to enhance accuracy and efficiency in semi-dense feature matching, achieving significant speed improvements and superior cross-domain generalization for applications like SLAM and UAV systems.
Authors:Tobias Morocutti, Jonathan Greif, Paul Primus, Florian Schmid, Gerhard Widmer
Abstract:
Spatial semantic segmentation of sound scenes (S5) involves the accurate identification of active sound classes and the precise separation of their sources from complex acoustic mixtures. Conventional systems rely on a two-stage pipeline - audio tagging followed by label-conditioned source separation - but are often constrained by the absence of fine-grained temporal information critical for effective separation. In this work, we address this limitation by introducing a novel approach for S5 that enhances the synergy between the event detection and source separation stages. Our key contributions are threefold. First, we fine-tune a pre-trained Transformer to detect active sound classes. Second, we utilize a separate instance of this fine-tuned Transformer to perform sound event detection (SED), providing the separation module with detailed, time-varying guidance. Third, we implement an iterative refinement mechanism that progressively enhances separation quality by recursively reusing the separator's output from previous iterations. These advancements lead to significant improvements in both audio tagging and source separation performance, as demonstrated by our system's second-place finish in Task 4 of the DCASE Challenge 2025. Our implementation and model checkpoints are available in our GitHub repository: https://github.com/theMoro/dcase25task4 .
中文摘要:本研究提出了一种新颖的空间语义声音分割方法,通过微调Transformer进行声音事件检测并结合迭代优化机制,显著提升了音频分类和声源分离性能,并在DCASE 2025挑战赛中验证了其有效性。
English Summary: This study introduces a novel approach for spatial semantic sound segmentation that integrates a fine-tuned Transformer for sound event detection with an iterative refinement mechanism, significantly improving both audio classification and source separation performance as demonstrated in the DCASE Challenge 2025.
Authors:Jianxin Bi, Kevin Yuchen Ma, Ce Hao, Mike Zheng Shou, Harold Soh
Abstract:
Tactile feedback is generally recognized to be crucial for effective interaction with the physical world. However, state-of-the-art Vision-Language-Action (VLA) models lack the ability to interpret and use tactile signals, limiting their effectiveness in contact-rich tasks. Incorporating tactile feedback into these systems is challenging due to the absence of large multi-modal datasets. We present VLA-Touch, an approach that enhances generalist robot policies with tactile sensing \emph{without fine-tuning} the base VLA. Our method introduces two key innovations: (1) a pipeline that leverages a pretrained tactile-language model that provides semantic tactile feedback for high-level task planning, and (2) a diffusion-based controller that refines VLA-generated actions with tactile signals for contact-rich manipulation. Through real-world experiments, we demonstrate that our dual-level integration of tactile feedback improves task planning efficiency while enhancing execution precision. Code is open-sourced at \href{https://github.com/jxbi1010/VLA-Touch}{this URL}.
中文: VLA-Touch通过引入触觉语言模型进行任务规划和扩散控制器优化操作,在不微调基础VLA模型的情况下,提升了机器人策略的效率和执行精度。
English: VLA-Touch enhances robot policies by integrating tactile feedback through a tactile-language model for task planning and a diffusion-based controller for precise manipulation, improving efficiency and execution without fine-tuning the base VLA model.
Authors:Ruodai Cui, Li Niu, Guosheng Hu
Abstract:
Current exposure correction methods have three challenges, labor-intensive paired data annotation, limited generalizability, and performance degradation in low-level computer vision tasks. In this work, we introduce an innovative Unsupervised Exposure Correction (UEC) method that eliminates the need for manual annotations, offers improved generalizability, and enhances performance in low-level downstream tasks. Our model is trained using freely available paired data from an emulated Image Signal Processing (ISP) pipeline. This approach does not need expensive manual annotations, thereby minimizing individual style biases from the annotation and consequently improving its generalizability. Furthermore, we present a large-scale Radiometry Correction Dataset, specifically designed to emphasize exposure variations, to facilitate unsupervised learning. In addition, we develop a transformation function that preserves image details and outperforms state-of-the-art supervised methods [12], while utilizing only 0.01% of their parameters. Our work further investigates the broader impact of exposure correction on downstream tasks, including edge detection, demonstrating its effectiveness in mitigating the adverse effects of poor exposure on low-level features. The source code and dataset are publicly available at https://github.com/BeyondHeaven/uec_code.
Chinese: 本文提出了一种无监督曝光校正(UEC)方法,通过使用模拟配对数据和新型变换函数,无需人工标注,提高了泛化能力,并增强了低级视觉任务的性能。
English: This paper introduces an Unsupervised Exposure Correction (UEC) method that eliminates manual annotations, improves generalizability, and enhances performance in low-level vision tasks by using simulated paired data and a novel transformation function.
Authors:Feng Cao, Zishuo Feng, Wei Shi, Jicong Zhang
Abstract:
Extracellular recordings are transient voltage fluctuations in the vicinity of neurons, serving as a fundamental modality in neuroscience for decoding brain activity at single-neuron resolution. Spike sorting, the process of attributing each detected spike to its corresponding neuron, is a pivotal step in brain sensing pipelines. However, it remains challenging under low signal-to-noise ratio (SNR), electrode drift, and cross-session variability. In this paper, we propose HuiduRep, a robust self-supervised representation learning framework that extracts discriminative and generalizable features from extracellular recordings. By integrating contrastive learning with a denoising autoencoder, HuiduRep learns latent representations robust to noise and drift. With HuiduRep, we develop a spike sorting pipeline that clusters spike representations without ground truth labels. Experiments on hybrid and real-world datasets demonstrate that HuiduRep achieves strong robustness. Furthermore, the pipeline significantly outperforms state-of-the-art tools such as KiloSort4 and MountainSort5 on accuracy and precision on diverse datasets. These findings demonstrate the potential of self-supervised spike representation learning as a foundational tool for robust and generalizable processing of extracellular recordings. Code is available at: https://github.com/IgarashiAkatuki/HuiduRep
中文: 本文提出的HuiduRep自监督框架能从细胞外记录中学习鲁棒的尖峰表征,无需标签即可实现精确的尖峰分类,其性能显著优于KiloSort4和MountainSort5等现有工具。
English: This paper introduces HuiduRep, a self-supervised framework that learns robust spike representations from extracellular recordings, enabling accurate spike sorting without labels and outperforming existing tools like KiloSort4 and MountainSort5.
Authors:Haiyu Wu, Jaskirat Singh, Sicong Tian, Liang Zheng, Kevin W. Bowyer
Abstract:
When synthesizing identities as face recognition training data, it is generally believed that large inter-class separability and intra-class attribute variation are essential for synthesizing a quality dataset. % This belief is generally correct, and this is what we aim for. However, when increasing intra-class variation, existing methods overlook the necessity of maintaining intra-class identity consistency. % To address this and generate high-quality face training data, we propose Vec2Face+, a generative model that creates images directly from image features and allows for continuous and easy control of face identities and attributes. Using Vec2Face+, we obtain datasets with proper inter-class separability and intra-class variation and identity consistency using three strategies: 1) we sample vectors sufficiently different from others to generate well-separated identities; 2) we propose an AttrOP algorithm for increasing general attribute variations; 3) we propose LoRA-based pose control for generating images with profile head poses, which is more efficient and identity-preserving than AttrOP. % Our system generates VFace10K, a synthetic face dataset with 10K identities, which allows an FR model to achieve state-of-the-art accuracy on seven real-world test sets. Scaling the size to 4M and 12M images, the corresponding VFace100K and VFace300K datasets yield higher accuracy than the real-world training dataset, CASIA-WebFace, on five real-world test sets. This is the first time a synthetic dataset beats the CASIA-WebFace in average accuracy. In addition, we find that only 1 out of 11 synthetic datasets outperforms random guessing (\emph{i.e., 50\%}) in twin verification and that models trained with synthetic identities are more biased than those trained with real identities. Both are important aspects for future investigation. Code is available at https://github.com/HaiyuWu/Vec2Face_plus
中文: Vec2Face+ 是一种生成模型,通过直接处理图像特征并控制人脸身份与属性,在保持类内身份一致性的同时增强类间差异与类内变化,生成的合成人脸数据集在多个真实测试集上达到最先进精度,首次超越真实数据集CASIA-WebFace。
English: Vec2Face+ is a generative model that synthesizes high-quality face recognition training data by ensuring strong inter-class separability and intra-class variation while maintaining identity consistency, achieving state-of-the-art accuracy on real-world test sets and surpassing real datasets like CASIA-WebFace.
Authors:Shaohan Li, Hao Yang, Min Chen, Xiaolin Qin
Abstract:
The increasing frequency of extreme weather events due to global climate change urges accurate weather prediction. Recently, great advances have been made by the \textbf{end-to-end methods}, thanks to deep learning techniques, but they face limitations of \textit{representation inconsistency} in multivariable integration and struggle to effectively capture the dependency between variables, which is required in complex weather systems. Treating different variables as distinct modalities and applying a \textbf{two-stage training approach} from multimodal models can partially alleviate this issue, but due to the inconformity in training tasks between the two stages, the results are often suboptimal. To address these challenges, we propose an implicit two-stage training method, configuring separate encoders and decoders for each variable. In detailed, in the first stage, the Translator is frozen while the Encoders and Decoders learn a shared latent space, in the second stage, the Encoders and Decoders are frozen, and the Translator captures inter-variable interactions for prediction. Besides, by introducing a self-attention mechanism for multivariable fusion in the latent space, the performance achieves further improvements. Empirically, extensive experiments show the state-of-the-art performance of our method. Specifically, it reduces the MSE for near-surface air temperature and relative humidity predictions by 28.82\% and 23.39\%, respectively. The source code is available at https://github.com/ShremG/Met2Net.
中文: 针对端到端天气预测模型中的表征不一致和变量依赖关系捕捉问题,本研究提出采用分变量编码解码的隐式两阶段训练方法,并通过潜在空间自注意力机制增强多变量融合,在温度和湿度预测上实现显著误差降低的突破性性能。
English: To address representation inconsistency and dependency capture issues in end-to-end weather prediction models, this study introduces an implicit two-stage training method with separate encoders and decoders per variable, enhanced by self-attention for multivariable fusion, achieving state-of-the-art performance with significant error reductions in temperature and humidity forecasts.
Authors:Lingfeng Zeng, Fangqi Lou, Zixuan Wang, Jiajie Xu, Jinyi Niu, Mengping Li, Yifan Dong, Qi Qi, Wei Zhang, Ziwei Yang, Jun Han, Ruilun Feng, Ruiqi Hu, Lejie Zhang, Zhengbo Feng, Yicheng Ren, Xin Guo, Zhaowei Liu, Dongpo Cheng, Weige Cai, Liwen Zhang
Abstract:
The booming development of AI agents presents unprecedented opportunities for automating complex tasks across various domains. However, their multi-step, multi-tool collaboration capabilities in the financial sector remain underexplored. This paper introduces FinGAIA, an end-to-end benchmark designed to evaluate the practical abilities of AI agents in the financial domain. FinGAIA comprises 407 meticulously crafted tasks, spanning seven major financial sub-domains: securities, funds, banking, insurance, futures, trusts, and asset management. These tasks are organized into three hierarchical levels of scenario depth: basic business analysis, asset decision support, and strategic risk management. We evaluated 10 mainstream AI agents in a zero-shot setting. The best-performing agent, ChatGPT, achieved an overall accuracy of 48.9\%, which, while superior to non-professionals, still lags financial experts by over 35 percentage points. Error analysis has revealed five recurring failure patterns: Cross-modal Alignment Deficiency, Financial Terminological Bias, Operational Process Awareness Barrier, among others. These patterns point to crucial directions for future research. Our work provides the first agent benchmark closely related to the financial domain, aiming to objectively assess and promote the development of agents in this crucial field. Partial data is available at https://github.com/SUFE-AIFLM-Lab/FinGAIA.
中文: 本文提出FinGAIA金融领域基准测试,通过407项任务评估AI代理能力,发现最优模型准确率仍远逊于专家,并揭示了五大改进方向。
English: This paper introduces FinGAIA, a comprehensive benchmark for evaluating AI agents' performance in financial tasks, revealing that even top models like ChatGPT significantly lag behind experts and identifying key areas for improvement.
Authors:Zhiqiang Liu, Enpei Niu, Yin Hua, Mengshu Sun, Lei Liang, Huajun Chen, Wen Zhang
Abstract:
Although large language models (LLMs) have made significant progress in understanding Structured Knowledge (SK) like KG and Table, existing evaluations for SK understanding are non-rigorous (i.e., lacking evaluations of specific capabilities) and focus on a single type of SK. Therefore, we aim to propose a more comprehensive and rigorous structured knowledge understanding benchmark to diagnose the shortcomings of LLMs. In this paper, we introduce SKA-Bench, a Structured Knowledge Augmented QA Benchmark that encompasses four widely used structured knowledge forms: KG, Table, KG+Text, and Table+Text. We utilize a three-stage pipeline to construct SKA-Bench instances, which includes a question, an answer, positive knowledge units, and noisy knowledge units. To evaluate the SK understanding capabilities of LLMs in a fine-grained manner, we expand the instances into four fundamental ability testbeds: Noise Robustness, Order Insensitivity, Information Integration, and Negative Rejection. Empirical evaluations on 8 representative LLMs, including the advanced DeepSeek-R1, indicate that existing LLMs still face significant challenges in understanding structured knowledge, and their performance is influenced by factors such as the amount of noise, the order of knowledge units, and hallucination phenomenon. Our dataset and code are available at https://github.com/zjukg/SKA-Bench.
中文: 本文提出SKA-Bench这一结构化知识理解基准,通过四种知识形式和四项基础能力测试评估大语言模型,发现现有模型在处理噪声、顺序敏感性和幻觉现象方面仍面临重大挑战。
English: This paper introduces SKA-Bench, a comprehensive benchmark for evaluating large language models' structured knowledge understanding across four knowledge forms and four fundamental abilities, revealing that current models still struggle significantly with noise, order sensitivity, and hallucinations.
Authors:Zhiqiang Liu, Enpei Niu, Yin Hua, Mengshu Sun, Lei Liang, Huajun Chen, Wen Zhang
Abstract:
Although large language models (LLMs) have made significant progress in understanding Structured Knowledge (SK) like KG and Table, existing evaluations for SK understanding are non-rigorous (i.e., lacking evaluations of specific capabilities) and focus on a single type of SK. Therefore, we aim to propose a more comprehensive and rigorous structured knowledge understanding benchmark to diagnose the shortcomings of LLMs. In this paper, we introduce SKA-Bench, a Structured Knowledge Augmented QA Benchmark that encompasses four widely used structured knowledge forms: KG, Table, KG+Text, and Table+Text. We utilize a three-stage pipeline to construct SKA-Bench instances, which includes a question, an answer, positive knowledge units, and noisy knowledge units. To evaluate the SK understanding capabilities of LLMs in a fine-grained manner, we expand the instances into four fundamental ability testbeds: Noise Robustness, Order Insensitivity, Information Integration, and Negative Rejection. Empirical evaluations on 8 representative LLMs, including the advanced DeepSeek-R1, indicate that existing LLMs still face significant challenges in understanding structured knowledge, and their performance is influenced by factors such as the amount of noise, the order of knowledge units, and hallucination phenomenon. Our dataset and code are available at https://github.com/zjukg/SKA-Bench.
中文: 本文提出SKA-Bench这一结构化知识理解基准,通过四种知识形式和四项基础能力测试评估大语言模型,发现现有模型在处理噪声、顺序敏感性和幻觉现象方面仍面临重大挑战。
English: This paper introduces SKA-Bench, a comprehensive benchmark for evaluating large language models' structured knowledge understanding across four knowledge forms and four fundamental abilities, revealing that current models still struggle significantly with noise, order sensitivity, and hallucinations.
Authors:Bharath Krishnamurthy, Ajita Rattani
Abstract:
Ocular biometrics in the visible spectrum have emerged as a prominent modality due to their high accuracy, resistance to spoofing, and non-invasive nature. However, morphing attacks, synthetic biometric traits created by blending features from multiple individuals, threaten biometric system integrity. While extensively studied for near-infrared iris and face biometrics, morphing in visible-spectrum ocular data remains underexplored. Simulating such attacks demands advanced generation models that handle uncontrolled conditions while preserving detailed ocular features like iris boundaries and periocular textures. To address this gap, we introduce DOOMGAN, that encompasses landmark-driven encoding of visible ocular anatomy, attention-guided generation for realistic morph synthesis, and dynamic weighting of multi-faceted losses for optimized convergence. DOOMGAN achieves over 20% higher attack success rates than baseline methods under stringent thresholds, along with 20% better elliptical iris structure generation and 30% improved gaze consistency. We also release the first comprehensive ocular morphing dataset to support further research in this domain.
Chinese: DOOMGAN作为一种先进模型,通过结合地标驱动编码、注意力引导生成和动态损失加权,有效模拟可见光谱眼部生物特征的形态攻击,相比基线方法在攻击成功率和特征保持方面表现更优。
English: DOOMGAN is introduced as an advanced model that effectively simulates morphing attacks on visible-spectrum ocular biometrics by integrating landmark-driven encoding, attention-guided generation, and dynamic loss weighting, achieving superior attack success rates and feature preservation compared to baseline methods.
Authors:Ruodai Cui, Lei Zhang
Abstract:
Existing image contrast enhancement methods are typically designed for specific tasks such as under-/over-exposure correction, low-light and backlit image enhancement, etc. The learned models, however, exhibit poor generalization performance across different tasks, even across different datasets of a specific task. It is important to explore whether we can learn a universal and generalized model for various contrast enhancement tasks. In this work, we observe that the common key factor of these tasks lies in the need of exposure and contrast adjustment, which can be well-addressed if high-dynamic range (HDR) inputs are available. We hence collect 46,928 HDR raw images from public sources, and render 328,496 sRGB images to build multi-exposure sequences (MES) and the corresponding pseudo sRGB ground-truths via multi-exposure fusion. Consequently, we train a network to generate an MES from a single sRGB image, followed by training another network to fuse the generated MES into an enhanced image. Our proposed method, namely UNiversal Image Contrast Enhancer (UNICE), is free of costly human labeling. However, it demonstrates significantly stronger generalization performance than existing image contrast enhancement methods across and within different tasks, even outperforming manually created ground-truths in multiple no-reference image quality metrics. The dataset, code and model are available at https://github.com/BeyondHeaven/UNICE.
中文: 现有图像对比度增强方法通常针对特定任务设计且泛化能力差,而UNICE通过利用高动态范围输入和多曝光融合技术,无需人工标注即可显著提升不同任务间的通用性和增强效果。
English: Current image contrast enhancement methods are task-specific and lack generalization, but UNICE introduces a universal model that uses high-dynamic range inputs and multi-exposure fusion to significantly improve performance across various tasks without human labeling.
Authors:Fangze Lin, Ying He, Fei Yu, Hong Zhang
Abstract:
Predicting the future motion of road participants is a critical task in autonomous driving. In this work, we address the challenge of low-quality generation of low-probability modes in multi-agent joint prediction. To tackle this issue, we propose a two-stage multi-agent interactive prediction framework named \textit{keypoint-guided joint prediction after classification-aware marginal proposal} (JAM). The first stage is modeled as a marginal prediction process, which classifies queries by trajectory type to encourage the model to learn all categories of trajectories, providing comprehensive mode information for the joint prediction module. The second stage is modeled as a joint prediction process, which takes the scene context and the marginal proposals from the first stage as inputs to learn the final joint distribution. We explicitly introduce key waypoints to guide the joint prediction module in better capturing and leveraging the critical information from the initial predicted trajectories. We conduct extensive experiments on the real-world Waymo Open Motion Dataset interactive prediction benchmark. The results show that our approach achieves competitive performance. In particular, in the framework comparison experiments, the proposed JAM outperforms other prediction frameworks and achieves state-of-the-art performance in interactive trajectory prediction. The code is available at https://github.com/LinFunster/JAM to facilitate future research.
Chinese: 本文提出JAM双阶段框架,通过先对轨迹类型分类确保模式覆盖,再利用关键路径点优化联合预测,在Waymo数据集上实现了交互式轨迹预测的最优性能。
English: This paper introduces JAM, a two-stage framework that enhances multi-agent trajectory prediction by first classifying trajectory types for comprehensive mode coverage and then using key waypoints to refine joint predictions, achieving state-of-the-art results on the Waymo dataset.
Authors:Anirudh Satheesh, Anant Khandelwal, Mucong Ding, Radu Balan
Abstract:
Neural operators offer a powerful paradigm for solving partial differential equations (PDEs) that cannot be solved analytically by learning mappings between function spaces. However, there are two main bottlenecks in training neural operators: they require a significant amount of training data to learn these mappings, and this data needs to be labeled, which can only be accessed via expensive simulations with numerical solvers. To alleviate both of these issues simultaneously, we propose PICore, an unsupervised coreset selection framework that identifies the most informative training samples without requiring access to ground-truth PDE solutions. PICore leverages a physics-informed loss to select unlabeled inputs by their potential contribution to operator learning. After selecting a compact subset of inputs, only those samples are simulated using numerical solvers to generate labels, reducing annotation costs. We then train the neural operator on the reduced labeled dataset, significantly decreasing training time as well. Across four diverse PDE benchmarks and multiple coreset selection strategies, PICore achieves up to 78% average increase in training efficiency relative to supervised coreset selection methods with minimal changes in accuracy. We provide code at https://github.com/Asatheesh6561/PICore.
Chinese: PICore 是一种无监督核心集选择框架,通过基于物理信息的损失函数筛选最具信息量的未标记输入,显著降低神经算子的数据标注成本与训练时间,在精度损失最小的情况下实现高达78%的效率提升。
English: PICore is an unsupervised coreset selection framework that reduces both data labeling costs and training time for neural operators by identifying the most informative unlabeled inputs using physics-informed loss, achieving up to 78% efficiency gains with minimal accuracy loss.
Authors:Ting Jiang, Yixiao Wang, Hancheng Ye, Zishan Shao, Jingwei Sun, Jingyang Zhang, Zekai Chen, Jianyi Zhang, Yiran Chen, Hai Li
Abstract:
Diffusion models have achieved remarkable success in generative tasks but suffer from high computational costs due to their iterative sampling process and quadratic attention costs. Existing training-free acceleration strategies that reduce per-step computation cost, while effectively reducing sampling time, demonstrate low faithfulness compared to the original baseline. We hypothesize that this fidelity gap arises because (a) different prompts correspond to varying denoising trajectory, and (b) such methods do not consider the underlying ODE formulation and its numerical solution. In this paper, we propose Stability-guided Adaptive Diffusion Acceleration (SADA), a novel paradigm that unifies step-wise and token-wise sparsity decisions via a single stability criterion to accelerate sampling of ODE-based generative models (Diffusion and Flow-matching). For (a), SADA adaptively allocates sparsity based on the sampling trajectory. For (b), SADA introduces principled approximation schemes that leverage the precise gradient information from the numerical ODE solver. Comprehensive evaluations on SD-2, SDXL, and Flux using both EDM and DPM++ solvers reveal consistent $\ge 1.8\times$ speedups with minimal fidelity degradation (LPIPS $\leq 0.10$ and FID $\leq 4.5$) compared to unmodified baselines, significantly outperforming prior methods. Moreover, SADA adapts seamlessly to other pipelines and modalities: It accelerates ControlNet without any modifications and speeds up MusicLDM by $1.8\times$ with $\sim 0.01$ spectrogram LPIPS.
中文: 扩散模型存在计算成本高且现有加速方法保真度低的问题,而提出的SADA框架通过自适应稀疏分配和利用ODE求解器梯度,在多种模型和模态上实现了显著加速且质量损失极小。
English: Diffusion models face high computational costs and fidelity issues with current acceleration methods, but the proposed SADA framework adaptively optimizes sparsity and leverages ODE solver gradients to achieve significant speedups with minimal quality loss across various models and modalities.
Authors:Seokhwan Jeong, Hogyun Kim, Younggun Cho
Abstract:
This paper presents a novel spherical target-based LiDAR-camera extrinsic calibration method designed for outdoor environments with multi-robot systems, considering both target and sensor corruption. The method extracts the 2D ellipse center from the image and the 3D sphere center from the pointcloud, which are then paired to compute the transformation matrix. Specifically, the image is first decomposed using the Segment Anything Model (SAM). Then, a novel algorithm extracts an ellipse from a potentially corrupted sphere, and the extracted center of ellipse is corrected for errors caused by the perspective projection model. For the LiDAR pointcloud, points on the sphere tend to be highly noisy due to the absence of flat regions. To accurately extract the sphere from these noisy measurements, we apply a hierarchical weighted sum to the accumulated pointcloud. Through experiments, we demonstrated that the sphere can be robustly detected even under both types of corruption, outperforming other targets. We evaluated our method using three different types of LiDARs (spinning, solid-state, and non-repetitive) with cameras positioned in three different locations. Furthermore, we validated the robustness of our method to target corruption by experimenting with spheres subjected to various types of degradation. These experiments were conducted in both a planetary test and a field environment. Our code is available at https://github.com/sparolab/MARSCalib.
本文提出了一种新颖的球形目标激光雷达-相机外参标定方法,适用于户外多机器人系统,通过提取图像中的二维椭圆中心和点云中的三维球心进行配对计算,能够有效应对目标和传感器损坏问题。
This paper introduces a robust LiDAR-camera calibration method for outdoor multi-robot systems that effectively handles both target and sensor corruption by extracting and pairing 2D ellipse centers from images and 3D sphere centers from pointclouds.
Authors:Masayoshi Someya, Taisuke Yamada, Tomohisa Okazaki
Abstract:
The Okada model is a widely used analytical solution for displacements and strains caused by a point or rectangular dislocation source in a 3D elastic half-space. We present OkadaTorch, a PyTorch implementation of the Okada model, where the entire code is differentiable; gradients with respect to input can be easily computed using automatic differentiation (AD). Our work consists of two components: a direct translation of the original Okada model into PyTorch, and a convenient wrapper interface for efficiently computing gradients and Hessians with respect to either observation station coordinates or fault parameters. This differentiable framework is well suited for fault parameter inversion, including gradient-based optimization, Bayesian inference, and integration with scientific machine learning (SciML) models. Our code is available here: https://github.com/msomeya1/OkadaTorch
Chinese: OkadaTorch是Okada模型的可微分PyTorch实现,能够高效计算梯度以进行断层参数反演,并与科学机器学习模型集成。
English: OkadaTorch is a differentiable PyTorch implementation of the Okada model, enabling efficient gradient computation for fault parameter inversion and integration with scientific machine learning.
Authors:Zaipeng Duan, Chenxu Dang, Xuzhong Hu, Pei An, Junfeng Ding, Jie Zhan, Yunbiao Xu, Jie Ma
Abstract:
Multimodal 3D occupancy prediction has garnered significant attention for its potential in autonomous driving. However, most existing approaches are single-modality: camera-based methods lack depth information, while LiDAR-based methods struggle with occlusions. Current lightweight methods primarily rely on the Lift-Splat-Shoot (LSS) pipeline, which suffers from inaccurate depth estimation and fails to fully exploit the geometric and semantic information of 3D LiDAR points. Therefore, we propose a novel multimodal occupancy prediction network called SDG-OCC, which incorporates a joint semantic and depth-guided view transformation coupled with a fusion-to-occupancy-driven active distillation. The enhanced view transformation constructs accurate depth distributions by integrating pixel semantics and co-point depth through diffusion and bilinear discretization. The fusion-to-occupancy-driven active distillation extracts rich semantic information from multimodal data and selectively transfers knowledge to image features based on LiDAR-identified regions. Finally, for optimal performance, we introduce SDG-Fusion, which uses fusion alone, and SDG-KL, which integrates both fusion and distillation for faster inference. Our method achieves state-of-the-art (SOTA) performance with real-time processing on the Occ3D-nuScenes dataset and shows comparable performance on the more challenging SurroundOcc-nuScenes dataset, demonstrating its effectiveness and robustness. The code will be released at https://github.com/DzpLab/SDGOCC.
Chinese: 提出的SDG-OCC网络通过联合语义和深度引导的视角变换与主动蒸馏技术,克服了现有多模态3D占据预测方法的局限性,在基准数据集上实现了最先进的性能并支持实时处理。
English: The proposed SDG-OCC network introduces a joint semantic and depth-guided view transformation with active distillation to overcome limitations in existing multimodal 3D occupancy prediction methods, achieving state-of-the-art performance on benchmark datasets while enabling real-time processing.
Authors:Zheng Tan, Tariq D. Aslam, Andrea L. Bertozzi
Abstract:
We explore a novel way to numerically resolve the scaling behavior of finite-time singularities in solutions of nonlinear parabolic PDEs. The Runge--Kutta--Legendre (RKL) and Runge--Kutta--Gegenbauer (RKG) super-time-stepping methods were originally developed for nonlinear complex physics problems with diffusion. These are multi-stage single step second-order, forward-in-time methods with no implicit solves. The advantage is that the timestep size for stability scales with stage number $s$ as $\mathcal{O}(s^2)$. Many interesting nonlinear PDEs have finite-time singularities, and the presence of diffusion often limits one to using implicit or semi-implicit timestep methods for stability constraints. Finite-time singularities are particularly challenging due to the large range of scales that one desires to resolve, often with adaptive spatial grids and adaptive timesteps. Here we show two examples of nonlinear PDEs for which the self-similar singularity structure has time and space scales that are resolvable using the RKL and RKG methods, without forcing even smaller timesteps. Compared to commonly-used implicit numerical methods, we achieve significantly smaller run time while maintaining comparable accuracy. We also prove numerical monotonicity for both the RKL and RKG methods under their linear stability conditions for the constant coefficient heat equation, in the case of infinite domain and periodic boundary condition, leading to a theoretical guarantee of the superiority of the RKL and RKG methods over traditional super-time-stepping methods, such as the Runge-Kutta-Chebyshev (RKC) and the orthogonal Runge-Kutta-Chebyshev (ROCK) methods. Code can be found at https://github.com/ZT220501/SRK-Singularity.
中文: 本研究证明RKL和RKG超时间步进方法能有效求解非线性抛物型偏微分方程的有限时间奇点,相比传统隐式方法在保持相当精度的同时显著提升了计算效率。
English: This study demonstrates that RKL and RKG super-time-stepping methods can efficiently resolve finite-time singularities in nonlinear parabolic PDEs with improved computational speed and comparable accuracy over traditional implicit methods.
Authors:Arduin Findeis, Floris Weers, Guoli Yin, Ke Ye, Ruoming Pang, Tom Gunter
Abstract:
Pairwise preferences over model responses are widely collected to evaluate and provide feedback to large language models (LLMs). Given two alternative model responses to the same input, a human or AI annotator selects the "better" response. This approach can provide feedback for domains where other hard-coded metrics are difficult to obtain (e.g., chat response quality), thereby helping model evaluation or training. However, for some domains high-quality pairwise comparisons can be tricky to obtain - from AI and humans. For example, for responses with many factual statements, annotators may disproportionately weigh writing quality rather than underlying facts. In this work, we explore augmenting standard AI annotator systems with additional tools to improve performance on three challenging response domains: long-form factual, math and code tasks. We propose a tool-using agentic system to provide higher quality feedback on these domains. Our system uses web-search and code execution to ground itself based on external validation, independent of the LLM's internal knowledge and biases. We provide extensive experimental results evaluating our method across the three targeted response domains as well as general annotation tasks, using RewardBench (incl. AlpacaEval and LLMBar), RewardMath, as well as three new datasets for domains with saturated pre-existing datasets. Our results indicate that external tools can indeed improve performance in many, but not all, cases. More generally, our experiments highlight the sensitivity of performance to simple parameters (e.g., prompt) and the need for improved (non-saturated) annotator benchmarks. We share our code at https://github.com/apple/ml-agent-evaluator.
中文: 本研究提出了一种利用工具的系统,通过结合网络搜索和代码执行来增强AI标注器,以改进在长文本事实、数学和代码等挑战性领域的成对偏好评估,结果表明外部工具在许多情况下能提升性能,同时强调了改进标注基准的必要性。
English: This study introduces a tool-using agentic system that enhances AI annotators with web-search and code execution to improve pairwise preference evaluations for challenging domains like long-form factual, math, and code tasks, showing that external tools boost performance in many cases while highlighting the need for better benchmarks.
Authors:Yueyao Xu, Yize Chen
Abstract:
Faced with increasing penetration of distributed energy resources and fast development of distribution grid energy management, topology identification of distribution grid becomes an important and fundamental task. As the underlying grid topology is usually unknown or incomplete to the utilities, it is becoming a fundamental task to efficiently identify the distribution grid network topology using limited measurements. A fast and accurate topology identification can help achieving the tasks of load monitoring, operation and control of power distribution system as well as outage detection. In this paper, we propose a novel and ultra-fast topology identification method. By adapting the subset sum method with a hierarchical structure, the overall grid topology can be inferred from fewer samples of smart meter power measurements. Such techniques can be applied in real time under the scenarios with fast topology change, and the proposed hierarchical algorithm is also robust against measurement noises.
Chinese: 本文提出了一种新颖、超快速的配电网拓扑识别方法,通过采用分层结构的子集和方法,能够从有限的智能电表数据中实时推断整体电网结构,并对测量噪声具有鲁棒性。
English: This paper introduces a novel, ultra-fast method for identifying distribution grid topology by adapting the subset sum method with a hierarchical structure, enabling real-time inference from limited smart meter data and robustness against noise.
Authors:Shmuel Amar, Ori Shapira, Aviv Slobodkin, Ido Dagan
Abstract:
A broad range of NLP tasks involve selecting relevant text spans from given source texts. Despite this shared objective, such \textit{content selection} tasks have traditionally been studied in isolation, each with its own modeling approaches, datasets, and evaluation metrics. In this work, we propose \textit{instruction-guided content selection (IGCS)} as a beneficial unified framework for such settings, where the task definition and any instance-specific request are encapsulated as instructions to a language model. To promote this framework, we introduce \igcsbench{}, the first unified benchmark covering diverse content selection tasks. Further, we create a large generic synthetic dataset that can be leveraged for diverse content selection tasks, and show that transfer learning with these datasets often boosts performance, whether dedicated training for the targeted task is available or not. Finally, we address generic inference time issues that arise in LLM-based modeling of content selection, assess a generic evaluation metric, and overall propose the utility of our resources and methods for future content selection models. Models and datasets available at https://github.com/shmuelamar/igcs.
中文:本文提出了指令引导内容选择(IGCS)作为统一框架来处理多种NLP内容选择任务,并配套发布了基准测试、合成数据集及优化方法,有效提升性能并解决推理难题。
English: This paper introduces instruction-guided content selection (IGCS) as a unified framework for diverse NLP content selection tasks, accompanied by a benchmark, synthetic dataset, and methods that enhance performance and address inference challenges.
Authors:Yue Ma, Kunyu Feng, Zhongyuan Hu, Xinyu Wang, Yucheng Wang, Mingzhe Zheng, Xuanhua He, Chenyang Zhu, Hongyu Liu, Yingqing He, Zeyu Wang, Zhifeng Li, Xiu Li, Wei Liu, Dan Xu, Linfeng Zhang, Qifeng Chen
Abstract:
With the rapid development of AI-generated content (AIGC), video generation has emerged as one of its most dynamic and impactful subfields. In particular, the advancement of video generation foundation models has led to growing demand for controllable video generation methods that can more accurately reflect user intent. Most existing foundation models are designed for text-to-video generation, where text prompts alone are often insufficient to express complex, multi-modal, and fine-grained user requirements. This limitation makes it challenging for users to generate videos with precise control using current models. To address this issue, recent research has explored the integration of additional non-textual conditions, such as camera motion, depth maps, and human pose, to extend pretrained video generation models and enable more controllable video synthesis. These approaches aim to enhance the flexibility and practical applicability of AIGC-driven video generation systems. In this survey, we provide a systematic review of controllable video generation, covering both theoretical foundations and recent advances in the field. We begin by introducing the key concepts and commonly used open-source video generation models. We then focus on control mechanisms in video diffusion models, analyzing how different types of conditions can be incorporated into the denoising process to guide generation. Finally, we categorize existing methods based on the types of control signals they leverage, including single-condition generation, multi-condition generation, and universal controllable generation. For a complete list of the literature on controllable video generation reviewed, please visit our curated repository at https://github.com/mayuelala/Awesome-Controllable-Video-Generation.
中文摘要:本综述系统梳理了可控视频生成领域,针对纯文本提示的局限性,探索了相机运动、深度图等多模态控制机制,以提升人工智能生成视频与用户意图的匹配度。
English Summary: This survey systematically reviews the field of controllable video generation, addressing the limitations of text-only prompts by exploring multi-modal control mechanisms like camera motion and depth maps to enhance user intent alignment in AI-generated videos.
Authors:Joey Spronck, Leander van Eekelen, Dominique van Midden, Joep Bogaerts, Leslie Tessier, Valerie Dechering, Muradije Demirel-Andishmand, Gabriel Silva de Souza, Roland Nemeth, Enrico Munari, Giuseppe Bogina, Ilaria Girolami, Albino Eccher, Balazs Acs, Ceren Boyaci, Natalie Klubickova, Monika Looijen-Salamon, Shoko Vos, Francesco Ciompi
Abstract:
The tumor immune microenvironment (TIME) in non-small cell lung cancer (NSCLC) histopathology contains morphological and molecular characteristics predictive of immunotherapy response. Computational quantification of TIME characteristics, such as cell detection and tissue segmentation, can support biomarker development. However, currently available digital pathology datasets of NSCLC for the development of cell detection or tissue segmentation algorithms are limited in scope, lack annotations of clinically prevalent metastatic sites, and forgo molecular information such as PD-L1 immunohistochemistry (IHC). To fill this gap, we introduce the IGNITE data toolkit, a multi-stain, multi-centric, and multi-scanner dataset of annotated NSCLC whole-slide images. We publicly release 887 fully annotated regions of interest from 155 unique patients across three complementary tasks: (i) multi-class semantic segmentation of tissue compartments in H&E-stained slides, with 16 classes spanning primary and metastatic NSCLC, (ii) nuclei detection, and (iii) PD-L1 positive tumor cell detection in PD-L1 IHC slides. To the best of our knowledge, this is the first public NSCLC dataset with manual annotations of H&E in metastatic sites and PD-L1 IHC.
中文摘要:IGNITE数据工具包作为首个公开的NSCLC数据集,提供了包含转移部位H&E染色和PD-L1免疫组化的全面标注,填补了当前数字病理学资源的空白。
English Summary: The IGNITE data toolkit is introduced as the first public NSCLC dataset with comprehensive annotations for tissue segmentation, nuclei detection, and PD-L1 IHC analysis, addressing current limitations in digital pathology resources.
Authors:Ning Li, Xiangmou Qu, Jiamu Zhou, Jun Wang, Muning Wen, Kounianhua Du, Xingyu Lou, Qiuying Peng, Jun Wang, Weinan Zhang
Abstract:
Recent advances in Multimodal Large Language Models (MLLMs) have enabled the development of mobile agents that can understand visual inputs and follow user instructions, unlocking new possibilities for automating complex tasks on mobile devices. However, applying these models to real-world mobile scenarios remains a significant challenge due to the long-horizon task execution, difficulty in error recovery, and the cold-start problem in unfamiliar environments. To address these challenges, we propose MobileUse, a GUI agent designed for robust and adaptive mobile task execution. To improve resilience in long-horizon tasks and dynamic environments, we introduce a hierarchical reflection architecture that enables the agent to self-monitor, detect, and recover from errors across multiple temporal scales-ranging from individual actions to overall task completion-while maintaining efficiency through a reflection-on-demand strategy. To tackle cold-start issues, we further introduce a proactive exploration module, which enriches the agent's understanding of the environment through self-planned exploration. Evaluations on AndroidWorld and AndroidLab benchmarks demonstrate that MobileUse establishes new state-of-the-art performance, achieving success rates of 62.9% and 44.2%, respectively. To facilitate real-world applications, we release an out-of-the-box toolkit for automated task execution on physical mobile devices, which is available at https://github.com/MadeAgents/mobile-use.
中文摘要:MobileUse是一款用于移动任务自动化的GUI代理,它通过分层反思架构实现错误恢复,并采用主动探索解决冷启动问题,在基准测试中达到了最先进的性能水平。
English Summary: MobileUse is a robust GUI agent that addresses challenges in mobile task automation with a hierarchical reflection architecture for error recovery and proactive exploration to overcome cold-start issues, achieving state-of-the-art performance on benchmarks.
Authors:Giovanni De Toni, Erasmo Purificato, Emilia Gómez, Bruno Lepri, Andrea Passerini, Cristian Consonni
Abstract:
Recommenders are significantly shaping online information consumption. While effective at personalizing content, these systems increasingly face criticism for propagating irrelevant, unwanted, and even harmful recommendations. Such content degrades user satisfaction and contributes to significant societal issues, including misinformation, radicalization, and erosion of user trust. Although platforms offer mechanisms to mitigate exposure to undesired content, these mechanisms are often insufficiently effective and slow to adapt to users' feedback. This paper introduces an intuitive, model-agnostic, and distribution-free method that uses conformal risk control to provably bound unwanted content in personalized recommendations by leveraging simple binary feedback on items. We also address a limitation of traditional conformal risk control approaches, i.e., the fact that the recommender can provide a smaller set of recommended items, by leveraging implicit feedback on consumed items to expand the recommendation set while ensuring robust risk mitigation. Our experimental evaluation on data coming from a popular online video-sharing platform demonstrates that our approach ensures an effective and controllable reduction of unwanted recommendations with minimal effort. The source code is available here: https://github.com/geektoni/mitigating-harm-recsys.
中文: 本文提出一种与模型无关的方法,利用合规风险控制和二元用户反馈来可靠地限制个性化推荐中的不良内容,通过最少投入有效减少有害建议。
English: This paper presents a model-agnostic method using conformal risk control to provably limit unwanted content in personalized recommendations through binary user feedback, effectively reducing harmful suggestions with minimal effort.
Authors:Run-Ze Fan, Zengzhi Wang, Pengfei Liu
Abstract:
Scientific reasoning is critical for developing AI scientists and supporting human researchers in advancing the frontiers of natural science discovery. However, the open-source community has primarily focused on mathematics and coding while neglecting the scientific domain, largely due to the absence of open, large-scale, high-quality, verifiable scientific reasoning datasets. To bridge this gap, we first present TextbookReasoning, an open dataset featuring truthful reference answers extracted from 12k university-level scientific textbooks, comprising 650k reasoning questions spanning 7 scientific disciplines. We further introduce MegaScience, a large-scale mixture of high-quality open-source datasets totaling 1.25 million instances, developed through systematic ablation studies that evaluate various data selection methodologies to identify the optimal subset for each publicly available scientific dataset. Meanwhile, we build a comprehensive evaluation system covering diverse subjects and question types across 15 benchmarks, incorporating comprehensive answer extraction strategies to ensure accurate evaluation metrics. Our experiments demonstrate that our datasets achieve superior performance and training efficiency with more concise response lengths compared to existing open-source scientific datasets. Furthermore, we train Llama3.1, Qwen2.5, and Qwen3 series base models on MegaScience, which significantly outperform the corresponding official instruct models in average performance. In addition, MegaScience exhibits greater effectiveness for larger and stronger models, suggesting a scaling benefit for scientific tuning. We release our data curation pipeline, evaluation system, datasets, and seven trained models to the community to advance scientific reasoning research.
中文: 本研究推出了TextbookReasoning和MegaScience两个开放数据集,旨在填补高质量科学推理资源的空白,这些数据集在多个基准测试和基础模型上显著提升了性能与训练效率。
English: This study introduces TextbookReasoning and MegaScience, two open datasets designed to address the scarcity of high-quality scientific reasoning resources, which significantly enhance model performance and training efficiency across multiple benchmarks and base models.
Authors:Yanjun Zheng, Xiyang Du, Longfei Liao, Xiaoke Zhao, Zhaowen Zhou, Jingze Song, Bo Zhang, Jiawei Liu, Xiang Qi, Zhe Li, Zhiqiang Zhang, Wei Wang, Peng Zhang
Abstract:
Large Language Models (LLMs) exhibit considerable promise in financial applications; however, prevailing models frequently demonstrate limitations when confronted with scenarios that necessitate sophisticated reasoning capabilities, stringent trustworthiness criteria, and efficient adaptation to domain-specific requirements. We introduce the Agentar-Fin-R1 series of financial large language models (8B and 32B parameters), specifically engineered based on the Qwen3 foundation model to enhance reasoning capabilities, reliability, and domain specialization for financial applications. Our optimization approach integrates a high-quality, systematic financial task label system with a comprehensive multi-layered trustworthiness assurance framework. This framework encompasses high-quality trustworthy knowledge engineering, multi-agent trustworthy data synthesis, and rigorous data validation governance. Through label-guided automated difficulty-aware optimization, tow-stage training pipeline, and dynamic attribution systems, we achieve substantial improvements in training efficiency. Our models undergo comprehensive evaluation on mainstream financial benchmarks including Fineva, FinEval, and FinanceIQ, as well as general reasoning datasets such as MATH-500 and GPQA-diamond. To thoroughly assess real-world deployment capabilities, we innovatively propose the Finova evaluation benchmark, which focuses on agent-level financial reasoning and compliance verification. Experimental results demonstrate that Agentar-Fin-R1 not only achieves state-of-the-art performance on financial tasks but also exhibits exceptional general reasoning capabilities, validating its effectiveness as a trustworthy solution for high-stakes financial applications. The Finova bench is available at https://github.com/antgroup/Finova.
中文:基于Qwen3开发的Agentar-Fin-R1系列金融大模型,通过多层可信保障框架和标签引导优化,在金融任务和通用推理基准测试中均表现出卓越性能,为高风险金融应用提供了可靠解决方案。
English: The Agentar-Fin-R1 series of financial large language models, built on Qwen3, significantly enhances reasoning, reliability, and domain specialization through advanced optimization frameworks, achieving top performance on financial benchmarks and demonstrating strong general reasoning capabilities.
Authors:Changhao Li, Xinrui Chen, Ji Wang, Kang Zhao, Jianfei Chen
Abstract:
Quantization is a key technique to reduce network size and computational complexity by representing the network parameters with a lower precision. Traditional quantization methods rely on access to original training data, which is often restricted due to privacy concerns or security challenges. Zero-shot Quantization (ZSQ) addresses this by using synthetic data generated from pre-trained models, eliminating the need for real training data. Recently, ZSQ has been extended to object detection. However, existing methods use unlabeled task-agnostic synthetic images that lack the specific information required for object detection, leading to suboptimal performance. In this paper, we propose a novel task-specific ZSQ framework for object detection networks, which consists of two main stages. First, we introduce a bounding box and category sampling strategy to synthesize a task-specific calibration set from the pre-trained network, reconstructing object locations, sizes, and category distributions without any prior knowledge. Second, we integrate task-specific training into the knowledge distillation process to restore the performance of quantized detection networks. Extensive experiments conducted on the MS-COCO and Pascal VOC datasets demonstrate the efficiency and state-of-the-art performance of our method. Our code is publicly available at: https://github.com/DFQ-Dojo/dfq-toolkit .
中文: 本文提出了一种新颖的面向目标检测网络的任务特定零样本量化框架,通过合成任务特定校准数据并结合任务感知训练来恢复性能,在基准数据集上取得了领先水平的结果。
English: This paper introduces a novel task-specific zero-shot quantization framework for object detection networks that synthesizes task-specific calibration data and integrates task-aware training to restore performance, achieving state-of-the-art results on benchmark datasets.
Authors:Ran Wang, Xiaoxuan Liu, Hao Ren, Gang Chen, Fanchao Qi, Maosong Sun
Abstract:
Structured decoding enables large language models (LLMs) to generate outputs in formats required by downstream systems, such as HTML or JSON. However, existing methods suffer from efficiency bottlenecks due to grammar compilation, state tracking, and mask creation. We observe that many real-world tasks embed strong prior knowledge about output structure. Leveraging this, we propose a decomposition of constraints into static and dynamic components -- precompiling static structures offline and instantiating dynamic arguments at runtime using grammar snippets. Instead of relying on pushdown automata, we employ a compositional set of operators to model regular formats, achieving lower transition latency. We introduce wgrammar, a lightweight decoding engine that integrates domain-aware simplification, constraint decomposition, and mask caching, achieving up to 250x speedup over existing systems. wgrammar's source code is publicly available at https://github.com/wrran/wgrammar.
中文: 提出的wgrammar引擎通过将约束分解为静态和动态组件,利用预编译结构和语法片段加速结构化解码,相比现有方法实现了高达250倍的加速效果。
English: The proposed wgrammar engine accelerates structured decoding by decomposing constraints into static and dynamic components, using precompiled structures and grammar snippets to achieve up to 250x speedup over existing methods.
Authors:Marcel Kleinmann, Shashank Agnihotri, Margret Keuper
Abstract:
Faithfulness and interpretability are essential for deploying deep neural networks (DNNs) in safety-critical domains such as medical imaging. B-cos networks offer a promising solution by replacing standard linear layers with a weight-input alignment mechanism, producing inherently interpretable, class-specific explanations without post-hoc methods. While maintaining diagnostic performance competitive with state-of-the-art DNNs, standard B-cos models suffer from severe aliasing artifacts in their explanation maps, making them unsuitable for clinical use where clarity is essential. In this work, we address these limitations by introducing anti-aliasing strategies using FLCPooling (FLC) and BlurPool (BP) to significantly improve explanation quality. Our experiments on chest X-ray datasets demonstrate that the modified $\text{B-cos}_\text{FLC}$ and $\text{B-cos}_\text{BP}$ preserve strong predictive performance while providing faithful and artifact-free explanations suitable for clinical application in multi-class and multi-label settings. Code available at: GitHub repository (url: https://github.com/mkleinma/B-cos-medical-paper).
Chinese: 本研究通过引入FLCPooling和BlurPool等抗锯齿策略改进B-cos网络,在保持诊断性能的同时消除解释图中的伪影,使其适用于医学影像的临床应用。
English: The study enhances B-cos networks by integrating anti-aliasing techniques like FLCPooling and BlurPool, which eliminate artifacts in explanation maps while preserving diagnostic accuracy, making them viable for clinical use in medical imaging.
Authors:Keneni W. Tesema, Lyndon Hill, Mark W. Jones, Gary K. L. Tam
Abstract:
Point cloud completion is crucial for 3D computer vision tasks in autonomous driving, augmented reality, and robotics. However, obtaining clean and complete point clouds from real-world environments is challenging due to noise and occlusions. Consequently, most existing completion networks -- trained on synthetic data -- struggle with real-world degradations. In this work, we tackle the problem of completing and denoising highly corrupted partial point clouds affected by multiple simultaneous degradations. To benchmark robustness, we introduce the Corrupted Point Cloud Completion Dataset (CPCCD), which highlights the limitations of current methods under diverse corruptions. Building on these insights, we propose DWCNet (Denoising-While-Completing Network), a completion framework enhanced with a Noise Management Module (NMM) that leverages contrastive learning and self-attention to suppress noise and model structural relationships. DWCNet achieves state-of-the-art performance on both clean and corrupted, synthetic and real-world datasets. The dataset and code will be publicly available at https://github.com/keneniwt/DWCNET-Robust-Point-Cloud-Completion-against-Corruptions
中文: 本文提出DWCNet点云补全框架,通过噪声管理模块有效处理多种现实世界退化问题,在合成和真实数据集上均实现了最先进的性能。
English: This paper introduces DWCNet, a robust point cloud completion framework with a Noise Management Module that effectively handles multiple real-world degradations, achieving state-of-the-art performance on both synthetic and real-world datasets.
Authors:Yilong Xu, Xiang Long, Zhi Zheng, Jinhua Gao
Abstract:
Agentic search, as a more autonomous and adaptive paradigm of retrieval augmentation, is driving the evolution of intelligent search systems. However, existing evaluation frameworks fail to align well with the goals of agentic search. First, the complex queries commonly used in current benchmarks often deviate from realistic user search scenarios. Second, prior approaches tend to introduce noise when extracting ground truth for end-to-end evaluations, leading to distorted assessments at a fine-grained level. Third, most current frameworks focus solely on the quality of final answers, neglecting the evaluation of the iterative process inherent to agentic search. To address these limitations, we propose RAVine -- a Reality-Aligned eValuation framework for agentic LLMs with search. RAVine targets multi-point queries and long-form answers that better reflect user intents, and introduces an attributable ground truth construction strategy to enhance the accuracy of fine-grained evaluation. Moreover, RAVine examines model's interaction with search tools throughout the iterative process, and accounts for factors of efficiency. We benchmark a series of models using RAVine and derive several insights, which we hope will contribute to advancing the development of agentic search systems. The code and datasets are available at https://github.com/SwordFaith/RAVine.
中文: 该摘要提出了RAVine评估框架,通过关注现实查询、精确构建真实答案和迭代过程评估,解决了现有智能搜索系统评估基准的不足,以推动其发展。
English: The abstract introduces RAVine, a reality-aligned evaluation framework designed to address the shortcomings of existing benchmarks for agentic search systems by focusing on realistic queries, accurate ground truth construction, and iterative process evaluation.
Authors:Yiguo He, Junjie Zhu, Yiying Li, Xiaoyu Zhang, Chunping Qiu, Jun Wang, Qiangjuan Huang, Ke Yang
Abstract:
The application of Vision-language foundation models (VLFMs) to remote sensing (RS) imagery has garnered significant attention due to their superior capability in various downstream tasks. A key challenge lies in the scarcity of high-quality, large-scale, image-text paired training data. Recently, several works introduced extensive image-text datasets for RS and trained their VLFMs. However, due to the rudimentary methods used for generating captions, the quality of datasets is suboptimal, requiring larger volumes of training data, while only yielding modest performance improvements. In this paper, we propose a two-stage method named MpGI(Multi-Perspective Generation and Integration) for generating high-quality text captions for RS images. Firstly, we generate distinct and detailed descriptions from different perspectives using Rule-MLLM(Multimodal Large Language Model) Relay Generation and MLLMs generation methods. Next, we utilize Large Language Models (LLMs) to integrate these diverse descriptions into comprehensive captions, capturing details from multiple perspectives. Finally, we have created the HQRS-IT-210K dataset, including about 210,000 RS images and 1.3 million captions. We fine-tuned two VLFMs using our dataset: CLIP, a discriminative model, and CoCa, an image-to-text generative model. This process resulted in our proposed HQRS-CLIP and RS-CoCa models. Experimental results demonstrate that HQRS-CLIP surpassed the previous SOTA RS CLIP model in various downstream tasks while using only 4.2\% of the training data. RS-CoCa outperforms other advanced approaches across benchmark datasets and can generate captions for RS images that rival or even exceed manual annotations. Dataset, pre-trained models, and codes will be released at https://github.com/YiguoHe/HQRS-210K-and-HQRS-CLIP.
Chinese: 本文提出了一种名为MpGI的两阶段方法,用于生成高质量的遥感图像文本描述,并创建了HQRS-IT-210K数据集,该数据集以极少的训练数据显著提升了视觉语言基础模型的性能。
English: This paper introduces MpGI, a two-stage method for generating high-quality captions for remote sensing images, and the resulting HQRS-IT-210K dataset, which significantly enhances VLFM performance with minimal training data.
Authors:Pingyi Fan, Anbai Jiang, Shuwei Zhang, Zhiqiang Lv, Bing Han, Xinhu Zheng, Wenrui Liang, Junjie Li, Wei-Qiang Zhang, Yanmin Qian, Xie Chen, Cheng Lu, Jia Liu
Abstract:
With the rapid deployment of SCADA systems, how to effectively analyze industrial signals and detect abnormal states is an urgent need for the industry. Due to the significant heterogeneity of these signals, which we summarize as the M5 problem, previous works only focus on small sub-problems and employ specialized models, failing to utilize the synergies between modalities and the powerful scaling law. However, we argue that the M5 signals can be modeled in a unified manner due to the intrinsic similarity. As a result, we propose FISHER, a Foundation model for multi-modal Industrial Signal compreHEnsive Representation. To support arbitrary sampling rates, FISHER considers the increment of sampling rate as the concatenation of sub-band information. Specifically, FISHER takes the STFT sub-band as the modeling unit and adopts a teacher student SSL framework for pre-training. We also develop the RMIS benchmark, which evaluates the representations of M5 industrial signals on multiple health management tasks. Compared with top SSL models, FISHER showcases versatile and outstanding capabilities with a general performance gain up to 5.03%, along with much more efficient scaling curves. We also investigate the scaling law on downstream tasks and derive potential avenues for future works. FISHER is now open-sourced on https://github.com/jianganbai/FISHER
中文: 随着SCADA系统的快速部署,工业信号分析需求日益迫切;为此提出的FISHER基础模型通过子带建模和师生自监督框架,在多模态工业信号表征上实现最高5.03%的性能提升,展现出卓越的泛化能力。
English: The rapid expansion of SCADA systems necessitates effective analysis of heterogeneous industrial signals, leading to the development of FISHER, a unified foundation model that leverages a teacher-student SSL framework and sub-band processing to achieve versatile performance gains of up to 5.03% over specialized models.
Authors:Zongzheng Zhang, Jiawen Yang, Ziqiao Peng, Meng Yang, Jianzhu Ma, Lin Cheng, Huazhe Xu, Hang Zhao, Hao Zhao
Abstract:
Previous animatronic faces struggle to express emotions effectively due to hardware and software limitations. On the hardware side, earlier approaches either use rigid-driven mechanisms, which provide precise control but are difficult to design within constrained spaces, or tendon-driven mechanisms, which are more space-efficient but challenging to control. In contrast, we propose a hybrid actuation approach that combines the best of both worlds. The eyes and mouth-key areas for emotional expression-are controlled using rigid mechanisms for precise movement, while the nose and cheek, which convey subtle facial microexpressions, are driven by strings. This design allows us to build a compact yet versatile hardware platform capable of expressing a wide range of emotions. On the algorithmic side, our method introduces a self-modeling network that maps motor actions to facial landmarks, allowing us to automatically establish the relationship between blendshape coefficients for different facial expressions and the corresponding motor control signals through gradient backpropagation. We then train a neural network to map speech input to corresponding blendshape controls. With our method, we can generate distinct emotional expressions such as happiness, fear, disgust, and anger, from any given sentence, each with nuanced, emotion-specific control signals-a feature that has not been demonstrated in earlier systems. We release the hardware design and code at https://github.com/ZZongzheng0918/Morpheus-Hardware and https://github.com/ZZongzheng0918/Morpheus-Software.
中文: 本研究提出了一种结合刚性和肌腱驱动机制的混合驱动系统,用于实现精确细腻的面部表情,并开发了自建模神经网络将语音映射为特定情感的运动控制信号,从而能够从任意语句生成多样化的情感表达。
English: This study introduces a hybrid actuation system combining rigid and tendon-driven mechanisms for precise and nuanced facial expressions, alongside a self-modeling neural network that maps speech to emotion-specific motor controls, enabling diverse emotional displays from any sentence.
Authors:Boyong Wu, Chao Yan, Chen Hu, Cheng Yi, Chengli Feng, Fei Tian, Feiyu Shen, Gang Yu, Haoyang Zhang, Jingbei Li, Mingrui Chen, Peng Liu, Wang You, Xiangyu Tony Zhang, Xingyuan Li, Xuerui Yang, Yayue Deng, Yechang Huang, Yuxin Li, Yuxin Zhang, Zhao You, Brian Li, Changyi Wan, Hanpeng Hu, Jiangjie Zhen, Siyu Chen, Song Yuan, Xuelin Zhang, Yimin Jiang, Yu Zhou, Yuxiang Yang, Bingxin Li, Buyun Ma, Changhe Song, Dongqing Pang, Guoqiang Hu, Haiyang Sun, Kang An, Na Wang, Shuli Gao, Wei Ji, Wen Li, Wen Sun, Xuan Wen, Yong Ren, Yuankai Ma, Yufan Lu, Bin Wang, Bo Li, Changxin Miao, Che Liu, Chen Xu, Dapeng Shi, Dingyuan Hu, Donghang Wu, Enle Liu, Guanzhe Huang, Gulin Yan, Han Zhang, Hao Nie, Haonan Jia, Hongyu Zhou, Jianjian Sun, Jiaoren Wu, Jie Wu, Jie Yang, Jin Yang, Junzhe Lin, Kaixiang Li, Lei Yang, Liying Shi, Li Zhou, Longlong Gu, Ming Li, Mingliang Li, Mingxiao Li, Nan Wu, Qi Han, Qinyuan Tan, Shaoliang Pang, Shengjie Fan, Siqi Liu, Tiancheng Cao, Wanying Lu, Wenqing He, Wuxun Xie, Xu Zhao, Xueqi Li, Yanbo Yu, Yang Yang, Yi Liu, Yifan Lu, Yilei Wang, Yuanhao Ding, Yuanwei Liang, Yuanwei Lu, Yuchu Luo, Yuhe Yin, Yumeng Zhan, Yuxiang Zhang, Zidong Yang, Zixin Zhang, Binxing Jiao, Daxin Jiang, Heung-Yeung Shum, Jiansheng Chen, Jing Li, Xiangyu Zhang, Yibo Zhu
Abstract:
This paper presents Step-Audio 2, an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding. To facilitate genuine end-to-end speech conversation, Step-Audio 2 incorporates the generation of discrete audio tokens into language modeling, significantly enhancing its responsiveness to paralinguistic information such as speaking styles and emotions. To effectively leverage the rich textual and acoustic knowledge in real-world data, Step-Audio 2 integrates retrieval-augmented generation (RAG) and is able to call external tools such as web search to mitigate hallucination and audio search to switch timbres. Trained on millions of hours of speech and audio data, Step-Audio 2 delivers intelligence and expressiveness across diverse conversational scenarios. Evaluation results demonstrate that Step-Audio 2 achieves state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. Please visit https://github.com/stepfun-ai/Step-Audio2 for more information.
Chinese: Step-Audio 2是一种端到端多模态大语言模型,通过融合潜在音频编码、强化学习和检索增强生成技术,在音频理解和语音对话任务中实现了业界领先的性能表现。
English: Step-Audio 2 is an end-to-end multimodal large language model that achieves state-of-the-art performance in audio understanding and speech conversation by integrating latent audio encoding, reinforcement learning, and retrieval-augmented generation.
Authors:Meng Lou, Yunxiang Fu, Yizhou Yu
Abstract:
Transformers and Mamba, initially invented for natural language processing, have inspired backbone architectures for visual recognition. Recent studies integrated Local Attention Transformers with Mamba to capture both local details and global contexts. Despite competitive performance, these methods are limited to simple stacking of Transformer and Mamba layers without any interaction mechanism between them. Thus, deep integration between Transformer and Mamba layers remains an open problem. We address this problem by proposing A2Mamba, a powerful Transformer-Mamba hybrid network architecture, featuring a new token mixer termed Multi-scale Attention-augmented State Space Model (MASS), where multi-scale attention maps are integrated into an attention-augmented SSM (A2SSM). A key step of A2SSM performs a variant of cross-attention by spatially aggregating the SSM's hidden states using the multi-scale attention maps, which enhances spatial dependencies pertaining to a two-dimensional space while improving the dynamic modeling capabilities of SSMs. Our A2Mamba outperforms all previous ConvNet-, Transformer-, and Mamba-based architectures in visual recognition tasks. For instance, A2Mamba-L achieves an impressive 86.1% top-1 accuracy on ImageNet-1K. In semantic segmentation, A2Mamba-B exceeds CAFormer-S36 by 2.5% in mIoU, while exhibiting higher efficiency. In object detection and instance segmentation with Cascade Mask R-CNN, A2Mamba-S surpasses MambaVision-B by 1.2%/0.9% in AP^b/AP^m, while having 40% less parameters. Code is publicly available at https://github.com/LMMMEng/A2Mamba.
中文: 提出的A2Mamba架构通过新型多尺度注意力增强状态空间模型(MASS)实现了Transformer与Mamba层的深度融合,在多项视觉识别任务中以更高效率达到最优性能。
English: The proposed A2Mamba architecture integrates Transformer and Mamba layers through a novel Multi-scale Attention-augmented State Space Model (MASS), achieving state-of-the-art performance across multiple visual recognition tasks with superior efficiency.
Authors:Xueming Fu, Pei Wu, Yingtai Li, Xin Luo, Zihang Jiang, Junhao Mei, Jian Lu, Gao-Jun Teng, S. Kevin Zhou
Abstract:
Accurate analysis of cardiac motion is crucial for evaluating cardiac function. While dynamic cardiac magnetic resonance imaging (CMR) can capture detailed tissue motion throughout the cardiac cycle, the fine-grained 4D cardiac motion tracking remains challenging due to the homogeneous nature of myocardial tissue and the lack of distinctive features. Existing approaches can be broadly categorized into image based and representation-based, each with its limitations. Image-based methods, including both raditional and deep learning-based registration approaches, either struggle with topological consistency or rely heavily on extensive training data. Representation-based methods, while promising, often suffer from loss of image-level details. To address these limitations, we propose Dynamic 3D Gaussian Representation (Dyna3DGR), a novel framework that combines explicit 3D Gaussian representation with implicit neural motion field modeling. Our method simultaneously optimizes cardiac structure and motion in a self-supervised manner, eliminating the need for extensive training data or point-to-point correspondences. Through differentiable volumetric rendering, Dyna3DGR efficiently bridges continuous motion representation with image-space alignment while preserving both topological and temporal consistency. Comprehensive evaluations on the ACDC dataset demonstrate that our approach surpasses state-of-the-art deep learning-based diffeomorphic registration methods in tracking accuracy. The code will be available in https://github.com/windrise/Dyna3DGR.
Chinese: 提出的动态3D高斯表示(Dyna3DGR)框架结合显式3D高斯表示与隐式神经运动场,以自监督方式实现精准心脏运动追踪,在ACDC数据集上超越现有方法且无需大量训练数据。
English: The proposed Dynamic 3D Gaussian Representation (Dyna3DGR) framework combines explicit 3D Gaussian representation with implicit neural motion fields to achieve accurate cardiac motion tracking in a self-supervised manner, outperforming existing methods on the ACDC dataset without requiring extensive training data.
Authors:Xiaojiao Xiao, Qinmin Vivian Hu, Guanghui Wang
Abstract:
Medical image synthesis plays a crucial role in clinical workflows, addressing the common issue of missing imaging modalities due to factors such as extended scan times, scan corruption, artifacts, patient motion, and intolerance to contrast agents. The paper presents a novel image synthesis network, the Pyramid Hierarchical Masked Diffusion Model (PHMDiff), which employs a multi-scale hierarchical approach for more detailed control over synthesizing high-quality images across different resolutions and layers. Specifically, this model utilizes randomly multi-scale high-proportion masks to speed up diffusion model training, and balances detail fidelity and overall structure. The integration of a Transformer-based Diffusion model process incorporates cross-granularity regularization, modeling the mutual information consistency across each granularity's latent spaces, thereby enhancing pixel-level perceptual accuracy. Comprehensive experiments on two challenging datasets demonstrate that PHMDiff achieves superior performance in both the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM), highlighting its capability to produce high-quality synthesized images with excellent structural integrity. Ablation studies further confirm the contributions of each component. Furthermore, the PHMDiff model, a multi-scale image synthesis framework across and within medical imaging modalities, shows significant advantages over other methods. The source code is available at https://github.com/xiaojiao929/PHMDiff
中文摘要:本文提出PHMDiff模型,通过多尺度掩码扩散和跨粒度正则化方法,在加速训练的同时提升了医学图像合成的质量与结构完整性,在多个数据集上表现出优越性能。
English Summary: The paper introduces PHMDiff, a pyramid hierarchical masked diffusion model that accelerates training and enhances medical image synthesis quality by employing multi-scale masked diffusion and cross-granularity regularization, achieving superior performance on benchmark datasets.
Authors:Shang Liu, Chenjie Cao, Chaohui Yu, Wen Qian, Jing Wang, Fan Wang
Abstract:
Despite the remarkable developments achieved by recent 3D generation works, scaling these methods to geographic extents, such as modeling thousands of square kilometers of Earth's surface, remains an open challenge. We address this through a dual innovation in data infrastructure and model architecture. First, we introduce Aerial-Earth3D, the largest 3D aerial dataset to date, consisting of 50k curated scenes (each measuring 600m x 600m) captured across the U.S. mainland, comprising 45M multi-view Google Earth frames. Each scene provides pose-annotated multi-view images, depth maps, normals, semantic segmentation, and camera poses, with explicit quality control to ensure terrain diversity. Building on this foundation, we propose EarthCrafter, a tailored framework for large-scale 3D Earth generation via sparse-decoupled latent diffusion. Our architecture separates structural and textural generation: 1) Dual sparse 3D-VAEs compress high-resolution geometric voxels and textural 2D Gaussian Splats (2DGS) into compact latent spaces, largely alleviating the costly computation suffering from vast geographic scales while preserving critical information. 2) We propose condition-aware flow matching models trained on mixed inputs (semantics, images, or neither) to flexibly model latent geometry and texture features independently. Extensive experiments demonstrate that EarthCrafter performs substantially better in extremely large-scale generation. The framework further supports versatile applications, from semantic-guided urban layout generation to unconditional terrain synthesis, while maintaining geographic plausibility through our rich data priors from Aerial-Earth3D. Our project page is available at https://whiteinblue.github.io/earthcrafter/
中文摘要:本研究提出了迄今最大的三维航拍数据集Aerial-Earth3D,并开发了EarthCrafter框架,通过稀疏解耦的潜在扩散技术实现大规模三维地球生成,在提升计算效率的同时保持了地理合理性。
English Summary: The study introduces Aerial-Earth3D, the largest 3D aerial dataset, and EarthCrafter, a novel framework using sparse-decoupled latent diffusion to enable large-scale 3D Earth generation with enhanced computational efficiency and geographic accuracy.
Authors:Abhash Kumar Jha, Shakiba Moradian, Arjun Krishnakumar, Martin Rapp, Frank Hutter
Abstract:
Gradient-based one-shot neural architecture search (NAS) has significantly reduced the cost of exploring architectural spaces with discrete design choices, such as selecting operations within a model. However, the field faces two major challenges. First, evaluations of gradient-based NAS methods heavily rely on the DARTS benchmark, despite the existence of other available benchmarks. This overreliance has led to saturation, with reported improvements often falling within the margin of noise. Second, implementations of gradient-based one-shot NAS methods are fragmented across disparate repositories, complicating fair and reproducible comparisons and further development. In this paper, we introduce Configurable Optimizer (confopt), an extensible library designed to streamline the development and evaluation of gradient-based one-shot NAS methods. Confopt provides a minimal API that makes it easy for users to integrate new search spaces, while also supporting the decomposition of NAS optimizers into their core components. We use this framework to create a suite of new DARTS-based benchmarks, and combine them with a novel evaluation protocol to reveal a critical flaw in how gradient-based one-shot NAS methods are currently assessed. The code can be found at https://github.com/automl/ConfigurableOptimizer.
中文: 本文介绍了可配置优化器(confopt),这是一个可扩展的库,旨在解决基于梯度的单次神经架构搜索中对DARTS基准的过度依赖和实现碎片化问题,通过支持新搜索空间的便捷集成和分解NAS优化器核心组件,同时利用新型基准和评估协议揭示了当前评估方法中的关键缺陷。
English: This paper introduces Configurable Optimizer (confopt), an extensible library that addresses the overreliance on the DARTS benchmark and fragmented implementations in gradient-based one-shot neural architecture search by enabling easy integration of new search spaces and decomposing NAS optimizers, while also revealing a critical flaw in current evaluation methods through novel benchmarks and protocols.
Authors:Xiaoyan Wang, Zeju Li, Yifan Xu, Jiaxing Qi, Zhifei Yang, Ruifei Ma, Xiangde Liu, Chao Zhang
Abstract:
New era has unlocked exciting possibilities for extending Large Language Models (LLMs) to tackle 3D vision-language tasks. However, most existing 3D multimodal LLMs (MLLMs) rely on compressing holistic 3D scene information or segmenting independent objects to perform these tasks, which limits their spatial awareness due to insufficient representation of the richness inherent in 3D scenes. To overcome these limitations, we propose Spatial 3D-LLM, a 3D MLLM specifically designed to enhance spatial awareness for 3D vision-language tasks by enriching the spatial embeddings of 3D scenes. Spatial 3D-LLM integrates an LLM backbone with a progressive spatial awareness scheme that progressively captures spatial information as the perception field expands, generating location-enriched 3D scene embeddings to serve as visual prompts. Furthermore, we introduce two novel tasks: 3D object distance measurement and 3D layout editing, and construct a 3D instruction dataset, MODEL, to evaluate the model's spatial awareness capabilities. Experimental results demonstrate that Spatial 3D-LLM achieves state-of-the-art performance across a wide range of 3D vision-language tasks, revealing the improvements stemmed from our progressive spatial awareness scheme of mining more profound spatial information. Our code is available at https://github.com/bjshuyuan/Spatial-3D-LLM.
中文: 提出的Spatial 3D-LLM通过渐进式空间嵌入增强3D视觉语言任务的空间感知能力,并借助新任务和专用数据集验证了其领先性能。
English: The proposed Spatial 3D-LLM enhances spatial awareness in 3D vision-language tasks through progressive spatial embeddings and achieves state-of-the-art performance, as validated by novel tasks and a dedicated dataset.
Authors:Junying Wang, Zicheng Zhang, Yijin Guo, Farong Wen, Ye Shen, Yingji Liang, Yalun Wu, Wenzhe Li, Chunyi Li, Zijian Chen, Qi Jia, Guangtao Zhai
Abstract:
As foundation models grow rapidly in capability and deployment, evaluating their scientific understanding becomes increasingly critical. Existing science benchmarks have made progress towards broad **Range**, wide **Reach**, and high **Rigor**, yet they often face two major challenges: **data leakage risks** that compromise benchmarking validity, and **evaluation inefficiency** due to large-scale testing. To address these issues, we introduce the **Ever-Evolving Science Exam (EESE)**, a dynamic benchmark designed to reliably assess scientific capabilities in foundation models. Our approach consists of two components: 1) a non-public **EESE-Pool** with over 100K expertly constructed science instances (question-answer pairs) across 5 disciplines and 500+ subfields, built through a multi-stage pipeline ensuring **Range**, **Reach**, and **Rigor**, 2) a periodically updated 500-instance subset **EESE**, sampled and validated to enable leakage-resilient, low-overhead evaluations. Experiments on 32 open- and closed-source models demonstrate that EESE effectively differentiates the strengths and weaknesses of models in scientific fields and cognitive dimensions. Overall, EESE provides a robust, scalable, and forward-compatible solution for science benchmark design, offering a realistic measure of how well foundation models handle science questions. The project page is at: https://github.com/aiben-ch/EESE.
Chinese: 本文提出的“持续演进科学考试(EESE)”是一个动态基准,通过构建大规模非公开题库和定期更新子集,有效解决数据泄露和评估效率问题,为评估基础模型的科学能力提供了可靠、可扩展的解决方案。
English: The Ever-Evolving Science Exam (EESE) is introduced as a dynamic benchmark to reliably evaluate the scientific capabilities of foundation models, addressing data leakage and inefficiency issues through a large, non-public question pool and periodic updates for leakage-resilient, low-overhead assessments.
Authors:Junying Wang, Zicheng Zhang, Yijin Guo, Farong Wen, Ye Shen, Yingji Liang, Yalun Wu, Wenzhe Li, Chunyi Li, Zijian Chen, Qi Jia, Guangtao Zhai
Abstract:
As foundation models grow rapidly in capability and deployment, evaluating their scientific understanding becomes increasingly critical. Existing science benchmarks have made progress towards broad Range, wide Reach, and high Rigor, yet they often face two major challenges: data leakage risks that compromise benchmarking validity, and evaluation inefficiency due to large-scale testing. To address these issues, we introduce the Ever-Evolving Science Exam (EESE), a dynamic benchmark designed to reliably assess scientific capabilities in foundation models. Our approach consists of two components: 1) a non-public EESE-Pool with over 100K expertly constructed science instances (question-answer pairs) across 5 disciplines and 500+ subfields, built through a multi-stage pipeline ensuring Range, Reach, and Rigor, 2) a periodically updated 500-instance subset EESE, sampled and validated to enable leakage-resilient, low-overhead evaluations. Experiments on 32 open- and closed-source models demonstrate that EESE effectively differentiates the strengths and weaknesses of models in scientific fields and cognitive dimensions. Overall, EESE provides a robust, scalable, and forward-compatible solution for science benchmark design, offering a realistic measure of how well foundation models handle science questions. The project page is at: https://github.com/aiben-ch/EESE.
Chinese: 本文提出的“持续演进科学考试(EESE)”是一个动态基准,通过构建大规模非公开题库和定期更新子集,有效解决数据泄露和评估效率问题,为评估基础模型的科学能力提供了可靠、可扩展的解决方案。
English: The Ever-Evolving Science Exam (EESE) is introduced as a dynamic benchmark to reliably evaluate the scientific capabilities of foundation models, addressing data leakage and inefficiency issues through a large, non-public question pool and periodic updates for leakage-resilient, low-overhead assessments.
Authors:Fabrizio Nunnari, Shailesh Mishra, Patrick Gebhard
Abstract:
This paper describes the MMS-Player, an open source software able to synthesise sign language animations from a novel sign language representation format called MMS (MultiModal Signstream). The MMS enhances gloss-based representations by adding information on parallel execution of signs, timing, and inflections. The implementation consists of Python scripts for the popular Blender 3D authoring tool and can be invoked via command line or HTTP API. Animations can be rendered as videos or exported in other popular 3D animation exchange formats. The software is freely available under GPL-3.0 license at https://github.com/DFKI-SignLanguage/MMS-Player.
中文: MMS-Player 是一款开源软件,它利用MMS格式合成手语动画,该格式通过添加并行执行、计时和变形信息来增强基于注释的表示,并通过Blender中的Python脚本实现,可通过命令行或HTTP API访问。
English: The MMS-Player is an open-source software that synthesizes sign language animations using the MMS format, which improves gloss-based representations with details on parallel execution, timing, and inflections, and it is implemented via Python scripts in Blender, accessible through command line or HTTP API.
Authors:Kai Deng, Zexin Ti, Jiawei Xu, Jian Yang, Jin Xie
Abstract:
Foundation models for 3D vision have recently demonstrated remarkable capabilities in 3D perception. However, extending these models to large-scale RGB stream 3D reconstruction remains challenging due to memory limitations. In this work, we propose VGGT-Long, a simple yet effective system that pushes the limits of monocular 3D reconstruction to kilometer-scale, unbounded outdoor environments. Our approach addresses the scalability bottlenecks of existing models through a chunk-based processing strategy combined with overlapping alignment and lightweight loop closure optimization. Without requiring camera calibration, depth supervision or model retraining, VGGT-Long achieves trajectory and reconstruction performance comparable to traditional methods. We evaluate our method on KITTI, Waymo, and Virtual KITTI datasets. VGGT-Long not only runs successfully on long RGB sequences where foundation models typically fail, but also produces accurate and consistent geometry across various conditions. Our results highlight the potential of leveraging foundation models for scalable monocular 3D scene in real-world settings, especially for autonomous driving scenarios. Code is available at https://github.com/DengKaiCQ/VGGT-Long.
Chinese: VGGT-Long 是一种创新系统,通过分块处理和轻量级优化克服内存限制,实现了户外环境千米级单目三维重建,无需相机标定或模型重训练即可达到与传统方法相当的性能。
English: VGGT-Long is a novel system that overcomes memory limitations to enable kilometer-scale monocular 3D reconstruction in outdoor environments through chunk-based processing and lightweight optimization, achieving performance comparable to traditional methods without requiring camera calibration or retraining.
Authors:Elza Strazda, Gerasimos Spanakis
Abstract:
Warning: This paper contains explicit statements of offensive stereotypes which might be upsetting.
Language models are prone to exhibiting biases, further amplifying unfair and harmful stereotypes. Given the fast-growing popularity and wide application of these models, it is necessary to ensure safe and fair language models. As of recent considerable attention has been paid to measuring bias in language models, yet the majority of studies have focused only on English language. A Dutch version of the US-specific CrowS-Pairs dataset for measuring bias in Dutch language models is introduced. The resulting dataset consists of 1463 sentence pairs that cover bias in 9 categories, such as Sexual orientation, Gender and Disability. The sentence pairs are composed of contrasting sentences, where one of the sentences concerns disadvantaged groups and the other advantaged groups. Using the Dutch CrowS-Pairs dataset, we show that various language models, BERTje, RobBERT, multilingual BERT, GEITje and Mistral-7B exhibit substantial bias across the various bias categories. Using the English and French versions of the CrowS-Pairs dataset, bias was evaluated in English (BERT and RoBERTa) and French (FlauBERT and CamemBERT) language models, and it was shown that English models exhibit the most bias, whereas Dutch models the least amount of bias. Additionally, results also indicate that assigning a persona to a language model changes the level of bias it exhibits. These findings highlight the variability of bias across languages and contexts, suggesting that cultural and linguistic factors play a significant role in shaping model biases.
中文: 本研究引入荷兰语版CrowS-Pairs数据集评估语言模型偏见,发现各类模型均存在显著偏见,其中英语模型偏见程度最高而荷兰语模型最低,同时角色设定也会改变模型的偏见表现。
English: This study introduces a Dutch version of the CrowS-Pairs dataset to measure bias in Dutch language models, revealing that models exhibit significant bias across categories, with English models showing the most bias and Dutch models the least, while persona assignment also influences bias levels.
Authors:Kahim Wong, Jicheng Zhou, Haiwei Wu, Yain-Whar Si, Jiantao Zhou
Abstract:
The advancement of image editing tools has enabled malicious manipulation of sensitive document images, underscoring the need for robust document image forgery detection.Though forgery detectors for natural images have been extensively studied, they struggle with document images, as the tampered regions can be seamlessly blended into the uniform document background (BG) and structured text. On the other hand, existing document-specific methods lack sufficient robustness against various degradations, which limits their practical deployment. This paper presents ADCD-Net, a robust document forgery localization model that adaptively leverages the RGB/DCT forensic traces and integrates key characteristics of document images. Specifically, to address the DCT traces' sensitivity to block misalignment, we adaptively modulate the DCT feature contribution based on a predicted alignment score, resulting in much improved resilience to various distortions, including resizing and cropping. Also, a hierarchical content disentanglement approach is proposed to boost the localization performance via mitigating the text-BG disparities. Furthermore, noticing the predominantly pristine nature of BG regions, we construct a pristine prototype capturing traces of untampered regions, and eventually enhance both the localization accuracy and robustness. Our proposed ADCD-Net demonstrates superior forgery localization performance, consistently outperforming state-of-the-art methods by 20.79\% averaged over 5 types of distortions. The code is available at https://github.com/KAHIMWONG/ACDC-Net.
中文: 本文提出ADCD-Net模型,通过自适应融合RGB/DCT取证特征并整合文档图像特性,在多种失真条件下实现卓越的伪造区域定位性能,显著优于现有方法。
English: This paper introduces ADCD-Net, a robust document forgery detection model that adaptively combines RGB/DCT forensic traces with document characteristics to significantly outperform existing methods in localizing tampered regions under various distortions.
Authors:Xian Mo, Fei Liu, Rui Tang, Jintao, Gao, Hao Liu
Abstract:
Multimedia recommendations aim to use rich multimedia content to enhance historical user-item interaction information, which can not only indicate the content relatedness among items but also reveal finer-grained preferences of users. In this paper, we propose a Knowledge-aware Diffusion-Enhanced architecture using contrastive learning paradigms (KDiffE) for multimedia recommendations. Specifically, we first utilize original user-item graphs to build an attention-aware matrix into graph neural networks, which can learn the importance between users and items for main view construction. The attention-aware matrix is constructed by adopting a random walk with a restart strategy, which can preserve the importance between users and items to generate aggregation of attention-aware node features. Then, we propose a guided diffusion model to generate strongly task-relevant knowledge graphs with less noise for constructing a knowledge-aware contrastive view, which utilizes user embeddings with an edge connected to an item to guide the generation of strongly task-relevant knowledge graphs for enhancing the item's semantic information. We perform comprehensive experiments on three multimedia datasets that reveal the effectiveness of our KDiffE and its components on various state-of-the-art methods. Our source codes are available https://github.com/1453216158/KDiffE.
中文: 本文提出KDiffE模型,通过知识感知的扩散增强架构和对比学习,结合用户-物品交互图与语义知识图谱,有效提升了多媒体推荐系统的性能。
English: This paper introduces KDiffE, a knowledge-aware diffusion-enhanced model that leverages contrastive learning and graph neural networks to improve multimedia recommendations by integrating user-item interactions and semantic knowledge graphs.
Authors:Kuo-Cheng Wu, Guohang Zhuang, Jinyang Huang, Xiang Zhang, Wanli Ouyang, Yan Lu
Abstract:
Super-resolution (SR) advances astronomical imaging by enabling cost-effective high-resolution capture, crucial for detecting faraway celestial objects and precise structural analysis. However, existing datasets for astronomical SR (ASR) exhibit three critical limitations: flux inconsistency, object-crop setting, and insufficient data diversity, significantly impeding ASR development. We propose STAR, a large-scale astronomical SR dataset containing 54,738 flux-consistent star field image pairs covering wide celestial regions. These pairs combine Hubble Space Telescope high-resolution observations with physically faithful low-resolution counterparts generated through a flux-preserving data generation pipeline, enabling systematic development of field-level ASR models. To further empower the ASR community, STAR provides a novel Flux Error (FE) to evaluate SR models in physical view. Leveraging this benchmark, we propose a Flux-Invariant Super Resolution (FISR) model that could accurately infer the flux-consistent high-resolution images from input photometry, suppressing several SR state-of-the-art methods by 24.84% on a novel designed flux consistency metric, showing the priority of our method for astrophysics. Extensive experiments demonstrate the effectiveness of our proposed method and the value of our dataset. Code and models are available at https://github.com/GuoCheng12/STAR.
中文: 提出的STAR数据集通过提供通量一致的图像对并引入新的通量误差度量,解决了天文超分辨率中的关键局限,同时配套的FISR模型在保持物理通量特性方面展现出优越性能。
English: The proposed STAR dataset addresses critical limitations in astronomical super-resolution by providing flux-consistent image pairs and introducing a novel Flux Error metric, while the accompanying FISR model demonstrates superior performance in preserving physical flux properties.
Authors:Hailin Yue, Hulin Kuang, Jin Liu, Junjian Li, Lanlan Wang, Mengshen He, Jianxin Wang
Abstract:
Accurately predicting the survival of cancer patients is crucial for personalized treatment. However, existing studies focus solely on the relationships between samples with known survival risks, without fully leveraging the value of censored samples. Furthermore, these studies may suffer performance degradation in modality-missing scenarios and even struggle during the inference process. In this study, we propose a bipartite patient-modality graph learning with event-conditional modelling of censoring for cancer survival prediction (CenSurv). Specifically, we first use graph structure to model multimodal data and obtain representation. Then, to alleviate performance degradation in modality-missing scenarios, we design a bipartite graph to simulate the patient-modality relationship in various modality-missing scenarios and leverage a complete-incomplete alignment strategy to explore modality-agnostic features. Finally, we design a plug-and-play event-conditional modeling of censoring (ECMC) that selects reliable censored data using dynamic momentum accumulation confidences, assigns more accurate survival times to these censored data, and incorporates them as uncensored data into training. Comprehensive evaluations on 5 publicly cancer datasets showcase the superiority of CenSurv over the best state-of-the-art by 3.1% in terms of the mean C-index, while also exhibiting excellent robustness under various modality-missing scenarios. In addition, using the plug-and-play ECMC module, the mean C-index of 8 baselines increased by 1.3% across 5 datasets. Code of CenSurv is available at https://github.com/yuehailin/CenSurv.
Chinese: 本研究提出CenSurv模型,通过构建患者-模态二分图并结合事件条件删失建模,有效利用删失数据并在模态缺失场景下保持鲁棒性,将癌症生存预测的平均C指数提升了3.1%,显著优于现有最佳方法。
English: This study introduces CenSurv, a bipartite patient-modality graph learning model with event-conditional censoring modeling, which enhances cancer survival prediction by effectively utilizing censored data and maintaining robustness in modality-missing scenarios, achieving a 3.1% improvement in mean C-index over state-of-the-art methods.
Authors:Jinquan Guan, Junhong Guo, Qi Chen, Jian Chen, Yongkang Cai, Yilin He, Zhiquan Huang, Yan Wang, Yutong Xie
Abstract:
Oral Squamous Cell Carcinoma (OSCC) is a prevalent and aggressive malignancy where deep learning-based computer-aided diagnosis and prognosis can enhance clinical assessments.However, existing publicly available OSCC datasets often suffer from limited patient cohorts and a restricted focus on either diagnostic or prognostic tasks, limiting the development of comprehensive and generalizable models. To bridge this gap, we introduce Multi-OSCC, a new histopathology image dataset comprising 1,325 OSCC patients, integrating both diagnostic and prognostic information to expand existing public resources. Each patient is represented by six high resolution histopathology images captured at x200, x400, and x1000 magnifications-two per magnification-covering both the core and edge tumor regions.The Multi-OSCC dataset is richly annotated for six critical clinical tasks: recurrence prediction (REC), lymph node metastasis (LNM), tumor differentiation (TD), tumor invasion (TI), cancer embolus (CE), and perineural invasion (PI). To benchmark this dataset, we systematically evaluate the impact of different visual encoders, multi-image fusion techniques, stain normalization, and multi-task learning frameworks. Our analysis yields several key insights: (1) The top-performing models achieve excellent results, with an Area Under the Curve (AUC) of 94.72% for REC and 81.23% for TD, while all tasks surpass 70% AUC; (2) Stain normalization benefits diagnostic tasks but negatively affects recurrence prediction; (3) Multi-task learning incurs a 3.34% average AUC degradation compared to single-task models in our multi-task benchmark, underscoring the challenge of balancing multiple tasks in our dataset. To accelerate future research, we publicly release the Multi-OSCC dataset and baseline models at https://github.com/guanjinquan/OSCC-PathologyImageDataset.
中文: Multi-OSCC数据集通过整合1325名口腔癌患者的诊断与预后信息,填补了现有公共数据的不足,在复发预测中取得94.72%的AUC优异性能,同时揭示了染色标准化对预后预测的负面影响及多任务学习的平衡难题。
English: The Multi-OSCC dataset addresses limitations in existing oral cancer data by providing comprehensive histopathology images from 1,325 patients with diagnostic and prognostic annotations, achieving up to 94.72% AUC in recurrence prediction while revealing key insights about stain normalization and multi-task learning challenges.
Authors:Yumeng Wang, Zengyi Wo, Wenjun Wang, Xingcheng Fu, Minglai Shao
Abstract:
Graph Neural Networks (GNNs) excel in node classification tasks but often assume homophily, where connected nodes share similar labels. This assumption does not hold in many real-world heterophilic graphs. Existing models for heterophilic graphs primarily rely on pairwise relationships, overlooking multi-scale information from higher-order structures. This leads to suboptimal performance, particularly under noise from conflicting class information across nodes. To address these challenges, we propose HPGNN, a novel model integrating Higher-order Personalized PageRank with Graph Neural Networks. HPGNN introduces an efficient high-order approximation of Personalized PageRank (PPR) to capture long-range and multi-scale node interactions. This approach reduces computational complexity and mitigates noise from surrounding information. By embedding higher-order structural information into convolutional networks, HPGNN effectively models key interactions across diverse graph dimensions. Extensive experiments on benchmark datasets demonstrate HPGNN's effectiveness. The model achieves better performance than five out of seven state-of-the-art methods on heterophilic graphs in downstream tasks while maintaining competitive performance on homophilic graphs. HPGNN's ability to balance multi-scale information and robustness to noise makes it a versatile solution for real-world graph learning challenges. Codes are available at https://github.com/streetcorner/HPGNN.
中文: HPGNN模型将高阶个性化PageRank与图神经网络相结合,有效捕捉多尺度结构信息,在异配性图数据上表现优异且具备抗噪能力。
English: HPGNN integrates higher-order Personalized PageRank with Graph Neural Networks to effectively capture multi-scale structural information, demonstrating superior performance on heterophilic graphs while maintaining robustness against noise.
Authors:Yumeng Wang, Zengyi Wo, Wenjun Wang, Xingcheng Fu, Minglai Shao
Abstract:
Graph Neural Networks (GNNs) excel in node classification tasks but often assume homophily, where connected nodes share similar labels. This assumption does not hold in many real-world heterophilic graphs. Existing models for heterophilic graphs primarily rely on pairwise relationships, overlooking multi-scale information from higher-order structures. This leads to suboptimal performance, particularly under noise from conflicting class information across nodes. To address these challenges, we propose HPGNN, a novel model integrating Higher-order Personalized PageRank with Graph Neural Networks. HPGNN introduces an efficient high-order approximation of Personalized PageRank (PPR) to capture long-range and multi-scale node interactions. This approach reduces computational complexity and mitigates noise from surrounding information. By embedding higher-order structural information into convolutional networks, HPGNN effectively models key interactions across diverse graph dimensions. Extensive experiments on benchmark datasets demonstrate HPGNN's effectiveness. The model achieves better performance than five out of seven state-of-the-art methods on heterophilic graphs in downstream tasks while maintaining competitive performance on homophilic graphs. HPGNN's ability to balance multi-scale information and robustness to noise makes it a versatile solution for real-world graph learning challenges. Codes are available at https://github.com/streetcorner/HPGNN.
中文: HPGNN模型将高阶个性化PageRank与图神经网络相结合,有效捕捉多尺度结构信息,在异配性图数据上表现优异且具备抗噪能力。
English: HPGNN integrates higher-order Personalized PageRank with Graph Neural Networks to effectively capture multi-scale structural information, demonstrating superior performance on heterophilic graphs while maintaining robustness against noise.
Authors:Joseph De Mathia, Carlos Francisco Moreno-GarcÃa
Abstract:
In an era where wearable technology is reshaping applications, Scene Text Detection and Recognition (STDR) becomes a straightforward choice through the lens of egocentric vision. Leveraging Meta's Project Aria smart glasses, this paper investigates how environmental variables, such as lighting, distance, and resolution, affect the performance of state-of-the-art STDR algorithms in real-world scenarios. We introduce a novel, custom-built dataset captured under controlled conditions and evaluate two OCR pipelines: EAST with CRNN, and EAST with PyTesseract. Our findings reveal that resolution and distance significantly influence recognition accuracy, while lighting plays a less predictable role. Notably, image upscaling emerged as a key pre-processing technique, reducing Character Error Rate (CER) from 0.65 to 0.48. We further demonstrate the potential of integrating eye-gaze tracking to optimise processing efficiency by focusing on user attention zones. This work not only benchmarks STDR performance under realistic conditions but also lays the groundwork for adaptive, user-aware AR systems. Our contributions aim to inspire future research in robust, context-sensitive text recognition for assistive and research-oriented applications, such as asset inspection and nutrition analysis. The code is available at https://github.com/josepDe/Project_Aria_STR.
中文摘要:本研究利用Meta的Project Aria智能眼镜评估光照、距离和分辨率对场景文本检测与识别的影响,发现分辨率和距离显著影响识别准确率而光照作用较难预测,并证明图像超分辨率和眼动追踪可通过聚焦用户注视区域提升自适应增强现实系统的性能。
English Summary: This study uses Meta's Project Aria smart glasses to evaluate how lighting, distance, and resolution affect Scene Text Detection and Recognition, finding that resolution and distance significantly impact accuracy while lighting is less predictable, and demonstrates that image upscaling and eye-gaze tracking can enhance performance for adaptive AR systems.
Authors:Kailai Zhou, Fuqiang Yang, Shixian Wang, Bihan Wen, Chongde Zi, Linsen Chen, Qiu Shen, Xun Cao
Abstract:
RGB-Thermal (RGBT) multispectral vision is essential for robust perception in complex environments. Most RGBT tasks follow a case-by-case research paradigm, relying on manually customized models to learn task-oriented representations. Nevertheless, this paradigm is inherently constrained by artificial inductive bias, modality bias, and data bottleneck. To address these limitations, we make the initial attempt to build a Generalized RGBT MultiSpectral foundation model (M-SpecGene), which aims to learn modality-invariant representations from large-scale broad data in a self-supervised manner. M-SpecGene provides new insights into multispectral fusion and integrates prior case-by-case studies into a unified paradigm. Considering the unique characteristic of information imbalance in RGBT data, we introduce the Cross-Modality Structural Sparsity (CMSS) metric to quantify the information density across two modalities. Then we develop the GMM-CMSS progressive masking strategy to facilitate a flexible, easy-to-hard, and object-centric pre-training process. Comprehensive experiments validate M-SpecGene's generalizability across eleven datasets for four RGBT downstream tasks. The code will be available at https://github.com/CalayZhou/M-SpecGene.
中文: 研究者提出M-SpecGene通用多光谱基础模型,通过跨模态结构稀疏性掩码策略学习模态不变表征,在四大类下游任务的十一个数据集上验证了其泛化能力。
English: Researchers introduce M-SpecGene, a generalized RGB-thermal foundation model that learns modality-invariant representations through a novel cross-modality structural sparsity masking strategy, demonstrating strong performance across diverse downstream tasks.
Authors:Xianze Fang, Jingnan Gao, Zhe Wang, Zhuo Chen, Xingyu Ren, Jiangjing Lyu, Qiaomu Ren, Zhonglei Yang, Xiaokang Yang, Yichao Yan, Chengfei Lyu
Abstract:
Recent advances in dense 3D reconstruction have led to significant progress, yet achieving accurate unified geometric prediction remains a major challenge. Most existing methods are limited to predicting a single geometry quantity from input images. However, geometric quantities such as depth, surface normals, and point maps are inherently correlated, and estimating them in isolation often fails to ensure consistency, thereby limiting both accuracy and practical applicability. This motivates us to explore a unified framework that explicitly models the structural coupling among different geometric properties to enable joint regression. In this paper, we present Dens3R, a 3D foundation model designed for joint geometric dense prediction and adaptable to a wide range of downstream tasks. Dens3R adopts a two-stage training framework to progressively build a pointmap representation that is both generalizable and intrinsically invariant. Specifically, we design a lightweight shared encoder-decoder backbone and introduce position-interpolated rotary positional encoding to maintain expressive power while enhancing robustness to high-resolution inputs. By integrating image-pair matching features with intrinsic invariance modeling, Dens3R accurately regresses multiple geometric quantities such as surface normals and depth, achieving consistent geometry perception from single-view to multi-view inputs. Additionally, we propose a post-processing pipeline that supports geometrically consistent multi-view inference. Extensive experiments demonstrate the superior performance of Dens3R across various dense 3D prediction tasks and highlight its potential for broader applications.
中文: Dens3R作为一种统一的三维基础模型,通过共享编码器-解码器框架联合回归深度与表面法向量等关联几何量,实现了从单视图到多视图输入的一致性几何感知。
English: Dens3R is a unified 3D foundation model that jointly regresses correlated geometric quantities like depth and surface normals through a shared encoder-decoder framework, achieving consistent geometry perception from single to multi-view inputs.
Authors:Danil Gusak, Anna Volodkevich, Anton Klenitskiy, Alexey Vasilev, Evgeny Frolov
Abstract:
Modern sequential recommender systems, ranging from lightweight transformer-based variants to large language models, have become increasingly prominent in academia and industry due to their strong performance in the next-item prediction task. Yet common evaluation protocols for sequential recommendations remain insufficiently developed: they often fail to reflect the corresponding recommendation task accurately, or are not aligned with real-world scenarios.
Although the widely used leave-one-out split matches next-item prediction, it permits the overlap between training and test periods, which leads to temporal leakage and unrealistically long test horizon, limiting real-world relevance. Global temporal splitting addresses these issues by evaluating on distinct future periods. However, its applications to sequential recommendations remain loosely defined, particularly in terms of selecting target interactions and constructing a validation subset that provides necessary consistency between validation and test metrics.
In this paper, we demonstrate that evaluation outcomes can vary significantly across splitting strategies, influencing model rankings and practical deployment decisions. To improve reproducibility in both academic and industrial settings, we systematically compare different splitting strategies for sequential recommendations across multiple datasets and established baselines. Our findings show that prevalent splits, such as leave-one-out, may be insufficiently aligned with more realistic evaluation strategies. Code: https://github.com/monkey0head/time-to-split
中文摘要:当前序列推荐系统的评估方法,如留一法分割,常存在时间泄露和不切实际的测试周期问题,因此需要采用更贴近实际的策略,如全局时间分割,以确保模型评估的准确性。
English Summary: Current evaluation methods for sequential recommender systems, like leave-one-out splitting, often suffer from temporal leakage and unrealistic test horizons, prompting a need for more realistic strategies such as global temporal splitting to ensure accurate model assessments.
Authors:Tianze Xu, Pengrui Lu, Lyumanshan Ye, Xiangkun Hu, Pengfei Liu
Abstract:
The emergence of deep research systems presents significant capabilities in problem-solving, extending from basic queries to sophisticated research tasks. However, existing benchmarks primarily evaluate these systems as agents for web retrieval and report generation, overlooking their potential to discover novel insights on the frontiers of scientific research. To address this gap, we introduce ResearcherBench, the first benchmark focused on evaluating the capabilities of these advanced, agentic systems - which we refer to as Deep AI Research Systems (DARS) - on frontier AI scientific questions. We compiled a dataset of 65 research questions expertly selected from real-world scientific scenarios such as laboratory discussions and interviews, spanning 35 different AI subjects and categorized into three types: technical details, literature review, and open consulting. Our dual evaluation framework combines rubric assessment, which uses expert-designed criteria to evaluate insight quality, with factual assessment, which measures citation accuracy (faithfulness) and coverage (groundedness). We evaluated several leading commercial DARS and baseline systems. Results show that OpenAI Deep Research and Gemini Deep Research significantly outperform other systems, with particular strength in open-ended consulting questions. Such capabilities represent a meaningful step toward AI self-improvement, aligning with the vision of ASI for AI. We open-source ResearcherBench to provide a standardized platform for promoting the development of next-generation AI research assistants, hoping to foster a new perspective in AI research evaluation for a novel pattern of scientific collaboration: https://github.com/GAIR-NLP/ResearcherBench.
中文摘要:ResearcherBench是首个针对前沿科学问题评估深度AI研究系统(DARS)的基准,通过专家设计的评估标准与事实准确性测量相结合,推动AI研究能力向自主科学发现发展。
English Summary: ResearcherBench is introduced as the first benchmark to evaluate Deep AI Research Systems (DARS) on frontier scientific questions, combining expert-designed rubric assessments with factual accuracy measurements to advance AI research capabilities.
Authors:Yu Wang, Bo Dang, Wanchun Li, Wei Chen, Yansheng Li
Abstract:
With the increasing resolution of remote sensing imagery (RSI), large-size RSI has emerged as a vital data source for high-precision vector mapping of geographic objects. Existing methods are typically constrained to processing small image patches, which often leads to the loss of contextual information and produces fragmented vector outputs. To address these, this paper introduces HoliTracer, the first framework designed to holistically extract vectorized geographic objects from large-size RSI. In HoliTracer, we enhance segmentation of large-size RSI using the Context Attention Net (CAN), which employs a local-to-global attention mechanism to capture contextual dependencies. Furthermore, we achieve holistic vectorization through a robust pipeline that leverages the Mask Contour Reformer (MCR) to reconstruct polygons and the Polygon Sequence Tracer (PST) to trace vertices. Extensive experiments on large-size RSI datasets, including buildings, water bodies, and roads, demonstrate that HoliTracer outperforms state-of-the-art methods. Our code and data are available in https://github.com/vvangfaye/HoliTracer.
中文:HoliTracer是一种创新框架,通过上下文注意力网络增强分割,并采用稳健流程实现精确矢量化,能够从大幅遥感影像中整体提取地理对象矢量,性能优于现有方法。
English: HoliTracer is a novel framework that holistically extracts vectorized geographic objects from large-size remote sensing imagery using Context Attention Net for enhanced segmentation and a robust pipeline for accurate vectorization, outperforming existing methods.
Authors:Chao Zhou, Tianyi Wei, Nenghai Yu
Abstract:
Recent advancements in unified image generation models, such as OmniGen, have enabled the handling of diverse image generation and editing tasks within a single framework, accepting multimodal, interleaved texts and images in free form. This unified architecture eliminates the need for text encoders, greatly reducing model complexity and standardizing various image generation and editing tasks, making it more user-friendly. However, we found that it suffers from text instruction neglect, especially when the text instruction contains multiple sub-instructions. To explore this issue, we performed a perturbation analysis on the input to identify critical steps and layers. By examining the cross-attention maps of these key steps, we observed significant conflicts between neglected sub-instructions and the activations of the input image. In response, we propose Self-Adaptive Attention Scaling (SaaS), a method that leverages the consistency of cross-attention between adjacent timesteps to dynamically scale the attention activation for each sub-instruction. Our SaaS enhances instruction-following fidelity without requiring additional training or test-time optimization. Experimental results on instruction-based image editing and visual conditional image generation validate the effectiveness of our SaaS, showing superior instruction-following fidelity over existing methods. The code is available https://github.com/zhouchao-ops/SaaS.
Chinese: 提出的自适应注意力缩放(SaaS)方法通过动态调整每个子指令的注意力激活,有效解决了统一图像生成模型中的文本指令忽略问题,无需额外训练即可显著提升指令遵循的准确性。
English: The proposed Self-Adaptive Attention Scaling (SaaS) method effectively resolves text instruction neglect in unified image generation models by dynamically scaling attention activation for each sub-instruction, significantly improving instruction-following fidelity without requiring additional training.
Authors:Shreelekha Revankar, Utkarsh Mall, Cheng Perng Phoo, Kavita Bala, Bharath Hariharan
Abstract:
Natural disasters cause devastating damage to communities and infrastructure every year. Effective disaster response is hampered by the difficulty of accessing affected areas during and after events. Remote sensing has allowed us to monitor natural disasters in a remote way. More recently there have been advances in computer vision and deep learning that help automate satellite imagery analysis, However, they remain limited by their narrow focus on specific disaster types, reliance on manual expert interpretation, and lack of datasets with sufficient temporal granularity or natural language annotations for tracking disaster progression. We present MONITRS, a novel multimodal dataset of more than 10,000 FEMA disaster events with temporal satellite imagery and natural language annotations from news articles, accompanied by geotagged locations, and question-answer pairs. We demonstrate that fine-tuning existing MLLMs on our dataset yields significant performance improvements for disaster monitoring tasks, establishing a new benchmark for machine learning-assisted disaster response systems. Code can be found at: https://github.com/ShreelekhaR/MONITRS
中文:MONITRS数据集通过整合多模态卫星图像与自然语言标注,突破了现有灾害监测方法的局限,显著提升了机器学习模型在灾害响应任务中的性能表现。
English: The MONITRS dataset addresses limitations in current disaster monitoring by providing multimodal satellite imagery and natural language annotations, enabling significant performance improvements in machine learning models for disaster response.
Authors:Wentao Xiang, Haoxian Tan, Cong Wei, Yujie Zhong, Dengjie Li, Yujiu Yang
Abstract:
Perception is a fundamental task in the field of computer vision, encompassing a diverse set of subtasks that can be systematically categorized into four distinct groups based on two dimensions: prediction type and instruction type. Notably, existing researches often focus solely on a limited subset of these potential combinations, which constrains their applicability and versatility across various contexts. In response to this challenge, we present MVP-LM, a Multi-granular and Versatile Perception framework incorporating Visual Large Language Model. Our framework is designed to integrate both word-based and sentence-based perception tasks alongside box and mask predictions within a single architecture. MVP-LM features an innovative multi-granularity decoder in conjunction with a CoT-inspired dataset unification strategy, enabling seamless supervised fine-tuning across a wide spectrum of tasks, including but not limited to panoptic segmentation, detection, grounding, and referring expression segmentation. Furthermore, we introduce a query enhancement strategy aimed at harnessing the decoding and generative capabilities inherent in VLLMs. Extensive experiments conducted across a range of benchmarks in both word-based and sentence-based perception tasks substantiate the efficacy of our framework. The code will be available at https://github.com/xiangwentao666/MVP-LM.
中文: 该摘要提出了MVP-LM框架,通过多粒度解码器和数据集统一策略,将基于词语和句子的感知任务与边界框及掩码预测整合于单一架构,有效提升了计算机视觉任务的通用性和性能表现。
English: The abstract introduces MVP-LM, a unified framework that integrates word- and sentence-based perception tasks with box and mask predictions using a multi-granularity decoder and dataset unification strategy to enhance versatility across computer vision applications.
Authors:Pengwei Jin, Di Huang, Chongxiao Li, Shuyao Cheng, Yang Zhao, Xinyao Zheng, Jiaguo Zhu, Shuyi Xing, Bohan Dou, Rui Zhang, Zidong Du, Qi Guo, Xing Hu
Abstract:
The automatic generation of Verilog code using Large Language Models (LLMs) has garnered significant interest in hardware design automation. However, existing benchmarks for evaluating LLMs in Verilog generation fall short in replicating real-world design workflows due to their designs' simplicity, inadequate design specifications, and less rigorous verification environments. To address these limitations, we present RealBench, the first benchmark aiming at real-world IP-level Verilog generation tasks. RealBench features complex, structured, real-world open-source IP designs, multi-modal and formatted design specifications, and rigorous verification environments, including 100% line coverage testbenches and a formal checker. It supports both module-level and system-level tasks, enabling comprehensive assessments of LLM capabilities. Evaluations on various LLMs and agents reveal that even one of the best-performing LLMs, o1-preview, achieves only a 13.3% pass@1 on module-level tasks and 0% on system-level tasks, highlighting the need for stronger Verilog generation models in the future. The benchmark is open-sourced at https://github.com/IPRC-DIP/RealBench.
中文: RealBench是首个面向真实IP级Verilog代码生成的基准测试,通过复杂设计、多模态规范和严格验证环境,解决了现有评估方法的不足,凸显了当前大语言模型在硬件设计领域的性能局限。
English: RealBench is the first benchmark designed for real-world IP-level Verilog generation, featuring complex designs, comprehensive specifications, and rigorous verification to address limitations in existing LLM evaluation methods.
Authors:Zitong Xu, Huiyu Duan, Bingnan Liu, Guangji Ma, Jiarui Wang, Liu Yang, Shiqi Gao, Xiaoyu Wang, Jia Wang, Xiongkuo Min, Guangtao Zhai, Weisi Lin
Abstract:
The rapid advancement of Text-guided Image Editing (TIE) enables image modifications through text prompts. However, current TIE models still struggle to balance image quality, editing alignment, and consistency with the original image, limiting their practical applications. Existing TIE evaluation benchmarks and metrics have limitations on scale or alignment with human perception. To this end, we introduce EBench-18K, the first large-scale image Editing Benchmark including 18K edited images with fine-grained human preference annotations for evaluating TIE. Specifically, EBench-18K includes 1,080 source images with corresponding editing prompts across 21 tasks, 18K+ edited images produced by 17 state-of-the-art TIE models, 55K+ mean opinion scores (MOSs) assessed from three evaluation dimensions, and 18K+ question-answering (QA) pairs. Based on EBench-18K, we employ outstanding LMMs to assess edited images, while the evaluation results, in turn, provide insights into assessing the alignment between the LMMs' understanding ability and human preferences. Then, we propose LMM4Edit, a LMM-based metric for evaluating image Editing models from perceptual quality, editing alignment, attribute preservation, and task-specific QA accuracy in an all-in-one manner. Extensive experiments show that LMM4Edit achieves outstanding performance and aligns well with human preference. Zero-shot validation on the other datasets also shows the generalization ability of our model. The dataset and code are available at https://github.com/IntMeGroup/LMM4Edit.
中文: 本文提出了包含18K编辑图像和人工标注的EBench-18K基准数据集,并开发了LMM4Edit评估指标,该指标能全面评估编辑效果且与人类偏好高度一致,展现出优秀的泛化能力。
English: This paper introduces EBench-18K, a comprehensive benchmark with 18K edited images and human annotations, and proposes LMM4Edit, an all-in-one evaluation metric that aligns well with human preferences and demonstrates strong generalization.
Authors:Fansheng Zeng, Bineng Zhong, Haiying Xia, Yufei Tan, Xiantao Hu, Liangtao Shi, Shuxiang Song
Abstract:
Contextual reasoning with constraints is crucial for enhancing temporal consistency in cross-frame modeling for visual tracking. However, mainstream tracking algorithms typically associate context by merely stacking historical information without explicitly supervising the association process, making it difficult to effectively model the target's evolving dynamics. To alleviate this problem, we propose RSTrack, which explicitly models and supervises context reasoning via three core mechanisms. \textit{1) Context Reasoning Mechanism}: Constructs a target state reasoning pipeline, converting unconstrained contextual associations into a temporal reasoning process that predicts the current representation based on historical target states, thereby enhancing temporal consistency. \textit{2) Forward Supervision Strategy}: Utilizes true target features as anchors to constrain the reasoning pipeline, guiding the predicted output toward the true target distribution and suppressing drift in the context reasoning process. \textit{3) Efficient State Modeling}: Employs a compression-reconstruction mechanism to extract the core features of the target, removing redundant information across frames and preventing ineffective contextual associations. These three mechanisms collaborate to effectively alleviate the issue of contextual association divergence in traditional temporal modeling. Experimental results show that RSTrack achieves state-of-the-art performance on multiple benchmark datasets while maintaining real-time running speeds. Our code is available at https://github.com/GXNU-ZhongLab/RSTrack.
中文摘要:RSTrack通过上下文推理机制、前向监督策略和高效状态建模三个核心机制,显式监督上下文关联过程,有效提升视觉跟踪的时序一致性,并在多个基准数据集上取得领先性能。
English Summary: RSTrack enhances visual tracking by explicitly modeling and supervising context reasoning through three mechanisms—context reasoning, forward supervision, and efficient state modeling—to improve temporal consistency and achieve state-of-the-art performance.
Authors:Haoyin Yan, Jie Zhang, Chengqian Jiang, Shuang Zhang
Abstract:
Multichannel speech enhancement (SE) aims to restore clean speech from noisy measurements by leveraging spatiotemporal signal features. In ad-hoc array conditions, microphone invariance (MI) requires systems to handle different microphone numbers and array geometries. From a practical perspective, multichannel recordings inevitably increase the computational burden for edge-device applications, highlighting the necessity of lightweight and efficient deployments. In this work, we propose a lightweight attentive beamforming network (LABNet) to integrate MI in a low-complexity real-time SE system. We design a three-stage framework for efficient intra-channel modeling and inter-channel interaction. A cross-channel attention module is developed to aggregate features from each channel selectively. Experimental results demonstrate our LABNet achieves impressive performance with ultra-light resource overhead while maintaining the MI, indicating great potential for ad-hoc array processing. The code is available:https://github.com/Jokejiangv/LABNet.git
中文摘要:本研究提出LABNet,一种轻量级注意力波束成形网络,可在保持麦克风不变性的同时,以极低计算成本实现高效多通道语音增强,适用于自组织阵列和边缘设备。
English Summary: The study introduces LABNet, a lightweight attentive beamforming network designed for efficient multichannel speech enhancement that maintains microphone invariance with minimal computational overhead, ideal for ad-hoc arrays and edge devices.
Authors:Nand Kumar Yadav, Rodrigue Rizk, William CW Chen, KC Santosh
Abstract:
Accurate and efficient medical image segmentation is crucial but challenging due to anatomical variability and high computational demands on volumetric data. Recent hybrid CNN-Transformer architectures achieve state-of-the-art results but add significant complexity. In this paper, we propose MLRU++, a Multiscale Lightweight Residual UNETR++ architecture designed to balance segmentation accuracy and computational efficiency. It introduces two key innovations: a Lightweight Channel and Bottleneck Attention Module (LCBAM) that enhances contextual feature encoding with minimal overhead, and a Multiscale Bottleneck Block (M2B) in the decoder that captures fine-grained details via multi-resolution feature aggregation. Experiments on four publicly available benchmark datasets (Synapse, BTCV, ACDC, and Decathlon Lung) demonstrate that MLRU++ achieves state-of-the-art performance, with average Dice scores of 87.57% (Synapse), 93.00% (ACDC), and 81.12% (Lung). Compared to existing leading models, MLRU++ improves Dice scores by 5.38% and 2.12% on Synapse and ACDC, respectively, while significantly reducing parameter count and computational cost. Ablation studies evaluating LCBAM and M2B further confirm the effectiveness of the proposed architectural components. Results suggest that MLRU++ offers a practical and high-performing solution for 3D medical image segmentation tasks. Source code is available at: https://github.com/1027865/MLRUPP
中文: 本文提出的MLRU++轻量级混合架构通过创新的注意力模块和多尺度特征聚合,在保持高分割精度的同时显著降低了计算成本,为三维医学图像分割提供了高效解决方案。
English: This paper introduces MLRU++, a lightweight hybrid architecture for medical image segmentation that achieves state-of-the-art accuracy with improved computational efficiency through novel attention mechanisms and multi-scale feature aggregation.
Authors:Yaofang Liu, Yumeng Ren, Aitor Artola, Yuxuan Hu, Xiaodong Cun, Xiaotong Zhao, Alan Zhao, Raymond H. Chan, Suiyun Zhang, Rui Liu, Dandan Tu, Jean-Michel Morel
Abstract:
The rapid advancement of video diffusion models has been hindered by fundamental limitations in temporal modeling, particularly the rigid synchronization of frame evolution imposed by conventional scalar timestep variables. While task-specific adaptations and autoregressive models have sought to address these challenges, they remain constrained by computational inefficiency, catastrophic forgetting, or narrow applicability. In this work, we present Pusa, a groundbreaking paradigm that leverages vectorized timestep adaptation (VTA) to enable fine-grained temporal control within a unified video diffusion framework. Besides, VTA is a non-destructive adaptation, which means it fully preserves the capabilities of the base model. By finetuning the SOTA Wan2.1-T2V-14B model with VTA, we achieve unprecedented efficiency -- surpassing the performance of Wan-I2V-14B with $\leq$ 1/200 of the training cost (\$500 vs. $\geq$ \$100,000) and $\leq$ 1/2500 of the dataset size (4K vs. $\geq$ 10M samples). Pusa not only sets a new standard for image-to-video (I2V) generation, achieving a VBench-I2V total score of 87.32\% (vs. 86.86\% of Wan-I2V-14B), but also unlocks many zero-shot multi-task capabilities such as start-end frames and video extension -- all without task-specific training. Meanwhile, Pusa can still perform text-to-video generation. Mechanistic analyses reveal that our approach preserves the foundation model's generative priors while surgically injecting temporal dynamics, avoiding the combinatorial explosion inherent to vectorized timesteps. This work establishes a scalable, efficient, and versatile paradigm for next-generation video synthesis, democratizing high-fidelity video generation for research and industry alike. Code is open-sourced at https://github.com/Yaofang-Liu/Pusa-VidGen
中文摘要:Pusa框架通过向量化时间步适配突破视频扩散模型中的时序建模限制,在保持基础模型能力的同时实现了卓越的效率和多功能生成能力。
English Summary: The Pusa framework introduces vectorized timestep adaptation to overcome temporal modeling limitations in video diffusion, achieving superior efficiency and multi-task capabilities while preserving base model performance.
Authors:Andrew Or, Apurva Jain, Daniel Vega-Myhre, Jesse Cai, Charles David Hernandez, Zhenrui Zheng, Driss Guessous, Vasiliy Kuznetsov, Christian Puhrsch, Mark Saroufim, Supriya Rao, Thien Tran, Aleksandar SamardžiÄ
Abstract:
We present TorchAO, a PyTorch-native model optimization framework leveraging quantization and sparsity to provide an end-to-end, training-to-serving workflow for AI models. TorchAO supports a variety of popular model optimization techniques, including FP8 quantized training, quantization-aware training (QAT), post-training quantization (PTQ), and 2:4 sparsity, and leverages a novel tensor subclass abstraction to represent a variety of widely-used, backend agnostic low precision data types, including INT4, INT8, FP8, MXFP4, MXFP6, and MXFP8. TorchAO integrates closely with the broader ecosystem at each step of the model optimization pipeline, from pre-training (TorchTitan) to fine-tuning (TorchTune, Axolotl) to serving (HuggingFace, vLLM, SGLang, ExecuTorch), connecting an otherwise fragmented space in a single, unified workflow. TorchAO has enabled recent launches of the quantized Llama 3.2 1B/3B and LlamaGuard3-8B models and is open-source at https://github.com/pytorch/ao/.
中文: TorchAO是一个基于PyTorch的模型优化框架,通过量化和稀疏化技术提供从训练到服务的端到端工作流,支持多种低精度数据类型并与广泛的开发生态系统紧密集成。
English: TorchAO is a PyTorch-native framework that integrates quantization and sparsity techniques to deliver a seamless end-to-end workflow for optimizing AI models from training to deployment, supporting various data types and ecosystem tools.
Authors:MSR Avinash, Ismael Lachheb
Abstract:
Visual Assessment of Cluster Tendency (VAT) is a widely used unsupervised technique to assess the presence of cluster structure in unlabeled datasets. However, its standard implementation suffers from significant performance limitations due to its O(n^2) time complexity and inefficient memory usage. In this work, we present Fast-VAT, a high-performance reimplementation of the VAT algorithm in Python, augmented with Numba's Just-In-Time (JIT) compilation and Cython's static typing and low-level memory optimizations. Our approach achieves up to 50x speedup over the baseline implementation, while preserving the output fidelity of the original method. We validate Fast-VAT on a suite of real and synthetic datasets -- including Iris, Mall Customers, and Spotify subsets -- and verify cluster tendency using Hopkins statistics, PCA, and t-SNE. Additionally, we compare VAT's structural insights with clustering results from DBSCAN and K-Means to confirm its reliability.
Chinese: Fast-VAT 是 VAT 算法的高性能 Python 重实现,通过 Numba 和 Cython 优化实现了高达 50 倍的加速,同时保持输出精度,并在多个数据集和聚类方法上得到验证。
English: Fast-VAT is an optimized Python reimplementation of the VAT algorithm that uses Numba and Cython to achieve up to 50x speedup while maintaining output accuracy, validated on various datasets and clustering methods.
Authors:Jaehoon Yoo, Wonjung Kim, Seunghoon Hong
Abstract:
Discrete Flow-based Models (DFMs) are powerful generative models for high-quality discrete data but typically suffer from slow sampling speeds due to their reliance on iterative decoding processes. This reliance on a multi-step process originates from the factorization approximation of DFMs, which is necessary for handling high-dimensional data. In this paper, we rigorously characterize the approximation error from factorization using Conditional Total Correlation (TC), which depends on the coupling. To reduce the Conditional TC and enable efficient few-step generation, we propose Rectified Discrete Flow (ReDi), a novel iterative method that reduces factorization error by rectifying the coupling between source and target distributions. We theoretically prove that each ReDi step guarantees a monotonic decreasing Conditional TC, ensuring its convergence. Empirically, ReDi significantly reduces Conditional TC and enables few-step generation. Moreover, we demonstrate that the rectified couplings are well-suited for training efficient one-step models on image generation. ReDi offers a simple and theoretically grounded approach for tackling the few-step challenge, providing a new perspective on efficient discrete data synthesis. Code is available at https://github.com/Ugness/ReDi_discrete
中文摘要:ReDi是一种通过修正耦合关系来减少离散流模型因式分解误差的新方法,能够实现高效少步生成并确保收敛性。
English Summary: ReDi is a novel method that reduces factorization error in Discrete Flow-based Models by rectifying couplings, enabling efficient few-step generation with guaranteed convergence.
Authors:Noah van der Vleuten
Abstract:
Language models for program synthesis are usually trained and evaluated on programming competition datasets (MBPP, APPS). However, these datasets are limited in size and quality, while these language models are extremely data hungry. Additionally, the language models have a misaligned program synthesis process compared to humans. While humans iteratively develop code with the help of a compiler, most program synthesis models currently produce code in one go. To solve these issues, we introduce a bootstrapping algorithm for program synthesis, that supports teaching models how to repair. We show that bootstrapping consistently outperforms regular fine-tuning. Compared to other work, our bootstrapped model performs on par with fine-tuned models that are 68\% larger. Notably, bootstrapping with repairing also improves non-repairing performance compared to regular bootstrapping during inference. However, on our models, repairing during inference is likely inferior to simply sampling the same number of solutions. Furthermore, we find that there are issues with the example test cases in the training portion of the APPS dataset that are valuable to the community, as many repairing and reinforcement learning methods rely on them.
Chinese: 提出的程序合成引导算法通过教授代码修复来提升模型性能,其效果优于标准微调,并能以更小模型实现相当成果,但推理时的修复策略相比多次采样存在局限性。
English: The proposed bootstrapping algorithm for program synthesis enhances model performance by teaching code repair, outperforming standard fine-tuning and achieving comparable results with significantly smaller models, though inference-time repair shows limitations compared to multiple sampling.
Authors:John Wu, Adam Cross, Jimeng Sun
Abstract:
Rare diseases affect 1 in 10 Americans, yet standard ICD coding systems fail to capture these conditions in electronic health records (EHR), leaving crucial information buried in clinical notes. Current approaches struggle with medical abbreviations, miss implicit disease mentions, raise privacy concerns with cloud processing, and lack clinical reasoning abilities. We present Rare Disease Mining Agents (RDMA), a framework that mirrors how medical experts identify rare disease patterns in EHR. RDMA connects scattered clinical observations that together suggest specific rare conditions. By handling clinical abbreviations, recognizing implicit disease patterns, and applying contextual reasoning locally on standard hardware, RDMA reduces privacy risks while improving F1 performance by upwards of 30\% and decreasing inferences costs 10-fold. This approach helps clinicians avoid the privacy risk of using cloud services while accessing key rare disease information from EHR systems, supporting earlier diagnosis for rare disease patients. Available at https://github.com/jhnwu3/RDMA.
中文:RDMA框架通过本地处理电子健康记录中的临床笔记,提升罕见疾病检测的准确性,在保护隐私的同时将F1性能提高30%以上并降低十倍推理成本,有助于实现早期诊断。
English: The RDMA framework improves rare disease detection in EHRs by processing clinical notes locally to enhance privacy, increasing F1 performance by over 30% and cutting inference costs tenfold while enabling earlier diagnoses.
Authors:Jiawei Yang, Tianhong Li, Lijie Fan, Yonglong Tian, Yue Wang
Abstract:
Despite their fundamental role, it remains unclear what properties could make visual tokenizers more effective for generative modeling. We observe that modern generative models share a conceptually similar training objective -- reconstructing clean signals from corrupted inputs such as Gaussian noise or masking -- a process we term denoising. Motivated by this insight, we propose aligning tokenizer embeddings directly with the downstream denoising objective, encouraging latent embeddings to be more easily reconstructed even when heavily corrupted. To achieve this, we introduce the Latent Denoising Tokenizer (l-DeTok), a simple yet effective tokenizer trained to reconstruct clean images from latent embeddings corrupted by interpolative noise and random masking. Extensive experiments on ImageNet 256x256 demonstrate that our tokenizer consistently outperforms standard tokenizers across six representative generative models. Our findings highlight denoising as a fundamental design principle for tokenizer development, and we hope it could motivate new perspectives for future tokenizer design.
Chinese: 本研究提出了潜在去噪分词器(l-DeTok),通过将分词器嵌入与去噪目标对齐,提升从受损输入中重建的能力,在ImageNet 256x256数据集上的多种生成模型中均优于标准分词器。
English: The study introduces the Latent Denoising Tokenizer (l-DeTok), which aligns tokenizer embeddings with the denoising objective to enhance reconstruction from corrupted inputs, consistently outperforming standard tokenizers across various generative models on ImageNet 256x256.
Authors:Zhixiong Zhang, Shuangrui Ding, Xiaoyi Dong, Songxin He, Jianfan Lin, Junsong Tang, Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang
Abstract:
Video Object Segmentation (VOS) is a core task in computer vision, requiring models to track and segment target objects across video frames. Despite notable advances with recent efforts, current techniques still lag behind human capabilities in handling drastic visual variations, occlusions, and complex scene changes. This limitation arises from their reliance on appearance matching, neglecting the human-like conceptual understanding of objects that enables robust identification across temporal dynamics. Motivated by this gap, we propose Segment Concept (SeC), a concept-driven segmentation framework that shifts from conventional feature matching to the progressive construction and utilization of high-level, object-centric representations. SeC employs Large Vision-Language Models (LVLMs) to integrate visual cues across diverse frames, constructing robust conceptual priors. During inference, SeC forms a comprehensive semantic representation of the target based on processed frames, realizing robust segmentation of follow-up frames. Furthermore, SeC adaptively balances LVLM-based semantic reasoning with enhanced feature matching, dynamically adjusting computational efforts based on scene complexity. To rigorously assess VOS methods in scenarios demanding high-level conceptual reasoning and robust semantic understanding, we introduce the Semantic Complex Scenarios Video Object Segmentation benchmark (SeCVOS). SeCVOS comprises 160 manually annotated multi-scenario videos designed to challenge models with substantial appearance variations and dynamic scene transformations. In particular, SeC achieves an 11.8-point improvement over SAM 2.1 on SeCVOS, establishing a new state-of-the-art in concept-aware video object segmentation.
中文: 提出的Segment Concept (SeC)框架通过从外观匹配转向概念驱动推理,利用大型视觉语言模型构建鲁棒的对象表征,并在新的SeCVOS基准测试中实现了最先进的性能。
English: The proposed Segment Concept (SeC) framework advances video object segmentation by shifting from appearance-based matching to concept-driven reasoning, leveraging large vision-language models to construct robust object representations and achieving state-of-the-art performance on the new SeCVOS benchmark.
Authors:Shangke Lyu, Linjuan Wu, Yuchen Yan, Xingyu Wu, Hao Li, Yongliang Shen, Peisheng Jiang, Weiming Lu, Jun Xiao, Yueting Zhuang
Abstract:
Large reasoning models achieve remarkable performance through extensive chain-of-thought generation, yet they suffer from a critical inefficiency: applying uniformly extensive reasoning regardless of problem complexity. We present Hierarchical Budget Policy Optimization (HBPO), a reinforcement learning framework that enables models to learn problem-specific reasoning depths without sacrificing capability. Unlike existing approaches that impose rigid constraints or rely on discrete mode selection, HBPO partitions the exploration space into budget-constrained hierarchies (512-2560 tokens), each with differentiated reward structures that preserve both efficiency incentives and reasoning capabilities. This design addresses a fundamental challenge in efficient reasoning training: traditional length penalties systematically bias models away from necessary long reasoning paths, causing exploration space collapse. Through hierarchical sampling and budget-aware rewards, HBPO maintains exploration diversity while teaching models to recognize when extended deliberation is warranted. Extensive experiments demonstrate that HBPO reduces average token usage by up to 60.6% while improving accuracy by 3.14% across four reasoning benchmarks. Most notably, HBPO exhibits emergent adaptive behavior where models automatically adjust reasoning depth based on problem complexity. Our results suggest that reasoning efficiency and capability are not inherently conflicting, and can be simultaneously optimized through appropriately structured hierarchical training that preserves exploration diversity.
中文: HBPO框架通过分层预算策略让模型根据问题复杂度自适应调整推理深度,在四大基准测试中实现最高60.6%的令牌节省,同时准确率提升3.14%。
English: HBPO is a reinforcement learning framework that enables models to adaptively adjust reasoning depth based on problem complexity, achieving up to 60.6% token reduction while improving accuracy by 3.14% across benchmarks.
Authors:Seth Karten, Wenzhe Li, Zihan Ding, Samuel Kleiner, Yu Bai, Chi Jin
Abstract:
We present the LLM Economist, a novel framework that uses agent-based modeling to design and assess economic policies in strategic environments with hierarchical decision-making. At the lower level, bounded rational worker agents -- instantiated as persona-conditioned prompts sampled from U.S. Census-calibrated income and demographic statistics -- choose labor supply to maximize text-based utility functions learned in-context. At the upper level, a planner agent employs in-context reinforcement learning to propose piecewise-linear marginal tax schedules anchored to the current U.S. federal brackets. This construction endows economic simulacra with three capabilities requisite for credible fiscal experimentation: (i) optimization of heterogeneous utilities, (ii) principled generation of large, demographically realistic agent populations, and (iii) mechanism design -- the ultimate nudging problem -- expressed entirely in natural language. Experiments with populations of up to one hundred interacting agents show that the planner converges near Stackelberg equilibria that improve aggregate social welfare relative to Saez solutions, while a periodic, persona-level voting procedure furthers these gains under decentralized governance. These results demonstrate that large language model-based agents can jointly model, simulate, and govern complex economic systems, providing a tractable test bed for policy evaluation at the societal scale to help build better civilizations.
中文: LLM经济学家是一个基于智能体建模的框架,通过分层决策设计并评估经济政策,其中工人智能体优化劳动供给,规划智能体提出税收方案,实验表明基于大语言模型的智能体能够有效模拟和治理复杂经济系统,为政策评估提供可行测试平台。
English: The LLM Economist is a framework using agent-based modeling to design and evaluate economic policies through hierarchical decision-making, where worker agents optimize labor supply and a planner agent proposes tax schedules, demonstrating that large language model-based agents can effectively simulate and govern complex economic systems for policy evaluation.
Authors:Ghassen Baklouti, Julio Silva-RodrÃguez, Jose Dolz, Houda Bahig, Ismail Ben Ayed
Abstract:
Parameter-efficient fine-tuning (PEFT) of pre-trained foundation models is increasingly attracting interest in medical imaging due to its effectiveness and computational efficiency. Among these methods, Low-Rank Adaptation (LoRA) is a notable approach based on the assumption that the adaptation inherently occurs in a low-dimensional subspace. While it has shown good performance, its implementation requires a fixed and unalterable rank, which might be challenging to select given the unique complexities and requirements of each medical imaging downstream task. Inspired by advancements in natural image processing, we introduce a novel approach for medical image segmentation that dynamically adjusts the intrinsic rank during adaptation. Viewing the low-rank representation of the trainable weight matrices as a singular value decomposition, we introduce an l_1 sparsity regularizer to the loss function, and tackle it with a proximal optimizer. The regularizer could be viewed as a penalty on the decomposition rank. Hence, its minimization enables to find task-adapted ranks automatically. Our method is evaluated in a realistic few-shot fine-tuning setting, where we compare it first to the standard LoRA and then to several other PEFT methods across two distinguishable tasks: base organs and novel organs. Our extensive experiments demonstrate the significant performance improvements driven by our method, highlighting its efficiency and robustness against suboptimal rank initialization. Our code is publicly available: https://github.com/ghassenbaklouti/ARENA
中文摘要:本文提出一种新颖的医学图像分割参数高效微调方法,通过动态调整内在秩实现自适应优化,在多项任务中相比标准LoRA及其他方法展现出更优的性能和自动秩选择能力。
English Summary: This paper introduces a novel parameter-efficient fine-tuning method for medical image segmentation that dynamically adjusts the intrinsic rank during adaptation, outperforming standard LoRA and other methods through automated rank selection and improved performance across different tasks.
Authors:Zihang Ma, Qitian Yin
Abstract:
Graph node classification is a fundamental task in graph neural networks (GNNs), aiming to assign predefined class labels to nodes. On the PubMed citation network dataset, we observe significant classification difficulty disparities, with Category 2 achieving only 74.4% accuracy in traditional GCN, 7.5% lower than Category 1. To address this, we propose a Wasserstein-Rubinstein (WR) distance enhanced Expert Fusion Model (WR-EFM), training specialized GNN models for Categories 0/1 (with layer normalization and residual connections) and Multi-hop Graph Attention Networks (GAT) for Category 2. The WR distance metric optimizes representation similarity between models, particularly focusing on improving Category 2 performance. Our adaptive fusion strategy dynamically weights models based on category-specific performance, with Category 2 assigned a GAT weight of 0.8. WR distance further guides the fusion process by measuring distributional differences between model representations, enabling more principled integration of complementary features.
Experimental results show WR-EFM achieves balanced accuracy across categories: 77.8% (Category 0), 78.0% (Category 1), and 79.9% (Category 2), outperforming both single models and standard fusion approaches. The coefficient of variation (CV) of WR-EFM's category accuracies is 0.013, 77.6% lower than GCN's 0.058, demonstrating superior stability. Notably, WR-EFM improves Category 2 accuracy by 5.5% compared to GCN, verifying the effectiveness of WR-guided fusion in capturing complex structural patterns. This work provides a novel paradigm for handling class-imbalanced graph classification tasks. To promote the research community, we release our project at https://github.com/s010m00n/GASEM4NC.
中文: 提出的Wasserstein-Rubinstein距离增强专家融合模型(WR-EFM)通过为不同类别训练专门模型并采用自适应融合策略,有效解决了图神经网络中的分类差异问题,相比传统GCN将类别2的准确率提升5.5%,并展现出更优的稳定性。
English: The proposed Wasserstein-Rubinstein distance enhanced Expert Fusion Model (WR-EFM) addresses classification disparities in graph neural networks by training specialized models for different categories and using adaptive fusion to achieve balanced performance, improving Category 2 accuracy by 5.5% over traditional GCN with superior stability.
Authors:Felix Köster, Atsushi Uchida
Abstract:
Large Language Models (LLM) have dominated the science and media landscape duo to their impressive performance on processing large chunks of data and produce human-like levels of text. Nevertheless, their huge energy demand and slow processing still a bottleneck for further increasing quality while also making the models accessible to everyone. To solve this bottleneck, we will investigate how reservoir computing performs on natural text processing, which could enable fast and energy efficient hardware implementations. Studies investigating the use of reservoir computing as a language model remain sparse. In this paper, we compare three distinct approaches for character-level language modeling, two different reservoir computing approaches, where only an output layer is trainable, and the well-known transformer-based architectures, which fully learn an attention-based sequence representation. We explore the performance, computational cost and prediction accuracy for both paradigms by equally varying the number of trainable parameters for all models. Using a consistent pipeline for all three approaches, we demonstrate that transformers excel in prediction quality, whereas reservoir computers remain highly efficient reducing the training and inference speed. Furthermore, we investigate two types of reservoir computing: a traditional reservoir with a static linear readout, and an attention-enhanced reservoir that dynamically adapts its output weights via an attention mechanism. Our findings underline how these paradigms scale and offer guidelines to balance resource constraints with performance.
中文: 大语言模型存在能耗高与处理慢的问题,因此研究转向储层计算以实现高效自然文本处理,实验表明储层计算能显著提升效率,而Transformer模型在预测精度上保持优势。
English: Large Language Models face energy and speed limitations, prompting research into reservoir computing for more efficient natural text processing, which shows reduced computational costs while transformers maintain superior accuracy.
Authors:Jiakang Wang, Runze Liu, Fuzheng Zhang, Xiu Li, Guorui Zhou
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has become an effective post-training method for improving the reasoning abilities of Large Language Models (LLMs), mainly by shaping higher-order behaviors such as reflection and planning. However, previous RLVR algorithms often apply uniform training signals to all tokens, without considering the different roles of low-entropy knowledge-related tokens and high-entropy reasoning-related tokens. Some recent methods try to separate these token types by gradient masking or asynchronous updates, but these approaches may break semantic dependencies in the model output and hinder effective learning. In this work, we propose Archer, an entropy-aware RLVR approach with dual-token constraints and synchronous updates. Specifically, our method applies weaker KL regularization and higher clipping thresholds to reasoning tokens to encourage exploration, while using stronger constraints on knowledge tokens to maintain factual knowledge. Experimental results on several mathematical reasoning and code generation benchmarks show that our approach significantly outperforms previous RLVR methods, reaching or exceeding state-of-the-art performance among models of comparable size. The code is available at https://github.com/wizard-III/ArcherCodeR.
中文: 本文提出Archer方法,通过双令牌约束和同步更新机制,对知识令牌和推理令牌分别施加不同强度的约束,在数学推理和代码生成任务上显著优于现有强化学习方法。
English: This paper introduces Archer, an entropy-aware reinforcement learning method with verifiable rewards that applies dual-token constraints and synchronous updates to better handle knowledge and reasoning tokens, significantly outperforming previous approaches on mathematical reasoning and code generation benchmarks.
Authors:Feng-Qi Cui, Anyang Tong, Jinyang Huang, Jie Zhang, Dan Guo, Zhi Liu, Meng Wang
Abstract:
Dynamic Facial Expression Recognition (DFER) plays a critical role in affective computing and human-computer interaction. Although existing methods achieve comparable performance, they inevitably suffer from performance degradation under sample heterogeneity caused by multi-source data and individual expression variability. To address these challenges, we propose a novel framework, called Heterogeneity-aware Distributional Framework (HDF), and design two plug-and-play modules to enhance time-frequency modeling and mitigate optimization imbalance caused by hard samples. Specifically, the Time-Frequency Distributional Attention Module (DAM) captures both temporal consistency and frequency robustness through a dual-branch attention design, improving tolerance to sequence inconsistency and visual style shifts. Then, based on gradient sensitivity and information bottleneck principles, an adaptive optimization module Distribution-aware Scaling Module (DSM) is introduced to dynamically balance classification and contrastive losses, enabling more stable and discriminative representation learning. Extensive experiments on two widely used datasets, DFEW and FERV39k, demonstrate that HDF significantly improves both recognition accuracy and robustness. Our method achieves superior weighted average recall (WAR) and unweighted average recall (UAR) while maintaining strong generalization across diverse and imbalanced scenarios. Codes are released at https://github.com/QIcita/HDF_DFER.
中文: 提出的异构感知分布框架(HDF)通过双插件模块增强时频建模并优化损失平衡,在多个数据集上显著提升了动态面部表情识别的准确性和鲁棒性。
English: The proposed Heterogeneity-aware Distributional Framework (HDF) with dual plug-and-play modules enhances time-frequency modeling and optimizes loss balance, significantly improving dynamic facial expression recognition accuracy and robustness across diverse datasets.
Authors:Xingyu Wu, Yuchen Yan, Shangke Lyu, Linjuan Wu, Yiwen Qiu, Yongliang Shen, Weiming Lu, Jian Shao, Jun Xiao, Yueting Zhuang
Abstract:
Large reasoning models have achieved remarkable performance through extended chain-of-thought sequences, yet this computational freedom leads to excessive token generation even for simple problems. We present Length-Adaptive Policy Optimization (LAPO), a novel framework that transforms reasoning length control from an external constraint into an intrinsic model capability. Unlike existing approaches that impose rigid limits or rely on post-hoc interventions, LAPO enables models to internalize an understanding of appropriate reasoning depth through a two-stage reinforcement learning process. In the first stage, models learn natural reasoning patterns by discovering the statistical distribution of successful solution lengths. The second stage leverages these patterns as meta-cognitive guidance, embedding them directly within the model's reasoning context to ensure inference-time flexibility. Experiments on mathematical reasoning benchmarks demonstrate that LAPO reduces token usage by up to 40.9% while improving accuracy by 2.3%. Our analysis reveals that models trained with LAPO develop emergent abilities to allocate computational resources based on problem complexity, achieving efficient reasoning without sacrificing quality.
中文摘要:LAPO通过两阶段强化学习使模型内化推理深度控制能力,在数学推理基准测试中实现40.9%的令牌使用减少和2.3%的准确率提升,展现出根据问题复杂度自主分配计算资源的新兴能力。
English Summary: LAPO is a reinforcement learning framework that enables models to intrinsically control reasoning length by learning optimal solution patterns, reducing token usage by 40.9% while improving accuracy by 2.3% through adaptive computational allocation.
Authors:Ruizhe Zhu, Hao Zhu, Yaxuan Li, Syang Zhou, Shijing Cai, Malgorzata Lazuka, Elliott Ash
Abstract:
Collecting human-chatbot dialogues typically demands substantial manual effort and is time-consuming, which limits and poses challenges for research on conversational AI. In this work, we propose DialogueForge - a framework for generating AI-simulated conversations in human-chatbot style. To initialize each generated conversation, DialogueForge uses seed prompts extracted from real human-chatbot interactions. We test a variety of LLMs to simulate the human chatbot user, ranging from state-of-the-art proprietary models to small-scale open-source LLMs, and generate multi-turn dialogues tailored to specific tasks. In addition, we explore fine-tuning techniques to enhance the ability of smaller models to produce indistinguishable human-like dialogues. We evaluate the quality of the simulated conversations and compare different models using the UniEval and GTEval evaluation protocols. Our experiments show that large proprietary models (e.g., GPT-4o) generally outperform others in generating more realistic dialogues, while smaller open-source models (e.g., Llama, Mistral) offer promising performance with greater customization. We demonstrate that the performance of smaller models can be significantly improved by employing supervised fine-tuning techniques. Nevertheless, maintaining coherent and natural long-form human-like dialogues remains a common challenge across all models.
中文摘要:DialogueForge框架通过从真实人机对话中提取种子提示来生成模拟对话,实验表明大型专有模型能产生最逼真的对话,而小型开源模型虽可通过微调提升性能,但所有模型在保持长篇对话连贯性方面仍存在挑战。
English Summary: DialogueForge is a framework that generates AI-simulated human-chatbot conversations using seed prompts from real interactions, with experiments showing large proprietary models produce the most realistic dialogues while smaller open-source models can be enhanced through fine-tuning, though all struggle with maintaining coherent long-form conversations.
Authors:Wei Sun, Weixia Zhang, Linhan Cao, Jun Jia, Xiangyang Zhu, Dandan Zhu, Xiongkuo Min, Guangtao Zhai
Abstract:
Face image quality assessment (FIQA) is essential for various face-related applications. Although FIQA has been extensively studied and achieved significant progress, the computational complexity of FIQA algorithms remains a key concern for ensuring scalability and practical deployment in real-world systems. In this paper, we aim to develop a computationally efficient FIQA method that can be easily deployed in real-world applications. Specifically, our method consists of two stages: training a powerful teacher model and distilling a lightweight student model from it. To build a strong teacher model, we adopt a self-training strategy to improve its capacity. We first train the teacher model using labeled face images, then use it to generate pseudo-labels for a set of unlabeled images. These pseudo-labeled samples are used in two ways: (1) to distill knowledge into the student model, and (2) to combine with the original labeled images to further enhance the teacher model through self-training. The enhanced teacher model is used to further pseudo-label another set of unlabeled images for distilling the student models. The student model is trained using a combination of labeled images, pseudo-labeled images from the original teacher model, and pseudo-labeled images from the enhanced teacher model. Experimental results demonstrate that our student model achieves comparable performance to the teacher model with an extremely low computational overhead. Moreover, our method achieved first place in the ICCV 2025 VQualA FIQA Challenge. The code is available at https://github.com/sunwei925/Efficient-FIQA.git.
中文: 本文提出一种高效的人脸图像质量评估方法,通过两阶段的师生蒸馏框架,在保持低计算成本的同时实现高性能,并荣获ICCV 2025 VQualA FIQA挑战赛冠军。
English: This paper presents an efficient face image quality assessment method using a two-stage teacher-student distillation framework, achieving high performance with low computational cost and winning the ICCV 2025 VQualA FIQA Challenge.
Authors:Haomin Qi, Yuyang Du, Lihao Zhang, Soung Chang Liew, Kexin Chen, Yining Du
Abstract:
Large language models (LLMs) have demonstrated immense potential in computer-aided design (CAD), particularly for automated debugging and verification within electronic design automation (EDA) tools. However, Design for Testability (DFT) remains a relatively underexplored area. This paper presents VeriRAG, the first LLM-assisted DFT-EDA framework. VeriRAG leverages a Retrieval-Augmented Generation (RAG) approach to enable LLM to revise code to ensure DFT compliance. VeriRAG integrates (1) an autoencoder-based similarity measurement model for precise retrieval of reference RTL designs for the LLM, and (2) an iterative code revision pipeline that allows the LLM to ensure DFT compliance while maintaining synthesizability. To support VeriRAG, we introduce VeriDFT, a Verilog-based DFT dataset curated for DFT-aware RTL repairs. VeriRAG retrieves structurally similar RTL designs from VeriDFT, each paired with a rigorously validated correction, as references for code repair. With VeriRAG and VeriDFT, we achieve fully automated DFT correction -- resulting in a 7.72-fold improvement in successful repair rate compared to the zero-shot baseline (Fig. 5 in Section V). Ablation studies further confirm the contribution of each component of the VeriRAG framework. We open-source our data, models, and scripts at https://github.com/yuyangdu01/LLM4DFT.
中文摘要:本文提出首个面向可测试性设计的LLM辅助框架VeriRAG,通过检索增强生成技术自动修正代码以满足DFT规范,相比基线方法将修复成功率提升了7.72倍。
English Summary: This paper introduces VeriRAG, the first LLM-assisted framework for Design for Testability in EDA, which uses a retrieval-augmented generation approach to automatically revise code for DFT compliance and achieves a 7.72-fold improvement in repair success over baseline methods.
Authors:Zuo-Liang Zhu, Jian Yang, Beibei Wang
Abstract:
3D Gaussian splatting (3DGS) has shown its detailed expressive ability and highly efficient rendering speed in the novel view synthesis (NVS) task. The application to inverse rendering still faces several challenges, as the discrete nature of Gaussian primitives makes it difficult to apply geometry constraints. Recent works introduce the signed distance field (SDF) as an extra continuous representation to regularize the geometry defined by Gaussian primitives. It improves the decomposition quality, at the cost of increasing memory usage and complicating training. Unlike these works, we introduce a discretized SDF to represent the continuous SDF in a discrete manner by encoding it within each Gaussian using a sampled value. This approach allows us to link the SDF with the Gaussian opacity through an SDF-to-opacity transformation, enabling rendering the SDF via splatting and avoiding the computational cost of ray marching.The key challenge is to regularize the discrete samples to be consistent with the underlying SDF, as the discrete representation can hardly apply the gradient-based constraints (\eg Eikonal loss). For this, we project Gaussians onto the zero-level set of SDF and enforce alignment with the surface from splatting, namely a projection-based consistency loss. Thanks to the discretized SDF, our method achieves higher relighting quality, while requiring no extra memory beyond GS and avoiding complex manually designed optimization. The experiments reveal that our method outperforms existing Gaussian-based inverse rendering methods. Our code is available at https://github.com/NK-CS-ZZL/DiscretizedSDF.
中文: 本文提出了一种离散化的有向距离场(SDF)方法,将其融入3D高斯泼溅技术中,通过投影一致性损失将SDF与高斯不透明度关联,实现了无需额外内存和复杂优化的高质量逆向渲染。
English: This paper introduces a discretized signed distance field (SDF) integrated into 3D Gaussian splatting, enabling high-quality inverse rendering without extra memory or complex optimization by linking SDF to Gaussian opacity through a projection-based consistency loss.
Authors:David Bann, Ed Lowther, Liam Wright, Yevgeniya Kovalchuk
Abstract:
Recent advances in artificial intelligence (AI) - particularly generative AI - present new opportunities to accelerate, or even automate, epidemiological research. Unlike disciplines based on physical experimentation, a sizable fraction of Epidemiology relies on secondary data analysis and thus is well-suited for such augmentation. Yet, it remains unclear which specific tasks can benefit from AI interventions or where roadblocks exist. Awareness of current AI capabilities is also mixed. Here, we map the landscape of epidemiological tasks using existing datasets - from literature review to data access, analysis, writing up, and dissemination - and identify where existing AI tools offer efficiency gains. While AI can increase productivity in some areas such as coding and administrative tasks, its utility is constrained by limitations of existing AI models (e.g. hallucinations in literature reviews) and human systems (e.g. barriers to accessing datasets). Through examples of AI-generated epidemiological outputs, including fully AI-generated papers, we demonstrate that recently developed agentic systems can now design and execute epidemiological analysis, albeit to varied quality (see https://github.com/edlowther/automated-epidemiology). Epidemiologists have new opportunities to empirically test and benchmark AI systems; realising the potential of AI will require two-way engagement between epidemiologists and engineers.
中文:人工智能的最新进展为流行病学研究提供了从文献综述到成果传播全流程自动化的机遇,但其应用受限于模型缺陷(如文献综述中的幻觉问题)和人类系统障碍。
English: Recent advances in generative AI offer opportunities to automate epidemiological research by mapping tasks from literature review to dissemination, though its effectiveness is limited by model constraints like hallucinations and human system barriers.
Authors:Zihui Gao, Jia-Wang Bian, Guosheng Lin, Hao Chen, Chunhua Shen
Abstract:
Surface reconstruction and novel view rendering from sparse-view images are challenging. Signed Distance Function (SDF)-based methods struggle with fine details, while 3D Gaussian Splatting (3DGS)-based approaches lack global geometry coherence. We propose a novel hybrid method that combines the strengths of both approaches: SDF captures coarse geometry to enhance 3DGS-based rendering, while newly rendered images from 3DGS refine the details of SDF for accurate surface reconstruction. As a result, our method surpasses state-of-the-art approaches in surface reconstruction and novel view synthesis on the DTU and MobileBrick datasets. Code will be released at https://github.com/aim-uofa/SurfaceSplat.
中文: 本研究提出了一种融合SDF全局几何与3DGS细节渲染的混合方法,在基准数据集上实现了卓越的表面重建和新视角合成效果。
English: This study introduces a hybrid method that integrates SDF for global geometry and 3DGS for detailed rendering, achieving superior surface reconstruction and novel view synthesis on benchmark datasets.
Authors:Salah Eddine Bekhouche, Gaby Maroun, Fadi Dornaika, Abdenour Hadid
Abstract:
Medical image segmentation is crucial for many healthcare tasks, including disease diagnosis and treatment planning. One key area is the segmentation of skin lesions, which is vital for diagnosing skin cancer and monitoring patients. In this context, this paper introduces SegDT, a new segmentation model based on diffusion transformer (DiT). SegDT is designed to work on low-cost hardware and incorporates Rectified Flow, which improves the generation quality at reduced inference steps and maintains the flexibility of standard diffusion models. Our method is evaluated on three benchmarking datasets and compared against several existing works, achieving state-of-the-art results while maintaining fast inference speeds. This makes the proposed model appealing for real-world medical applications. This work advances the performance and capabilities of deep learning models in medical image analysis, enabling faster, more accurate diagnostic tools for healthcare professionals. The code is made publicly available at \href{https://github.com/Bekhouche/SegDT}{GitHub}.
中文: 本文提出SegDT模型,这是一种基于扩散变换器的皮肤病变分割方法,在保持低成本硬件快速推理的同时,在多个基准数据集上取得了最先进的性能。
English: This paper introduces SegDT, a diffusion transformer-based model for efficient skin lesion segmentation that achieves state-of-the-art results on benchmark datasets while enabling fast inference on low-cost hardware.
Authors:Hugo Carlesso, Maria Eliza Patulea, Moncef Garouani, Radu Tudor Ionescu, Josiane Mothe
Abstract:
Mixup has become a popular augmentation strategy for image classification, yet its naive pixel-wise interpolation often produces unrealistic images that can hinder learning, particularly in high-stakes medical applications. We propose GeMix, a two-stage framework that replaces heuristic blending with a learned, label-aware interpolation powered by class-conditional GANs. First, a StyleGAN2-ADA generator is trained on the target dataset. During augmentation, we sample two label vectors from Dirichlet priors biased toward different classes and blend them via a Beta-distributed coefficient. Then, we condition the generator on this soft label to synthesize visually coherent images that lie along a continuous class manifold. We benchmark GeMix on the large-scale COVIDx-CT-3 dataset using three backbones (ResNet-50, ResNet-101, EfficientNet-B0). When combined with real data, our method increases macro-F1 over traditional mixup for all backbones, reducing the false negative rate for COVID-19 detection. GeMix is thus a drop-in replacement for pixel-space mixup, delivering stronger regularization and greater semantic fidelity, without disrupting existing training pipelines. We publicly release our code at https://github.com/hugocarlesso/GeMix to foster reproducibility and further research.
中文摘要:GeMix提出了一种基于类别条件生成对抗网络的两阶段框架,通过生成具有语义连续性的逼真插值图像,在COVID-19检测任务中全面优于传统混合增强方法,显著降低了假阴性率并保持即插即用的特性。
English Summary: GeMix introduces a two-stage framework using class-conditional GANs to generate realistic, label-aware interpolated images for medical image classification, outperforming traditional mixup methods across multiple backbones on COVID-19 detection while reducing false negatives.
Authors:Nicolas Poggi, Shashank Agnihotri, Margret Keuper
Abstract:
Terahertz (THz) imaging enables non-invasive analysis for applications such as security screening and material classification, but effective image classification remains challenging due to limited annotations, low resolution, and visual ambiguity. We introduce In-Context Learning (ICL) with Vision-Language Models (VLMs) as a flexible, interpretable alternative that requires no fine-tuning. Using a modality-aligned prompting framework, we adapt two open-weight VLMs to the THz domain and evaluate them under zero-shot and one-shot settings. Our results show that ICL improves classification and interpretability in low-data regimes. This is the first application of ICL-enhanced VLMs to THz imaging, offering a promising direction for resource-constrained scientific domains. Code: \href{https://github.com/Nicolas-Poggi/Project_THz_Classification/tree/main}{GitHub repository}.
中文: 本文提出采用情境学习与视觉语言模型相结合的方法,为太赫兹图像分类提供无需微调的灵活解决方案,在数据稀缺条件下显著提升了分类效果和可解释性。
English: This paper introduces In-Context Learning with Vision-Language Models as a flexible, interpretable method for terahertz image classification, demonstrating improved performance in low-data scenarios without requiring fine-tuning.
Authors:Qinqian Lei, Bo Wang, Robby T. Tan
Abstract:
Zero-shot human-object interaction (HOI) detection remains a challenging task, particularly in generalizing to unseen actions. Existing methods address this challenge by tapping Vision-Language Models (VLMs) to access knowledge beyond the training data. However, they either struggle to distinguish actions involving the same object or demonstrate limited generalization to unseen classes. In this paper, we introduce HOLa (Zero-Shot HOI Detection with Low-Rank Decomposed VLM Feature Adaptation), a novel approach that both enhances generalization to unseen classes and improves action distinction. In training, HOLa decomposes VLM text features for given HOI classes via low-rank factorization, producing class-shared basis features and adaptable weights. These features and weights form a compact HOI representation that preserves shared information across classes, enhancing generalization to unseen classes. Subsequently, we refine action distinction by adapting weights for each HOI class and introducing human-object tokens to enrich visual interaction representations. To further distinguish unseen actions, we guide the weight adaptation with LLM-derived action regularization. Experimental results show that our method sets a new state-of-the-art across zero-shot HOI settings on HICO-DET, achieving an unseen-class mAP of 27.91 in the unseen-verb setting. Our code is available at https://github.com/ChelsieLei/HOLa.
Chinese: 本文提出HOLa方法,通过低秩分解视觉语言模型特征来增强对未见类别的泛化能力并改进动作区分,在零样本人-物交互检测中于HICO-DET数据集上取得了最先进的性能。
English: The paper introduces HOLa, a novel method for zero-shot human-object interaction detection that uses low-rank decomposition of vision-language model features to enhance generalization to unseen classes and improve action distinction, achieving state-of-the-art results on HICO-DET.
Authors:Jongmin Shin, Enki Cho, Ka Young Kim, Jung Yong Kim, Seong Tae Kim, Namkee Oh
Abstract:
Surgical scene understanding is crucial for computer-assisted intervention systems, requiring visual comprehension of surgical scenes that involves diverse elements such as surgical tools, anatomical structures, and their interactions. To effectively represent the complex information in surgical scenes, graph-based approaches have been explored to structurally model surgical entities and their relationships. Previous surgical scene graph studies have demonstrated the feasibility of representing surgical scenes using graphs. However, certain aspects of surgical scenes-such as diverse combinations of tool-action-target and the identity of the hand operating the tool-remain underexplored in graph-based representations, despite their importance. To incorporate these aspects into graph representations, we propose Endoscapes-SG201 dataset, which includes annotations for tool-action-target combinations and hand identity. We also introduce SSG-Com, a graph-based method designed to learn and represent these critical elements. Through experiments on downstream tasks such as critical view of safety assessment and action triplet recognition, we demonstrated the importance of integrating these essential scene graph components, highlighting their significant contribution to surgical scene understanding. The code and dataset are available at https://github.com/ailab-kyunghee/SSG-Com
中文摘要:本研究提出Endoscapes-SG201数据集和SSG-Com方法,通过将工具-动作-目标组合及操作手身份纳入图表示来提升手术场景理解能力,并通过下游任务验证了这些要素的重要价值。
English Summary: This study introduces the Endoscapes-SG201 dataset and SSG-Com method to enhance surgical scene understanding by incorporating tool-action-target combinations and hand identity into graph representations, demonstrating their critical value through downstream task evaluations.
Authors:Simon Winther Albertsen, Hjalte Svaneborg Bjørnstrup, Mostafa Mehdipour Ghazi
Abstract:
Accurate segmentation is crucial for clinical applications, but existing models often assume fixed, high-resolution inputs and degrade significantly when faced with lower-resolution data in real-world scenarios. To address this limitation, we propose RARE-UNet, a resolution-aware multi-scale segmentation architecture that dynamically adapts its inference path to the spatial resolution of the input. Central to our design are multi-scale blocks integrated at multiple encoder depths, a resolution-aware routing mechanism, and consistency-driven training that aligns multi-resolution features with full-resolution representations. We evaluate RARE-UNet on two benchmark brain imaging tasks for hippocampus and tumor segmentation. Compared to standard UNet, its multi-resolution augmented variant, and nnUNet, our model achieves the highest average Dice scores of 0.84 and 0.65 across resolution, while maintaining consistent performance and significantly reduced inference time at lower resolutions. These results highlight the effectiveness and scalability of our architecture in achieving resolution-robust segmentation. The codes are available at: https://github.com/simonsejse/RARE-UNet.
Chinese: RARE-UNet是一种分辨率自适应的分割架构,能根据输入分辨率动态调整推理路径,相比现有模型在不同分辨率下均获得更高的Dice分数并显著减少推理时间。
English: RARE-UNet is a resolution-aware segmentation architecture that dynamically adapts to input resolutions, achieving superior Dice scores and faster inference across varied resolutions compared to existing models.
Authors:Hanting Li, Fei Zhou, Xin Sun, Yang Hua, Jungong Han, Liang-Jie Zhang
Abstract:
Recent Transformer-based low-light enhancement methods have made promising progress in recovering global illumination. However, they still struggle with non-uniform lighting scenarios, such as backlit and shadow, appearing as over-exposure or inadequate brightness restoration. To address this challenge, we present a Spatially-Adaptive Illumination-Guided Transformer (SAIGFormer) framework that enables accurate illumination restoration. Specifically, we propose a dynamic integral image representation to model the spatially-varying illumination, and further construct a novel Spatially-Adaptive Integral Illumination Estimator ($\text{SAI}^2\text{E}$). Moreover, we introduce an Illumination-Guided Multi-head Self-Attention (IG-MSA) mechanism, which leverages the illumination to calibrate the lightness-relevant features toward visual-pleased illumination enhancement. Extensive experiments on five standard low-light datasets and a cross-domain benchmark (LOL-Blur) demonstrate that our SAIGFormer significantly outperforms state-of-the-art methods in both quantitative and qualitative metrics. In particular, our method achieves superior performance in non-uniform illumination enhancement while exhibiting strong generalization capabilities across multiple datasets. Code is available at https://github.com/LHTcode/SAIGFormer.git.
中文: 提出的SAIGFormer框架通过空间自适应照明估计器和照明引导注意力机制,有效解决了低光增强中的非均匀照明问题,在多个数据集上实现了最先进的性能表现。
English: The proposed SAIGFormer framework effectively addresses non-uniform lighting challenges in low-light enhancement by employing a spatially-adaptive illumination estimator and an illumination-guided attention mechanism, achieving state-of-the-art performance across multiple datasets.
Authors:Sizhou Chen, Shufan Jiang, Chi Zhang, Xiao-Lei Zhang, Xuelong Li
Abstract:
Creating an immersive and interactive theatrical experience is a long-term goal in the field of interactive narrative. The emergence of large language model (LLM) is providing a new path to achieve this goal. However, existing LLM-based drama generation methods often result in AI agents that lack initiative and cannot interact with the physical environment. Furthermore, these methods typically require detailed user input to drive the drama. These limitations reduce the interactivity and immersion of online real-time performance. To address the above challenges, we propose HAMLET, a multi-agent framework focused on drama creation and online performance. Given a simple topic, the framework generates a narrative blueprint, guiding the subsequent improvisational performance. During the online performance, each actor is given an autonomous mind. This means that actors can make independent decisions based on their own background, goals, and emotional state. In addition to conversations with other actors, their decisions can also change the state of scene props through actions such as opening a letter or picking up a weapon. The change is then broadcast to other related actors, updating what they know and care about, which in turn influences their next action. To evaluate the quality of drama performance, we designed an evaluation method to assess three primary aspects, including character performance, narrative quality, and interaction experience. The experimental evaluation shows that HAMLET can create expressive and coherent theatrical experiences. Our code, dataset and models are available at https://github.com/HAMLET-2025/HAMLET.
中文: HAMLET是一个多智能体框架,通过赋予演员自主决策能力并与角色及场景道具互动,生成具有表现力和连贯性的沉浸式戏剧体验,提升了线上演出的互动性。
English: HAMLET is a multi-agent framework that generates immersive theatrical experiences by enabling autonomous actors to make decisions and interact with both characters and physical props, enhancing interactivity and coherence in online performances.
Authors:Kaiyan Chang, Yonghao Shi, Chenglong Wang, Hang Zhou, Chi Hu, Xiaoqian Liu, Yingfeng Luo, Yuan Ge, Tong Xiao, Jingbo Zhu
Abstract:
Test-Time Scaling (TTS) is a promising approach to progressively elicit the model's intelligence during inference. Recently, training-based TTS methods, such as continued reinforcement learning (RL), have further surged in popularity, while training-free TTS methods are gradually fading from prominence. However, the additional computation overhead of training amplifies the burden on test-time scaling. In this paper, we focus on training-free TTS methods for reasoning. We first design Conditional Step-level Self-refinement, a fine-grained sequential scaling method guided by process verification. On top of its effectiveness, we further combine it with other classical parallel scaling methods at the step level, to introduce a novel inference paradigm called Hybrid Test-Time Scaling. Extensive experiments on five instruction-tuned LLMs across different scales (3B-14B) and families demonstrate that hybrid strategy incorporating various training-free TTS methods at a fine granularity has considerable potential for expanding the reasoning performance boundaries of LLMs.
Chinese: 混合测试时缩放通过结合细粒度的顺序与并行无训练方法,无需额外计算负担即可显著提升大语言模型的推理性能边界。
English: Hybrid Test-Time Scaling combines fine-grained sequential and parallel training-free methods to significantly enhance reasoning performance in large language models without additional computational overhead.
Authors:Johannes Ackermann, Takashi Ishida, Masashi Sugiyama
Abstract:
Reinforcement Learning from Human Feedback (RLHF) allows us to train models, such as language models (LMs), to follow complex human preferences. In RLHF for LMs, we first train an LM using supervised fine-tuning, sample pairs of responses, obtain human feedback, and use the resulting data to train a reward model (RM). RL methods are then used to train the LM to maximize the reward given by the RM. As training progresses, the responses generated by the LM no longer resemble the responses seen by the RM during training, leading to the RM becoming inaccurate. The score given by the RM keeps increasing, but the learned behavior no longer matches the human preferences. This issue is known as overoptimization. We investigate overoptimization from the point of view of distribution shift and show that the shift results in an inconsistent estimate of the RM parameters, leading to an inconsistent estimate of the policy gradient. We propose Off-Policy Corrected Reward Modeling (OCRM), which iteratively off-policy corrects the RM using importance weighting, without requiring new labels or samples. This results in a more accurate RM, which empirically leads to an improved final policy. We validate our approach in experiments with summarization and chatbot datasets and show that it performs significantly better than standard RLHF methods and baselines. Our implementation is available at https://github.com/JohannesAck/OffPolicyCorrectedRewardModeling
中文摘要:基于人类反馈的强化学习(RLHF)训练语言模型以符合人类偏好,但存在因分布偏移导致的过度优化问题,提出的离策略校正奖励建模(OCRM)方法通过重要性加权校正奖励模型,显著提升了模型性能。
English summary: Reinforcement Learning from Human Feedback (RLHF) trains language models to align with human preferences but faces overoptimization due to distribution shift, which is addressed by the proposed Off-Policy Corrected Reward Modeling (OCRM) method for improved performance.
Authors:Deyu Zhang, Tingting Long, Jinrui Zhang, Ligeng Chen, Ju Ren, Yaoxue Zhang
Abstract:
Enabling efficient text-video retrieval on edge-end devices is critical for real-world applications. Yet, existing methods face a critical challenge in balancing accuracy and computational efficiency: uniform frame sampling methods ensure content coverage but incur prohibitive computational costs, while salient-frame sampling methods reduce overhead but suffer from query-agnostic frame selection that biases retrieval results. To address this, we propose ProCLIP, a user-centric framework that achieves state-of-the-art accuracy with significantly improved efficiency. We design a prompt-aware frame sampling strategy that dynamically guides lightweight feature extractors using textual prompts to select semantically relevant frames, overcoming the limitations of existing salient-frame sampling methods which rely on static, query-agnostic selection criteria. Moreover, we adopt a two-stage candidate pruning strategy that combines rapid coarse filtering via a lightweight module with CLIP-powered fine-grained re-ranking, enhancing retrieval efficiency while preserving accuracy. Experiments across benchmarks show ProCLIP achieves 75.3% latency reduction versus baselines while maintaining competitive accuracy, i.e., R@1=49.0 in MSR-VTT dataset. Code is available at https://github.com/tiffylong/ProCLIP.
Chinese: ProCLIP框架通过提示感知的帧采样策略和两阶段候选剪枝方法,在保持检索精度的同时将计算延迟显著降低75.3%,有效解决了文本-视频检索中效率与精度的平衡难题。
English: ProCLIP introduces a prompt-aware frame sampling strategy and a two-stage candidate pruning approach to significantly reduce computational latency by 75.3% while maintaining competitive retrieval accuracy in text-video retrieval systems.
Authors:Liang Chen, Ghazi Shazan Ahmad, Tianjun Yao, Lingqiao Liu, Zhiqiang Shen
Abstract:
Pretrained vision-language models (VLMs), such as CLIP, achieve remarkable zero-shot performance, yet their downstream potential hinges on effective fine-tuning. Most adaptation methods typically focus on refining representation from separate modalities (text or vision) but neglect the critical role of their fused representations in the decision-making process, \emph{\ie} rational matrix that drives the final prediction. To bridge the gap, we propose a simple yet effective \textbf{R}ational \textbf{Ada}ptaion ({RAda}) to explicitly exploit the final fused representation during fine-tuning. RAda employs a learned mask, obtained from a lightweight attention layer attached at the end of a VLM, to dynamically calibrate the contribution of each element in the rational matrix, enabling targeted adjustments to the final cross-modal interactions without incurring costly modifications to intermediate features. Experiments in different settings (i.e., updating, or freezing pretrained encoders in adaptation, and test-time training that can only access the unlabeled test data) show that RAda serves as a versatile fine-tuning technique, improving the baseline with minimal code and performing comparably against current arts in most settings. Code is available at \href{https://github.com/khufia/RAda/tree/main}{github.com/khufia/RAda}.
中文: 提出的RAda方法通过可学习的掩码动态校准融合的跨模态表征,以最小计算成本显著提升视觉语言模型在不同适配场景下的微调性能。
English: The proposed RAda method enhances vision-language model fine-tuning by dynamically calibrating the fused cross-modal representation through a learned mask, achieving versatile performance improvements across different adaptation settings with minimal computational cost.
Authors:Ka Young Kim, Hyeon Bae Kim, Seong Tae Kim
Abstract:
Surgical phase recognition plays a crucial role in surgical workflow analysis, enabling various applications such as surgical monitoring, skill assessment, and workflow optimization. Despite significant advancements in deep learning-based surgical phase recognition, these models remain inherently opaque, making it difficult to understand how they make decisions. This lack of interpretability hinders trust and makes it challenging to debug the model. To address this challenge, we propose SurgX, a novel concept-based explanation framework that enhances the interpretability of surgical phase recognition models by associating neurons with relevant concepts. In this paper, we introduce the process of selecting representative example sequences for neurons, constructing a concept set tailored to the surgical video dataset, associating neurons with concepts and identifying neurons crucial for predictions. Through extensive experiments on two surgical phase recognition models, we validate our method and analyze the explanation for prediction. This highlights the potential of our method in explaining surgical phase recognition. The code is available at https://github.com/ailab-kyunghee/SurgX
中文: 手术阶段识别对分析手术流程至关重要,但现有深度学习模型缺乏可解释性,因此作者提出SurgX这一基于概念的解释框架,通过将神经元与相关概念关联来增强模型透明度,并通过实验验证了其有效性。
English: Surgical phase recognition is essential for analyzing surgical workflows but current deep learning models lack interpretability, so the authors propose SurgX, a concept-based explanation framework that enhances model transparency by linking neurons to relevant concepts and validating its effectiveness through experiments.
Authors:Huiyu Zhai, Xingxing Yang, Yalan Ye, Chenyang Li, Bin Fan, Changze Li
Abstract:
Facial expression recognition (FER) is a challenging task due to pervasive occlusion and dataset biases. Especially when facial information is partially occluded, existing FER models struggle to extract effective facial features, leading to inaccurate classifications. In response, we present ORSANet, which introduces the following three key contributions: First, we introduce auxiliary multi-modal semantic guidance to disambiguate facial occlusion and learn high-level semantic knowledge, which is two-fold: 1) we introduce semantic segmentation maps as dense semantics prior to generate semantics-enhanced facial representations; 2) we introduce facial landmarks as sparse geometric prior to mitigate intrinsic noises in FER, such as identity and gender biases. Second, to facilitate the effective incorporation of these two multi-modal priors, we customize a Multi-scale Cross-interaction Module (MCM) to adaptively fuse the landmark feature and semantics-enhanced representations within different scales. Third, we design a Dynamic Adversarial Repulsion Enhancement Loss (DARELoss) that dynamically adjusts the margins of ambiguous classes, further enhancing the model's ability to distinguish similar expressions. We further construct the first occlusion-oriented FER dataset to facilitate specialized robustness analysis on various real-world occlusion conditions, dubbed Occlu-FER. Extensive experiments on both public benchmarks and Occlu-FER demonstrate that our proposed ORSANet achieves SOTA recognition performance. Code is publicly available at https://github.com/Wenyuzhy/ORSANet-master.
中文: ORSANet通过引入多模态语义引导、多尺度交互模块和动态对抗排斥损失,有效解决了面部表情识别中的遮挡和数据集偏差问题,在公开基准和新建的Occlu-FER数据集上均实现了最优识别性能。
English: ORSANet addresses facial expression recognition challenges from occlusion and dataset biases by integrating multi-modal semantic guidance, a multi-scale fusion module, and a dynamic loss function, achieving state-of-the-art performance on both public benchmarks and the newly introduced Occlu-FER dataset.
Authors:Hengyu Zhang, Chunxu Shen, Xiangguo Sun, Jie Tan, Yanchao Tan, Yu Rong, Hong Cheng, Lingling Yi
Abstract:
In real-world recommendation scenarios, users typically engage with platforms through multiple types of behavioral interactions. Multi-behavior recommendation algorithms aim to leverage various auxiliary user behaviors to enhance prediction for target behaviors of primary interest (e.g., buy), thereby overcoming performance limitations caused by data sparsity in target behavior records. Current state-of-the-art approaches typically employ hierarchical design following either cascading (e.g., view$\rightarrow$cart$\rightarrow$buy) or parallel (unified$\rightarrow$behavior$\rightarrow$specific components) paradigms, to capture behavioral relationships. However, these methods still face two critical challenges: (1) severe distribution disparities across behaviors, and (2) negative transfer effects caused by noise in auxiliary behaviors. In this paper, we propose a novel model-agnostic Hierarchical Graph Information Bottleneck (HGIB) framework for multi-behavior recommendation to effectively address these challenges. Following information bottleneck principles, our framework optimizes the learning of compact yet sufficient representations that preserve essential information for target behavior prediction while eliminating task-irrelevant redundancies. To further mitigate interaction noise, we introduce a Graph Refinement Encoder (GRE) that dynamically prunes redundant edges through learnable edge dropout mechanisms. We conduct comprehensive experiments on three real-world public datasets, which demonstrate the superior effectiveness of our framework. Beyond these widely used datasets in the academic community, we further expand our evaluation on several real industrial scenarios and conduct an online A/B testing, showing again a significant improvement in multi-behavior recommendations. The source code of our proposed HGIB is available at https://github.com/zhy99426/HGIB.
中文: 本文提出了一种分层图信息瓶颈框架,通过优化学习紧凑表示并引入图细化编码器来消除冗余,有效解决了多行为推荐中的分布差异和负迁移问题。
English: This paper introduces a Hierarchical Graph Information Bottleneck (HGIB) framework that addresses distribution disparities and negative transfer in multi-behavior recommendation by learning compact representations and incorporating a Graph Refinement Encoder to reduce interaction noise.
Authors:Julia Machnio, Mads Nielsen, Mostafa Mehdipour Ghazi
Abstract:
Active learning (AL) seeks to reduce annotation costs by selecting the most informative samples for labeling, making it particularly valuable in resource-constrained settings. However, traditional evaluation methods, which focus solely on final accuracy, fail to capture the full dynamics of the learning process. To address this gap, we propose PALM (Performance Analysis of Active Learning Models), a unified and interpretable mathematical model that characterizes AL trajectories through four key parameters: achievable accuracy, coverage efficiency, early-stage performance, and scalability. PALM provides a predictive description of AL behavior from partial observations, enabling the estimation of future performance and facilitating principled comparisons across different strategies. We validate PALM through extensive experiments on CIFAR-10/100 and ImageNet-50/100/200, covering a wide range of AL methods and self-supervised embeddings. Our results demonstrate that PALM generalizes effectively across datasets, budgets, and strategies, accurately predicting full learning curves from limited labeled data. Importantly, PALM reveals crucial insights into learning efficiency, data space coverage, and the scalability of AL methods. By enabling the selection of cost-effective strategies and predicting performance under tight budget constraints, PALM lays the basis for more systematic, reproducible, and data-efficient evaluation of AL in both research and real-world applications. The code is available at: https://github.com/juliamachnio/PALM.
中文: PALM提出了一种统一的数学模型,通过四个关键参数描述主动学习轨迹,能在不同数据集和预算下实现性能预测与策略比较。
English: PALM introduces a unified mathematical model to characterize active learning trajectories through four key parameters, enabling performance prediction and strategy comparison across diverse datasets and budgets.
Authors:Zhiyu Pan, Xiongjun Guan, Yongjie Duan, Jianjiang Feng, Jie Zhou
Abstract:
Fingerprint matching under diverse capture conditions remains a fundamental challenge in biometric recognition. To achieve robust and accurate performance in such scenarios, we propose DMD, a minutiae-anchored local dense representation which captures both fine-grained ridge textures and discriminative minutiae features in a spatially structured manner. Specifically, descriptors are extracted from local patches centered and oriented on each detected minutia, forming a three-dimensional tensor, where two dimensions represent spatial locations on the fingerprint plane and the third encodes semantic features. This representation explicitly captures abstract features of local image patches, enabling a multi-level, fine-grained description that aggregates information from multiple minutiae and their surrounding ridge structures. Furthermore, thanks to its strong spatial correspondence with the patch image, DMD allows for the use of foreground segmentation masks to identify valid descriptor regions. During matching, comparisons are then restricted to overlapping foreground areas, improving efficiency and robustness. Extensive experiments on rolled, plain, parital, contactless, and latent fingerprint datasets demonstrate the effectiveness and generalizability of the proposed method. It achieves state-of-the-art accuracy across multiple benchmarks while maintaining high computational efficiency, showing strong potential for large-scale fingerprint recognition. Corresponding code is available at https://github.com/Yu-Yy/DMD.
中文: 提出的DMD方法采用基于细节点的密集表示,通过三维张量结构化地捕捉脊线纹理和细节点特征,并利用前景约束匹配在多种指纹数据集上实现了最优的准确率和效率。
English: The proposed DMD method introduces a minutiae-anchored dense representation that captures both ridge textures and minutiae features in a structured 3D tensor, achieving state-of-the-art accuracy and efficiency across diverse fingerprint datasets through foreground-constrained matching.
Authors:Emile Anand, Sarah Liaw
Abstract:
Thompson Sampling (TS) is widely used to address the exploration/exploitation tradeoff in contextual bandits, yet recent theory shows that it does not explore aggressively enough in high-dimensional problems. Feel-Good Thompson Sampling (FG-TS) addresses this by adding an optimism bonus that biases toward high-reward models, and it achieves the asymptotically minimax-optimal regret in the linear setting when posteriors are exact. However, its performance with \emph{approximate} posteriors -- common in large-scale or neural problems -- has not been benchmarked. We provide the first systematic study of FG-TS and its smoothed variant (SFG-TS) across eleven real-world and synthetic benchmarks. To evaluate their robustness, we compare performance across settings with exact posteriors (linear and logistic bandits) to approximate regimes produced by fast but coarse stochastic-gradient samplers. Ablations over preconditioning, bonus scale, and prior strength reveal a trade-off: larger bonuses help when posterior samples are accurate, but hurt when sampling noise dominates. FG-TS generally outperforms vanilla TS in linear and logistic bandits, but tends to be weaker in neural bandits. Nevertheless, because FG-TS and its variants are competitive and easy-to-use, we recommend them as baselines in modern contextual-bandit benchmarks. Finally, we provide source code for all our experiments in https://github.com/SarahLiaw/ctx-bandits-mcmc-showdown.
Chinese: Feel-Good汤普森采样通过乐观奖励增强情境赌博机中的探索,在线性和逻辑场景中优于标准方法,但在神经网络场景中表现较弱,尽管对近似后验敏感,仍被推荐为基准算法。
English: Feel-Good Thompson Sampling enhances exploration in contextual bandits with an optimism bonus, outperforming standard methods in linear and logistic settings but showing limitations in neural bandits, making it a recommended baseline despite sensitivity to approximate posteriors.
Authors:Navid Ayoobi, Sadat Shahriar, Arjun Mukherjee
Abstract:
We present a novel evaluation paradigm for AI text detectors that prioritizes real-world and equitable assessment. Current approaches predominantly report conventional metrics like AUROC, overlooking that even modest false positive rates constitute a critical impediment to practical deployment of detection systems. Furthermore, real-world deployment necessitates predetermined threshold configuration, making detector stability (i.e. the maintenance of consistent performance across diverse domains and adversarial scenarios), a critical factor. These aspects have been largely ignored in previous research and benchmarks. Our benchmark, SHIELD, addresses these limitations by integrating both reliability and stability factors into a unified evaluation metric designed for practical assessment. Furthermore, we develop a post-hoc, model-agnostic humanification framework that modifies AI text to more closely resemble human authorship, incorporating a controllable hardness parameter. This hardness-aware approach effectively challenges current SOTA zero-shot detection methods in maintaining both reliability and stability. (Data and code: https://github.com/navid-aub/SHIELD-Benchmark)
中文: SHIELD提出了一种新颖的AI文本检测器评估范式,将可靠性和稳定性指标整合用于实际部署,并开发了人化框架以挑战现有检测方法。
English: SHIELD introduces a novel AI text detector evaluation paradigm that integrates reliability and stability metrics for practical deployment, alongside a humanification framework to challenge current detection methods.
Authors:Haichao Liu, Haoren Guo, Pei Liu, Benshan Ma, Yuxiang Zhang, Jun Ma, Tong Heng Lee
Abstract:
Scene understanding and risk-aware attentions are crucial for human drivers to make safe and effective driving decisions. To imitate this cognitive ability in urban autonomous driving while ensuring the transparency and interpretability, we propose a vision-language model (VLM)-enhanced unified decision-making and motion control framework, named VLM-UDMC. This framework incorporates scene reasoning and risk-aware insights into an upper-level slow system, which dynamically reconfigures the optimal motion planning for the downstream fast system. The reconfiguration is based on real-time environmental changes, which are encoded through context-aware potential functions. More specifically, the upper-level slow system employs a two-step reasoning policy with Retrieval-Augmented Generation (RAG), leveraging foundation models to process multimodal inputs and retrieve contextual knowledge, thereby generating risk-aware insights. Meanwhile, a lightweight multi-kernel decomposed LSTM provides real-time trajectory predictions for heterogeneous traffic participants by extracting smoother trend representations for short-horizon trajectory prediction. The effectiveness of the proposed VLM-UDMC framework is verified via both simulations and real-world experiments with a full-size autonomous vehicle. It is demonstrated that the presented VLM-UDMC effectively leverages scene understanding and attention decomposition for rational driving decisions, thus improving the overall urban driving performance. Our open-source project is available at https://github.com/henryhcliu/vlmudmc.git.
中文: 本文提出了VLM-UDMC框架,通过视觉语言模型整合场景推理和风险感知,动态优化自动驾驶车辆的运动规划,其有效性已通过仿真和实车实验验证。
English: This paper introduces VLM-UDMC, a vision-language model-enhanced framework that integrates scene reasoning and risk-aware insights to dynamically reconfigure motion planning for autonomous vehicles, validated through simulations and real-world tests.
Authors:Zhaochen Guo, Zhixiang Shen, Xuanting Xie, Liangjian Wen, Zhao Kang
Abstract:
Multimodal graphs, which integrate unstructured heterogeneous data with structured interconnections, offer substantial real-world utility but remain insufficiently explored in unsupervised learning. In this work, we initiate the study of multimodal graph clustering, aiming to bridge this critical gap. Through empirical analysis, we observe that real-world multimodal graphs often exhibit hybrid neighborhood patterns, combining both homophilic and heterophilic relationships. To address this challenge, we propose a novel framework -- \textsc{Disentangled Multimodal Graph Clustering (DMGC)} -- which decomposes the original hybrid graph into two complementary views: (1) a homophily-enhanced graph that captures cross-modal class consistency, and (2) heterophily-aware graphs that preserve modality-specific inter-class distinctions. We introduce a \emph{Multimodal Dual-frequency Fusion} mechanism that jointly filters these disentangled graphs through a dual-pass strategy, enabling effective multimodal integration while mitigating category confusion. Our self-supervised alignment objectives further guide the learning process without requiring labels. Extensive experiments on both multimodal and multi-relational graph datasets demonstrate that DMGC achieves state-of-the-art performance, highlighting its effectiveness and generalizability across diverse settings. Our code is available at https://github.com/Uncnbb/DMGC.
中文: 本文提出的DMGC框架通过将混合邻域模式分解为同质性增强图和异质性感知图,采用自监督学习无需标签即可实现多模态图聚类,并在多种数据集上取得了最优性能。
English: This paper introduces DMGC, a novel framework for multimodal graph clustering that disentangles hybrid neighborhood patterns into homophily-enhanced and heterophily-aware graphs, achieving state-of-the-art performance through self-supervised learning without requiring labels.
Authors:Yanbing Zhang, Zhe Wang, Qin Zhou, Mengping Yang
Abstract:
In light of recent breakthroughs in text-to-image (T2I) generation, particularly with diffusion transformers (DiT), subject-driven technologies are increasingly being employed for high-fidelity customized production that preserves subject identity from reference inputs, enabling thrilling design workflows and engaging entertainment. Existing alternatives typically require either per-subject optimization via trainable text embeddings or training specialized encoders for subject feature extraction on large-scale datasets. Such dependencies on training procedures fundamentally constrain their practical applications. More importantly, current methodologies fail to fully leverage the inherent zero-shot potential of modern diffusion transformers (e.g., the Flux series) for authentic subject-driven synthesis. To bridge this gap, we propose FreeCus, a genuinely training-free framework that activates DiT's capabilities through three key innovations: 1) We introduce a pivotal attention sharing mechanism that captures the subject's layout integrity while preserving crucial editing flexibility. 2) Through a straightforward analysis of DiT's dynamic shifting, we propose an upgraded variant that significantly improves fine-grained feature extraction. 3) We further integrate advanced Multimodal Large Language Models (MLLMs) to enrich cross-modal semantic representations. Extensive experiments reflect that our method successfully unlocks DiT's zero-shot ability for consistent subject synthesis across diverse contexts, achieving state-of-the-art or comparable results compared to approaches that require additional training. Notably, our framework demonstrates seamless compatibility with existing inpainting pipelines and control modules, facilitating more compelling experiences. Our code is available at: https://github.com/Monalissaa/FreeCus.
中文摘要:FreeCus是一种无需训练的框架,通过注意力共享机制、动态偏移分析和多模态大语言模型集成,激活扩散变换器的零样本能力实现真实的主体驱动图像生成,在不需额外训练的情况下达到领先水平。
English Summary: FreeCus is a training-free framework that activates diffusion transformers' zero-shot capabilities for authentic subject-driven image synthesis through attention sharing, dynamic shifting analysis, and MLLM integration, achieving state-of-the-art results without requiring additional training.
Authors:Xiaofeng Shi, Yuduo Li, Qian Kou, Longbin Yu, Jinxin Xie, Hua Zhou
Abstract:
Recent advances in large language models (LLMs) have opened new opportunities for academic literature retrieval. However, existing systems often rely on rigid pipelines and exhibit limited reasoning capabilities. We introduce SPAR, a multi-agent framework that incorporates RefChain-based query decomposition and query evolution to enable more flexible and effective search. To facilitate systematic evaluation, we also construct SPARBench, a challenging benchmark with expert-annotated relevance labels. Experimental results demonstrate that SPAR substantially outperforms strong baselines, achieving up to +56% F1 on AutoScholar and +23% F1 on SPARBench over the best-performing baseline. Together, SPAR and SPARBench provide a scalable, interpretable, and high-performing foundation for advancing research in scholarly retrieval. Code and data will be available at: https://github.com/xiaofengShi/SPAR
中文:SPAR采用多智能体框架,结合RefChain查询分解与进化技术,显著提升学术文献检索性能,在基准测试中F1值最高超出基线56%,并构建了SPARBench作为系统评估基准。
English: SPAR introduces a multi-agent framework with RefChain query decomposition and evolution to enhance academic literature retrieval, significantly outperforming baselines by up to +56% F1 and establishing SPARBench as a robust evaluation benchmark.
Authors:Naeem Paeedeh, Mahardhika Pratama, Wolfgang Mayer, Jimmy Cao, Ryszard Kowlczyk
Abstract:
Despite the progress in Cross-Domain Few-Shot Learning (CD-FSL), a model pre-trained with DINO combined with a prototypical classifier outperforms the latest SOTA methods. A crucial limitation that needs to be overcome is that updating too many parameters of the transformers leads to overfitting due to the scarcity of labeled samples. To address this challenge, we propose a new concept, Coalescent Projection (CP), as an effective successor to soft prompts. Additionally, we propose a novel pseudo-class generation method combined with Self-Supervised Transformations (SSTs) that relies solely on the base domain to prepare the network for encountering unseen samples from different domains. The proposed method exhibits its effectiveness in comprehensive experiments on the extreme domain shift scenario of the BSCD-FSL benchmark. Our code is published at https://github.com/Naeem-Paeedeh/CPLSR.
中文: 本研究提出融合投影和结合自监督变换的伪类生成方法,以解决跨域少样本学习中的过拟合问题,在BSCD-FSL基准测试中展现出卓越性能。
English: The study introduces Coalescent Projection and a pseudo-class generation method with Self-Supervised Transformations to overcome overfitting in Cross-Domain Few-Shot Learning, demonstrating superior performance on the BSCD-FSL benchmark.
Authors:Le Peng, Yash Travadi, Chuan He, Ying Cui, Ju Sun
Abstract:
For classification with imbalanced class frequencies, i.e., imbalanced classification (IC), standard accuracy is known to be misleading as a performance measure. While most existing methods for IC resort to optimizing balanced accuracy (i.e., the average of class-wise recalls), they fall short in scenarios where the significance of classes varies or certain metrics should reach prescribed levels. In this paper, we study two key classification metrics, precision and recall, under three practical binary IC settings: fix precision optimize recall (FPOR), fix recall optimize precision (FROP), and optimize $F_β$-score (OFBS). Unlike existing methods that rely on smooth approximations to deal with the indicator function involved, \textit{we introduce, for the first time, exact constrained reformulations for these direct metric optimization (DMO) problems}, which can be effectively solved by exact penalty methods. Experiment results on multiple benchmark datasets demonstrate the practical superiority of our approach over the state-of-the-art methods for the three DMO problems. We also expect our exact reformulation and optimization (ERO) framework to be applicable to a wide range of DMO problems for binary IC and beyond. Our code is available at https://github.com/sun-umn/DMO.
中文摘要:本文针对不平衡分类中的直接指标优化问题,首次提出了精确约束重构方法,通过惩罚算法实现对精确率和召回率的精准控制,并在三种实际场景中验证了其优于现有方法的性能。
English Summary: This paper introduces exact constrained reformulations for direct metric optimization in imbalanced classification, enabling precise control over precision and recall through penalty methods and demonstrating superior performance across three practical scenarios.
Authors:Siqi Chen, Guoqing Zhang, Jiahao Lai, Bingzhi Shen, Sihong Zhang, Caixia Dong, Xuejin Chen, Yang Li
Abstract:
Advancements in 3D vision have increased the impact of blood vessel modeling on medical applications. However, accurately representing the complex geometry and topology of blood vessels remains a challenge due to their intricate branching patterns, curvatures, and irregular shapes. In this study, we propose a hierarchical part-based frame work for 3D vessel generation that separates the global binary tree-like topology from local geometric details. Our approach proceeds in three stages: (1) key graph generation to model the overall hierarchical struc ture, (2) vessel segment generation conditioned on geometric properties, and (3) hierarchical vessel assembly by integrating the local segments according to the global key graph. We validate our framework on real world datasets, demonstrating superior performance over existing methods in modeling complex vascular networks. This work marks the first successful application of a part-based generative approach for 3D vessel modeling, setting a new benchmark for vascular data generation. The code is available at: https://github.com/CybercatChen/PartVessel.git.
中文: 本研究提出了一种分层部件式框架用于三维血管生成,将全局拓扑与局部几何细节分离,在真实数据集上验证了其优越性能,为血管建模设立了新标杆。
English: This study introduces a hierarchical part-based framework for 3D blood vessel generation that separates global topology from local geometry, achieving superior performance on real-world datasets and setting a new benchmark in vascular modeling.
Authors:Justin Turnau, Longchao Da, Khoa Vo, Ferdous Al Rafi, Shreyas Bachiraju, Tiejin Chen, Hua Wei
Abstract:
Traffic Signal Control (TSC) is essential for managing urban traffic flow and reducing congestion. Reinforcement Learning (RL) offers an adaptive method for TSC by responding to dynamic traffic patterns, with multi-agent RL (MARL) gaining traction as intersections naturally function as coordinated agents. However, due to shifts in environmental dynamics, implementing MARL-based TSC policies in the real world often leads to a significant performance drop, known as the sim-to-real gap. Grounded Action Transformation (GAT) has successfully mitigated this gap in single-agent RL for TSC, but real-world traffic networks, which involve numerous interacting intersections, are better suited to a MARL framework. In this work, we introduce JL-GAT, an application of GAT to MARL-based TSC that balances scalability with enhanced grounding capability by incorporating information from neighboring agents. JL-GAT adopts a decentralized approach to GAT, allowing for the scalability often required in real-world traffic networks while still capturing key interactions between agents. Comprehensive experiments on various road networks under simulated adverse weather conditions, along with ablation studies, demonstrate the effectiveness of JL-GAT. The code is publicly available at https://github.com/DaRL-LibSignal/JL-GAT/.
中文摘要:本文提出JL-GAT方法,将接地动作转换应用于多智能体强化学习的交通信号控制,通过邻域智能体信息交互,在保持系统扩展性的同时有效解决了仿真与现实间的性能差距问题。
English summary: This paper introduces JL-GAT, a decentralized method applying Grounded Action Transformation to multi-agent reinforcement learning for traffic signal control, which effectively addresses the sim-to-real gap while maintaining scalability in complex urban networks.
Authors:Juntong Ni, Shiyu Wang, Zewen Liu, Xiaoming Shi, Xinyue Zhong, Zhou Ye, Wei Jin
Abstract:
Time series forecasting (TSF) is a central problem in time series analysis. However, as the number of channels in time series datasets scales to the thousands or more, a scenario we define as High-Dimensional Time Series Forecasting (HDTSF), it introduces significant new modeling challenges that are often not the primary focus of traditional TSF research. HDTSF is challenging because the channel correlation often forms complex and hierarchical patterns. Existing TSF models either ignore these interactions or fail to scale as dimensionality grows. To address this issue, we propose U-Cast, a channel-dependent forecasting architecture that learns latent hierarchical channel structures with an innovative query-based attention. To disentangle highly correlated channel representation, U-Cast adds a full-rank regularization during training. We also release Time-HD, the first benchmark of large, diverse, high-dimensional datasets. Our theory shows that exploiting cross-channel information lowers forecasting risk, and experiments on Time-HD demonstrate that U-Cast surpasses strong baselines in both accuracy and efficiency. Together, U-Cast and Time-HD provide a solid basis for future HDTSF research.
中文: U-Cast通过基于查询的注意力机制和全秩正则化,提出了一种通道依赖的预测架构,能有效建模高维时间序列中的层次化通道相关性,同时Time-HD基准数据集验证了该模型在精度和效率上的优越表现。
English: U-Cast introduces a channel-dependent forecasting architecture with query-based attention and full-rank regularization to effectively model hierarchical channel correlations in high-dimensional time series, while the Time-HD benchmark provides foundational datasets demonstrating U-Cast's superior accuracy and efficiency.
Authors:Mohammad-Maher Nakshbandi, Ziad Sharawy, Sorin Grigorescu
Abstract:
One of the main challenges in the Simultaneous Localization and Mapping (SLAM) loop closure problem is the recognition of previously visited places. In this work, we tackle the two main problems of real-time SLAM systems: 1) loop closure detection accuracy and 2) real-time computation constraints on the embedded hardware. Our LoopNet method is based on a multitasking variant of the classical ResNet architecture, adapted for online retraining on a dynamic visual dataset and optimized for embedded devices. The online retraining is designed using a few-shot learning approach. The architecture provides both an index into the queried visual dataset, and a measurement of the prediction quality. Moreover, by leveraging DISK (DIStinctive Keypoints) descriptors, LoopNet surpasses the limitations of handcrafted features and traditional deep learning methods, offering better performance under varying conditions. Code is available at https://github.com/RovisLab/LoopNet. Additinally, we introduce a new loop closure benchmarking dataset, coined LoopDB, which is available at https://github.com/RovisLab/LoopDB.
中文: LoopNet通过多任务ResNet架构结合少量样本在线学习,利用DISK描述符提升SLAM闭环检测的实时性和准确性,有效应对不同环境变化。
English: LoopNet addresses SLAM loop closure challenges by using a multitasking ResNet for real-time detection and few-shot online retraining, enhanced with DISK descriptors for robust performance under varying conditions.
Authors:Yiyuan Yang, Zichuan Liu, Lei Song, Kai Ying, Zhiguang Wang, Tom Bamford, Svitlana Vyetrenko, Jiang Bian, Qingsong Wen
Abstract:
Time series anomaly detection is critical across various domains, yet current approaches often limit analysis to mere binary anomaly classification without detailed categorization or further explanatory reasoning. To address these limitations, we propose a novel task, Time-series Reasoning for Anomaly (Time-RA) that transforms classical time series anomaly detection from a discriminative into a generative, reasoning-intensive task leveraging Large Language Models (LLMs). Also, we introduce the first real-world multimodal benchmark dataset, RATs40K, explicitly annotated for anomaly reasoning, comprising approximately 40,000 samples across 10 real-world domains. Each sample includes numeric time series data, contextual text information, and visual representations, each annotated with fine-grained categories (14 types for univariate anomalies and 6 for multivariate anomalies) and structured explanatory reasoning. We develop a sophisticated annotation framework utilizing ensemble-generated labels refined through GPT-4-driven feedback, ensuring accuracy and interpretability. Extensive benchmarking of LLMs and multimodal LLMs demonstrates the capabilities and limitations of current models, highlighting the critical role of supervised fine-tuning. Our dataset and task pave the way for significant advancements in interpretable time series anomaly detection and reasoning. The code (https://github.com/yyysjz1997/Time-RA) and dataset (https://huggingface.co/datasets/Time-RA/RATs40K) have been fully open-sourced to support and accelerate future research in this area.
中文摘要:该研究提出了Time-RA任务,利用大语言模型将时序异常检测转化为生成式推理任务,并发布了包含4万样本的多模态基准数据集RATs40K,推动可解释异常检测的发展。
English Summary: The study introduces Time-RA, a generative reasoning task using Large Language Models for detailed anomaly categorization and explanation in time series data, supported by the multimodal RATs40K benchmark dataset with fine-grained annotations.
Authors:Yiyuan Yang, Zichuan Liu, Lei Song, Kai Ying, Zhiguang Wang, Tom Bamford, Svitlana Vyetrenko, Jiang Bian, Qingsong Wen
Abstract:
Time series anomaly detection is critical across various domains, yet current approaches often limit analysis to mere binary anomaly classification without detailed categorization or further explanatory reasoning. To address these limitations, we propose a novel task, Time-series Reasoning for Anomaly (Time-RA) that transforms classical time series anomaly detection from a discriminative into a generative, reasoning-intensive task leveraging Large Language Models (LLMs). Also, we introduce the first real-world multimodal benchmark dataset, RATs40K, explicitly annotated for anomaly reasoning, comprising approximately 40,000 samples across 10 real-world domains. Each sample includes numeric time series data, contextual text information, and visual representations, each annotated with fine-grained categories (14 types for univariate anomalies and 6 for multivariate anomalies) and structured explanatory reasoning. We develop a sophisticated annotation framework utilizing ensemble-generated labels refined through GPT-4-driven feedback, ensuring accuracy and interpretability. Extensive benchmarking of LLMs and multimodal LLMs demonstrates the capabilities and limitations of current models, highlighting the critical role of supervised fine-tuning. Our dataset and task pave the way for significant advancements in interpretable time series anomaly detection and reasoning. The code (https://github.com/yyysjz1997/Time-RA) and dataset (https://huggingface.co/datasets/Time-RA/RATs40K) have been fully open-sourced to support and accelerate future research in this area.
中文摘要:该研究提出了Time-RA任务,利用大语言模型将时序异常检测转化为生成式推理任务,并发布了包含4万样本的多模态基准数据集RATs40K,推动可解释异常检测的发展。
English Summary: The study introduces Time-RA, a generative reasoning task using Large Language Models for detailed anomaly categorization and explanation in time series data, supported by the multimodal RATs40K benchmark dataset with fine-grained annotations.
Authors:Ran Zhang, Xuanhua He, Li Xueheng, Ke Cao, Liu Liu, Wenbo Xu, Fang Jiabin, Yang Qize, Jie Zhang
Abstract:
The field of pan-sharpening has recently seen a trend towards increasingly large and complex models, often trained on single, specific satellite datasets. This approach, however, leads to high computational overhead and poor generalization on full resolution data, a paradigm we challenge in this paper. In response to this issue, we propose PanTiny, a lightweight, single-step pan-sharpening framework designed for both efficiency and robust performance. More critically, we introduce multiple-in-one training paradigm, where a single, compact model is trained simultaneously on three distinct satellite datasets (WV2, WV3, and GF2) with different resolution and spectral information. Our experiments show that this unified training strategy not only simplifies deployment but also significantly boosts generalization on full-resolution data. Further, we introduce a universally powerful composite loss function that elevates the performance of almost all of models for pan-sharpening, pushing state-of-the-art metrics into a new era. Our PanTiny model, benefiting from these innovations, achieves a superior performance-to-efficiency balance, outperforming most larger, specialized models. Through extensive ablation studies, we validate that principled engineering in model design, training paradigms, and loss functions can surpass brute-force scaling. Our work advocates for a community-wide shift towards creating efficient, generalizable, and data-conscious models for pan-sharpening. The code is available at https://github.com/Zirconium233/PanTiny .
中文: 本文提出PanTiny轻量级全色锐化框架,通过多数据集联合训练和复合损失函数,在实现卓越性能与泛化能力的同时,突破了当前模型庞大且专一化的研究范式。
English: This paper introduces PanTiny, a lightweight and efficient pan-sharpening framework trained on multiple satellite datasets with a novel composite loss function, achieving superior performance and generalization while challenging the trend of large, specialized models.
Authors:Zhaotong Yang, Yuhui Li, Shengfeng He, Xinzhe Li, Yangyang Xu, Junyu Dong, Yong Du
Abstract:
Image-based Virtual Try-On (VTON) techniques rely on either supervised in-shop approaches, which ensure high fidelity but struggle with cross-domain generalization, or unsupervised in-the-wild methods, which improve adaptability but remain constrained by data biases and limited universality. A unified, training-free solution that works across both scenarios remains an open challenge. We propose OmniVTON, the first training-free universal VTON framework that decouples garment and pose conditioning to achieve both texture fidelity and pose consistency across diverse settings. To preserve garment details, we introduce a garment prior generation mechanism that aligns clothing with the body, followed by continuous boundary stitching technique to achieve fine-grained texture retention. For precise pose alignment, we utilize DDIM inversion to capture structural cues while suppressing texture interference, ensuring accurate body alignment independent of the original image textures. By disentangling garment and pose constraints, OmniVTON eliminates the bias inherent in diffusion models when handling multiple conditions simultaneously. Experimental results demonstrate that OmniVTON achieves superior performance across diverse datasets, garment types, and application scenarios. Notably, it is the first framework capable of multi-human VTON, enabling realistic garment transfer across multiple individuals in a single scene. Code is available at https://github.com/Jerome-Young/OmniVTON
中文摘要:OmniVTON是首个无需训练的统一虚拟试穿框架,通过解耦服装与姿态条件实现在多样化场景下的精细纹理保持与精准姿态对齐,首次支持多人场景的逼真服装转换。
English Summary: OmniVTON is a training-free universal virtual try-on framework that decouples garment and pose conditioning to achieve high texture fidelity and pose consistency across diverse scenarios, including multi-human applications.
Authors:Hao Li, Haoxiang Zhang, Ahmed E. Hassan
Abstract:
The future of software engineering--SE 3.0--is unfolding with the rise of AI teammates: autonomous, goal-driven systems collaborating with human developers. Among these, autonomous coding agents are especially transformative, now actively initiating, reviewing, and evolving code at scale. This paper introduces AIDev, the first large-scale dataset capturing how such agents operate in the wild. Spanning over 456,000 pull requests by five leading agents--OpenAI Codex, Devin, GitHub Copilot, Cursor, and Claude Code--across 61,000 repositories and 47,000 developers, AIDev provides an unprecedented empirical foundation for studying autonomous teammates in software development.
Unlike prior work that has largely theorized the rise of AI-native software engineering, AIDev offers structured, open data to support research in benchmarking, agent readiness, optimization, collaboration modeling, and AI governance. The dataset includes rich metadata on PRs, authorship, review timelines, code changes, and integration outcomes--enabling exploration beyond synthetic benchmarks like SWE-bench. For instance, although agents often outperform humans in speed, their PRs are accepted less frequently, revealing a trust and utility gap. Furthermore, while agents accelerate code submission--one developer submitted as many PRs in three days as they had in three years--these are structurally simpler (via code complexity metrics).
We envision AIDev as a living resource: extensible, analyzable, and ready for the SE and AI communities. Grounding SE 3.0 in real-world evidence, AIDev enables a new generation of research into AI-native workflows and supports building the next wave of symbiotic human-AI collaboration. The dataset is publicly available at https://github.com/SAILResearch/AI_Teammates_in_SE3.
> AI Agent, Agentic AI, Coding Agent, Agentic Coding, Software Engineering Agent
中文: AIDev数据集首次大规模实证研究软件工程中AI队友的工作模式,涵盖五个主流自主编程代理在61000个仓库中的456000次拉取请求,揭示了AI代理与人类开发者的协作特征及效能差异。
English: The AIDev dataset provides the first large-scale empirical foundation for studying AI teammates in software engineering, capturing over 456,000 pull requests from five leading autonomous coding agents across 61,000 repositories to analyze their real-world collaboration patterns and performance gaps compared to human developers.
Authors:Hao Zheng, Shunzhi Yang, Zhuoxin He, Jinfeng Yang, Zhenhua Huang
Abstract:
Pre-trained Vision-Language Models (VLMs) such as CLIP have shown excellent generalization abilities. However, adapting these large-scale models to downstream tasks while preserving their generalization capabilities remains challenging. Although prompt learning methods have shown promise, they suffer from two fundamental bottlenecks that limit generalization: (a) modality isolation, and (b) hierarchical semantic decay. To address these limitations, we propose HiCroPL, a Hierarchical Cross-modal Prompt Learning framework that establishes bidirectional knowledge flow between text and vision modalities, enabling them to refine their semantics mutually. HiCroPL routes knowledge flows by leveraging the complementary strengths of text and vision. In early layers, text prompts inject relatively clear semantics into visual prompts through a hierarchical knowledge mapper, enhancing the representation of low-level visual semantics. In later layers, visual prompts encoding specific task-relevant objects flow back to refine text prompts, enabling deeper alignment. Crucially, our hierarchical knowledge mapper allows representations at multi-scales to be fused, ensuring that deeper representations retain transferable shallow semantics thereby enhancing generalization. We further introduce a lightweight layer-specific knowledge proxy to enable efficient cross-modal interactions. Extensive evaluations across four tasks demonstrate HiCroPL's superior performance, achieving state-of-the-art results on 11 benchmarks with significant improvements. Code is available at: https://github.com/zzeoZheng/HiCroPL.
中文: HiCroPL提出了一种分层跨模态提示学习框架,通过建立文本与视觉模态间的双向知识流动实现语义相互增强,在多项基准测试中取得了最优性能。
English: HiCroPL introduces a hierarchical cross-modal prompt learning framework that establishes bidirectional knowledge flow between text and vision modalities, enabling mutual semantic refinement and achieving state-of-the-art performance across multiple benchmarks.
Authors:Saeid Ghafouri, Mohsen Fayyaz, Xiangchen Li, Deepu John, Bo Ji, Dimitrios Nikolopoulos, Hans Vandierendonck
Abstract:
Real-time multi-label video classification on embedded devices is constrained by limited compute and energy budgets. Yet, video streams exhibit structural properties such as label sparsity, temporal continuity, and label co-occurrence that can be leveraged for more efficient inference. We introduce Polymorph, a context-aware framework that activates a minimal set of lightweight Low Rank Adapters (LoRA) per frame. Each adapter specializes in a subset of classes derived from co-occurrence patterns and is implemented as a LoRA weight over a shared backbone. At runtime, Polymorph dynamically selects and composes only the adapters needed to cover the active labels, avoiding full-model switching and weight merging. This modular strategy improves scalability while reducing latency and energy overhead. Polymorph achieves 40% lower energy consumption and improves mAP by 9 points over strong baselines on the TAO dataset. Polymorph is open source at https://github.com/inference-serving/polymorph/.
Chinese: Polymorph是一种上下文感知框架,通过动态低秩适配器实现高效的实时多标签视频分类,在嵌入式设备上能耗降低40%,平均精度提升9个百分点。
English: Polymorph is a context-aware framework that uses dynamic Low Rank Adapters for efficient real-time multi-label video classification, reducing energy consumption by 40% and improving mAP by 9 points on embedded devices.
Authors:Hai Huang, Yan Xia, Shulei Wang, Hanting Wang, Minghui Fang, Shengpeng Ji, Sashuai Zhou, Tao Jin, Zhou Zhao
Abstract:
This paper extends Cross Modal Generalization (CMG) to open-set environments by proposing the more challenging Open-set Cross Modal Generalization (OSCMG) task. This task evaluates multimodal unified representations in open-set conditions, addressing the limitations of prior closed-set cross-modal evaluations. OSCMG requires not only cross-modal knowledge transfer but also robust generalization to unseen classes within new modalities, a scenario frequently encountered in real-world applications. Existing multimodal unified representation work lacks consideration for open-set environments. To tackle this, we propose MICU, comprising two key components: Fine-Coarse Masked multimodal InfoNCE (FCMI) and Cross modal Unified Jigsaw Puzzles (CUJP). FCMI enhances multimodal alignment by applying contrastive learning at both holistic semantic and temporal levels, incorporating masking to enhance generalization. CUJP enhances feature diversity and model uncertainty by integrating modality-agnostic feature selection with self-supervised learning, thereby strengthening the model's ability to handle unknown categories in open-set tasks. Extensive experiments on CMG and the newly proposed OSCMG validate the effectiveness of our approach. The code is available at https://github.com/haihuangcode/CMG.
中文摘要:本文提出开放集跨模态泛化任务OSCMG,并设计包含FCMI和CUJP模块的MICU方法,通过增强多模态对齐与特征多样性来提升开放集环境下的泛化能力。
English Summary: This paper introduces the Open-set Cross Modal Generalization (OSCMG) task and proposes MICU with FCMI and CUJP components to enhance multimodal alignment and feature diversity for robust generalization in open-set environments.
Authors:Xiaojie Li, Chu Li, Shi-Zhe Chen, Xi Chen
Abstract:
Universal multimodal retrieval (UMR), which aims to address complex retrieval tasks where both queries and candidates span diverse modalities, has been significantly advanced by the emergence of MLLMs. While state-of-the-art MLLM-based methods in the literature predominantly adopt contrastive learning principles, they often differ in their specific training recipes. Despite their success, the mechanisms underlying their retrieval capabilities remain largely unexplored, potentially resulting in suboptimal performance and limited generalization ability. To address these issues, we present a comprehensive study aimed at uncovering the key factors that drive effective embedding learning for UMR using MLLMs. We begin by implementing a general MLLM-based embedding learning pipeline, and systematically analyze the primary contributors to high-performing universal retrieval systems. Based on this, we explore various aspects of the details in embedding generation and training strategies, including progressive transition, hard negative mining and re-ranker distillation. Notably, our findings reveal that often-overlooked factors can have a substantial impact on model performance. Building on these discoveries, we introduce a unified framework termed U-MARVEL (\textbf{U}niversal \textbf{M}ultimod\textbf{A}l \textbf{R}etrie\textbf{V}al via \textbf{E}mbedding \textbf{L}earning), which outperforms state-of-the-art competitors on the M-BEIR benchmark by a large margin in supervised settings, and also exihibits strong zero-shot performance on several tasks such as composed image retrieval and text-to-video retrieval. These results underscore the generalization potential of our framework across various embedding-based retrieval tasks. Code is available at https://github.com/chaxjli/U-MARVEL
中文: 本研究提出U-MARVEL统一框架,通过分析多模态大模型嵌入学习的关键要素,在监督设定下显著超越现有方法,并在零样本任务中展现出强大的泛化能力。
English: This study introduces U-MARVEL, a unified framework that identifies key factors for effective multimodal retrieval using MLLMs and achieves superior performance on benchmarks through optimized embedding strategies.
Authors:Ronit D. Gross, Yarden Tzach, Tal Halevi, Ella Koresh, Ido Kanter
Abstract:
A prominent achievement of natural language processing (NLP) is its ability to understand and generate meaningful human language. This capability relies on complex feedforward transformer block architectures pre-trained on large language models (LLMs). However, LLM pre-training is currently feasible only for a few dominant companies due to the immense computational resources required, limiting broader research participation. This creates a critical need for more accessible alternatives. In this study, we explore whether tiny language models (TLMs) exhibit the same key qualitative features of LLMs. We demonstrate that TLMs exhibit a clear performance gap between pre-trained and non-pre-trained models across classification tasks, indicating the effectiveness of pre-training, even at a tiny scale. The performance gap increases with the size of the pre-training dataset and with greater overlap between tokens in the pre-training and classification datasets. Furthermore, the classification accuracy achieved by a pre-trained deep TLM architecture can be replicated through a soft committee of multiple, independently pre-trained shallow architectures, enabling low-latency TLMs without affecting classification accuracy. Our results are based on pre-training BERT-6 and variants of BERT-1 on subsets of the Wikipedia dataset and evaluating their performance on FewRel, AGNews, and DBPedia classification tasks. Future research on TLM is expected to further illuminate the mechanisms underlying NLP, especially given that its biologically inspired models suggest that TLMs may be sufficient for children or adolescents to develop language. The data and code that support the findings of this study are openly available on https://github.com/Rg32601/Tiny-Language-Models .
中文摘要:本研究表明,微型语言模型通过预训练能展现出与大型语言模型相似的关键特性,其性能随预训练数据量和任务相关性的增加而提升,且通过集成学习可在保持精度的同时实现低延迟处理。
English Summary: This study demonstrates that tiny language models (TLMs) exhibit significant performance improvements through pre-training similar to large language models, with effectiveness scaling alongside dataset size and token overlap, while showing that ensemble methods can achieve comparable accuracy with lower latency.
Authors:Xingshu Chen, Sicheng Yu, Chong Cheng, Hao Wang, Ting Tian
Abstract:
This paper investigates the problem of object detection with a focus on improving both the localization accuracy of bounding boxes and explicitly modeling prediction uncertainty. Conventional detectors rely on deterministic bounding box regression, ignoring uncertainty in predictions and limiting model robustness. In this paper, we propose an uncertainty-aware enhancement framework for DETR-based object detectors. We model bounding boxes as multivariate Gaussian distributions and incorporate the Gromov-Wasserstein distance into the loss function to better align the predicted and ground-truth distributions. Building on this, we derive a Bayes Risk formulation to filter high-risk information and improve detection reliability. We also propose a simple algorithm to quantify localization uncertainty via confidence intervals. Experiments on the COCO benchmark show that our method can be effectively integrated into existing DETR variants, enhancing their performance. We further extend our framework to leukocyte detection tasks, achieving state-of-the-art results on the LISC and WBCDD datasets. These results confirm the scalability of our framework across both general and domain-specific detection tasks. Code page: https://github.com/ParadiseforAndaChen/An-Uncertainty-aware-DETR-Enhancement-Framework-for-Object-Detection.
中文: 本文提出了一种面向DETR检测器的不确定性增强框架,通过将边界框建模为高斯分布并引入Gromov-Wasserstein距离来提升定位精度和不确定性建模能力,在通用与特定领域检测任务中均取得了最优性能。
English: This paper introduces an uncertainty-aware enhancement framework for DETR-based object detectors, modeling bounding boxes as Gaussian distributions and incorporating Gromov-Wasserstein distance to improve localization accuracy and uncertainty modeling, achieving state-of-the-art results on both general and domain-specific detection tasks.
Authors:Abdul-Kazeem Shamba, Kerstin Bach, Gavin Taylor
Abstract:
We revisit previous contrastive learning frameworks to investigate the effect of introducing an adaptive margin into the contrastive loss function for time series representation learning. Specifically, we explore whether an adaptive margin (eMargin), adjusted based on a predefined similarity threshold, can improve the separation between adjacent but dissimilar time steps and subsequently lead to better performance in downstream tasks. Our study evaluates the impact of this modification on clustering performance and classification in three benchmark datasets. Our findings, however, indicate that achieving high scores on unsupervised clustering metrics does not necessarily imply that the learned embeddings are meaningful or effective in downstream tasks. To be specific, eMargin added to InfoNCE consistently outperforms state-of-the-art baselines in unsupervised clustering metrics, but struggles to achieve competitive results in downstream classification with linear probing. The source code is publicly available at https://github.com/sfi-norwai/eMargin.
中文: 本研究在时间序列对比学习中引入自适应边界(eMargin),发现其虽能提升无监督聚类指标,但未能改善下游分类任务的性能。
English: This study introduces an adaptive margin (eMargin) into contrastive loss for time series learning, finding it improves unsupervised clustering metrics but fails to translate into better downstream classification performance.
Authors:Kunyu Yu, Rui Yang, Jingchi Liao, Siqi Li, Huitao Li, Irene Li, Yifan Peng, Rishikesan Kamaleswaran, Nan Liu
Abstract:
Foundation models have emerged as a powerful approach for processing electronic health records (EHRs), offering flexibility to handle diverse medical data modalities. In this study, we present a comprehensive benchmark that evaluates the performance, fairness, and interpretability of foundation models, both as unimodal encoders and as multimodal learners, using the publicly available MIMIC-IV database. To support consistent and reproducible evaluation, we developed a standardized data processing pipeline that harmonizes heterogeneous clinical records into an analysis-ready format. We systematically compared eight foundation models, encompassing both unimodal and multimodal models, as well as domain-specific and general-purpose variants. Our findings demonstrate that incorporating multiple data modalities leads to consistent improvements in predictive performance without introducing additional bias. Through this benchmark, we aim to support the development of effective and trustworthy multimodal artificial intelligence (AI) systems for real-world clinical applications. Our code is available at https://github.com/nliulab/MIMIC-Multimodal.
中文: 本研究基于MIMIC-IV数据库构建了综合评估基准,结果表明多模态融合能持续提升预测性能且不引入额外偏差,旨在推动可信赖医疗AI系统的发展。
English: This study establishes a comprehensive benchmark evaluating foundation models' performance, fairness, and interpretability using the MIMIC-IV database, demonstrating that multimodal integration enhances predictive accuracy without increasing bias.
Authors:Shoutao Guo, Shaolei Zhang, Qingkai Fang, Zhengrui Ma, Min Zhang, Yang Feng
Abstract:
The rapid advancement of Large Language Models (LLMs) has spurred significant progress in Large Speech-Language Models (LSLMs), enhancing their capabilities in both speech understanding and generation. While existing LSLMs often concentrate on augmenting speech generation or tackling a diverse array of short-speech tasks, the efficient processing of long-form speech remains a critical yet underexplored challenge. This gap is primarily attributed to the scarcity of long-speech training datasets and the high computational costs associated with long sequences. To address these limitations, we introduce FastLongSpeech, a novel framework designed to extend LSLM capabilities for efficient long-speech processing without necessitating dedicated long-speech training data. FastLongSpeech incorporates an iterative fusion strategy that can compress excessively long-speech sequences into manageable lengths. To adapt LSLMs for long-speech inputs, it introduces a dynamic compression training approach, which exposes the model to short-speech sequences at varying compression ratios, thereby transferring the capabilities of LSLMs to long-speech tasks. To assess the long-speech capabilities of LSLMs, we develop a long-speech understanding benchmark called LongSpeech-Eval. Experiments show that our method exhibits strong performance in both long-speech and short-speech tasks, while greatly improving inference efficiency.
中文:FastLongSpeech是一种创新框架,通过迭代压缩和动态训练方法,无需专门的长语音训练数据即可高效扩展大语音语言模型处理长语音的能力,同时在长短语音任务中均表现出色。
English: FastLongSpeech is a novel framework that efficiently extends large speech-language models to handle long-form speech processing through iterative compression and dynamic training, without requiring dedicated long-speech data, while maintaining strong performance in both long and short speech tasks.
Authors:Sam Johnson, Viet Pham, Thai Le
Abstract:
This work demonstrates that LLM-based web navigation agents offer powerful automation capabilities but are vulnerable to Indirect Prompt Injection (IPI) attacks. We show that adversaries can embed universal adversarial triggers in webpage HTML to hijack agent behavior that utilizes the accessibility tree to parse HTML, causing unintended or malicious actions. Using the Greedy Coordinate Gradient (GCG) algorithm and a Browser Gym agent powered by Llama-3.1, our system demonstrates high success rates across real websites in both targeted and general attacks, including login credential exfiltration and forced ad clicks. Our empirical results highlight critical security risks and the need for stronger defenses as LLM-driven autonomous web agents become more widely adopted. The system software (https://github.com/sej2020/manipulating-web-agents) is released under the MIT License, with an accompanying publicly available demo website (http://lethaiq.github.io/attack-web-llm-agent).
中文: 基于大语言模型的网页导航代理易受间接提示注入攻击,攻击者可通过在网页HTML中嵌入触发器来劫持代理行为,导致如窃取登录凭证等严重安全风险。
English: LLM-based web navigation agents are vulnerable to Indirect Prompt Injection attacks, where adversaries embed triggers in webpage HTML to hijack agent behavior, leading to security risks like credential theft and unauthorized actions.
Authors:Beier Zhu, Ruoyu Wang, Tong Zhao, Hanwang Zhang, Chi Zhang
Abstract:
Diffusion models (DMs) have achieved state-of-the-art generative performance but suffer from high sampling latency due to their sequential denoising nature. Existing solver-based acceleration methods often face image quality degradation under a low-latency budget. In this paper, we propose the Ensemble Parallel Direction solver (dubbed as \ours), a novel ODE solver that mitigates truncation errors by incorporating multiple parallel gradient evaluations in each ODE step. Importantly, since the additional gradient computations are independent, they can be fully parallelized, preserving low-latency sampling.
Our method optimizes a small set of learnable parameters in a distillation fashion, ensuring minimal training overhead.
In addition, our method can serve as a plugin to improve existing ODE samplers. Extensive experiments on various image synthesis benchmarks demonstrate the effectiveness of our \ours~in achieving high-quality and low-latency sampling. For example, at the same latency level of 5 NFE, EPD achieves an FID of 4.47 on CIFAR-10, 7.97 on FFHQ, 8.17 on ImageNet, and 8.26 on LSUN Bedroom, surpassing existing learning-based solvers by a significant margin. Codes are available in https://github.com/BeierZhu/EPD.
中文: 提出的集成并行方向求解器通过并行梯度评估减少截断误差,在保持低延迟采样的同时显著提升图像生成质量,可作为插件优化现有求解器。
English: The proposed Ensemble Parallel Direction solver enhances diffusion model efficiency by using parallel gradient evaluations to reduce truncation errors while maintaining low-latency sampling through full parallelization and minimal training overhead.
Authors:Joseph Raj Vishal, Divesh Basina, Rutuja Patil, Manas Srinivas Gowda, Katha Naik, Yezhou Yang, Bharatesh Chakravarthi
Abstract:
Traffic monitoring is crucial for urban mobility, road safety, and intelligent transportation systems (ITS). Deep learning has advanced video-based traffic monitoring through video question answering (VideoQA) models, enabling structured insight extraction from traffic videos. However, existing VideoQA models struggle with the complexity of real-world traffic scenes, where multiple concurrent events unfold across spatiotemporal dimensions. To address these challenges, this paper introduces \textbf{InterAct VideoQA}, a curated dataset designed to benchmark and enhance VideoQA models for traffic monitoring tasks. The InterAct VideoQA dataset comprises 8 hours of real-world traffic footage collected from diverse intersections, segmented into 10-second video clips, with over 25,000 question-answer (QA) pairs covering spatiotemporal dynamics, vehicle interactions, incident detection, and other critical traffic attributes. State-of-the-art VideoQA models are evaluated on InterAct VideoQA, exposing challenges in reasoning over fine-grained spatiotemporal dependencies within complex traffic scenarios. Additionally, fine-tuning these models on InterAct VideoQA yields notable performance improvements, demonstrating the necessity of domain-specific datasets for VideoQA. InterAct VideoQA is publicly available as a benchmark dataset to facilitate future research in real-world deployable VideoQA models for intelligent transportation systems. GitHub Repo: https://github.com/joe-rabbit/InterAct_VideoQA
中文: 本文提出了InterAct VideoQA数据集,包含真实交通录像和超过2.5万个问答对,专门用于解决现有视频问答模型在处理复杂时空交通场景时的局限性,以推动智能交通系统的发展。
English: This paper introduces the InterAct VideoQA dataset, a specialized benchmark with real-world traffic footage and over 25,000 QA pairs, designed to address the limitations of existing VideoQA models in handling complex spatiotemporal traffic scenarios and to advance intelligent transportation systems.
Authors:RafaÅ Surdej, MichaÅ Bortkiewicz, Alex Lewandowski, Mateusz Ostaszewski, Clare Lyle
Abstract:
Trainable activation functions, whose parameters are optimized alongside network weights, offer increased expressivity compared to fixed activation functions. Specifically, trainable activation functions defined as ratios of polynomials (rational functions) have been proposed to enhance plasticity in reinforcement learning. However, their impact on training stability remains unclear. In this work, we study trainable rational activations in both reinforcement and continual learning settings. We find that while their flexibility enhances adaptability, it can also introduce instability, leading to overestimation in RL and feature collapse in longer continual learning scenarios. Our main result is demonstrating a trade-off between expressivity and plasticity in rational activations. To address this, we propose a constrained variant that structurally limits excessive output scaling while preserving adaptability. Experiments across MetaWorld and DeepMind Control Suite (DMC) environments show that our approach improves training stability and performance. In continual learning benchmarks, including MNIST with reshuffled labels and Split CIFAR-100, we reveal how different constraints affect the balance between expressivity and long-term retention. While preliminary experiments in discrete action domains (e.g., Atari) did not show similar instability, this suggests that the trade-off is particularly relevant for continuous control. Together, our findings provide actionable design principles for robust and adaptable trainable activations in dynamic, non-stationary environments. Code available at: https://github.com/special114/rl_rational_plasticity.
Chinese: 可训练有理激活函数在增强神经网络适应性的同时可能引发训练不稳定,我们提出的约束变体通过平衡表达性与可塑性,在强化学习和持续学习任务中提升了稳定性和性能。
English: Trainable rational activation functions enhance neural network adaptability but risk training instability, which our constrained variant mitigates by balancing expressivity and plasticity for improved performance in reinforcement and continual learning.
Authors:Vinicius Anjos de Almeida, Vinicius de Camargo, Raquel Gómez-Bravo, Egbert van der Haring, Kees van Boven, Marcelo Finger, Luis Fernandez Lopez
Abstract:
Background: Medical coding structures healthcare data for research, quality monitoring, and policy. This study assesses the potential of large language models (LLMs) to assign ICPC-2 codes using the output of a domain-specific search engine.
Methods: A dataset of 437 Brazilian Portuguese clinical expressions, each annotated with ICPC-2 codes, was used. A semantic search engine (OpenAI's text-embedding-3-large) retrieved candidates from 73,563 labeled concepts. Thirty-three LLMs were prompted with each query and retrieved results to select the best-matching ICPC-2 code. Performance was evaluated using F1-score, along with token usage, cost, response time, and format adherence.
Results: Twenty-eight models achieved F1-score > 0.8; ten exceeded 0.85. Top performers included gpt-4.5-preview, o3, and gemini-2.5-pro. Retriever optimization can improve performance by up to 4 points. Most models returned valid codes in the expected format, with reduced hallucinations. Smaller models (<3B) struggled with formatting and input length.
Conclusions: LLMs show strong potential for automating ICPC-2 coding, even without fine-tuning. This work offers a benchmark and highlights challenges, but findings are limited by dataset scope and setup. Broader, multilingual, end-to-end evaluations are needed for clinical validation.
中文摘要:研究表明大型语言模型能有效利用语义搜索结果自动分配ICPC-2医疗编码,最佳模型在保持正确格式和较低错误率的同时实现了高准确率。
English Summary: This study demonstrates that large language models can effectively automate ICPC-2 medical coding using semantic search results, with top models achieving high accuracy while maintaining proper formatting and minimal errors.
Authors:Yuchen Duan, Zhe Chen, Yusong Hu, Weiyun Wang, Shenglong Ye, Botian Shi, Lewei Lu, Qibin Hou, Tong Lu, Hongsheng Li, Jifeng Dai, Wenhai Wang
Abstract:
Despite significant progress in multimodal large language models (MLLMs), their performance on complex, multi-page document comprehension remains inadequate, largely due to the lack of high-quality, document-level datasets. While current retrieval-augmented generation (RAG) methods offer partial solutions, they suffer from issues, such as fragmented retrieval contexts, multi-stage error accumulation, and extra time costs of retrieval. In this work, we present a high-quality document-level dataset, Doc-750K, designed to support in-depth understanding of multimodal documents. This dataset includes diverse document structures, extensive cross-page dependencies, and real question-answer pairs derived from the original documents. Building on the dataset, we develop a native multimodal model, Docopilot, which can accurately handle document-level dependencies without relying on RAG. Experiments demonstrate that Docopilot achieves superior coherence, accuracy, and efficiency in document understanding tasks and multi-turn interactions, setting a new baseline for document-level multimodal understanding. Data, code, and models are released at https://github.com/OpenGVLab/Docopilot
Chinese: 本文提出了高质量数据集Doc-750K以解决文档级多模态数据不足的问题,并开发了原生多模态模型Docopilot,该模型无需依赖检索增强生成就能在文档理解任务中实现更优性能。
English: This paper introduces Doc-750K, a high-quality dataset addressing the lack of document-level multimodal data, and presents Docopilot, a native multimodal model that outperforms existing methods in document understanding without relying on retrieval-augmented generation.
Authors:Qibing Ren, Sitao Xie, Longxuan Wei, Zhenfei Yin, Junchi Yan, Lizhuang Ma, Jing Shao
Abstract:
Recent large-scale events like election fraud and financial scams have shown how harmful coordinated efforts by human groups can be. With the rise of autonomous AI systems, there is growing concern that AI-driven groups could also cause similar harm. While most AI safety research focuses on individual AI systems, the risks posed by multi-agent systems (MAS) in complex real-world situations are still underexplored. In this paper, we introduce a proof-of-concept to simulate the risks of malicious MAS collusion, using a flexible framework that supports both centralized and decentralized coordination structures. We apply this framework to two high-risk fields: misinformation spread and e-commerce fraud. Our findings show that decentralized systems are more effective at carrying out malicious actions than centralized ones. The increased autonomy of decentralized systems allows them to adapt their strategies and cause more damage. Even when traditional interventions, like content flagging, are applied, decentralized groups can adjust their tactics to avoid detection. We present key insights into how these malicious groups operate and the need for better detection systems and countermeasures. Code is available at https://github.com/renqibing/RogueAgent.
中文: 本文提出一个模拟恶意多智能体系统风险的框架,研究表明去中心化AI群体在执行虚假信息传播和电商欺诈等恶意行为时更具适应性,能够规避传统干预措施。
English: This paper introduces a simulation framework to assess the risks of malicious multi-agent systems, demonstrating that decentralized AI groups are more effective at executing harmful actions like misinformation and fraud while evading traditional countermeasures.
Authors:Jifeng Shen, Haibo Zhan, Shaohua Dong, Xin Zuo, Wankou Yang, Haibin Ling
Abstract:
Modern multispectral feature fusion for object detection faces two critical limitations: (1) Excessive preference for local complementary features over cross-modal shared semantics adversely affects generalization performance; and (2) The trade-off between the receptive field size and computational complexity present critical bottlenecks for scalable feature modeling. Addressing these issues, a novel Multispectral State-Space Feature Fusion framework, dubbed MS2Fusion, is proposed based on the state space model (SSM), achieving efficient and effective fusion through a dual-path parametric interaction mechanism. More specifically, the first cross-parameter interaction branch inherits the advantage of cross-attention in mining complementary information with cross-modal hidden state decoding in SSM. The second shared-parameter branch explores cross-modal alignment with joint embedding to obtain cross-modal similar semantic features and structures through parameter sharing in SSM. Finally, these two paths are jointly optimized with SSM for fusing multispectral features in a unified framework, allowing our MS2Fusion to enjoy both functional complementarity and shared semantic space. In our extensive experiments on mainstream benchmarks including FLIR, M3FD and LLVIP, our MS2Fusion significantly outperforms other state-of-the-art multispectral object detection methods, evidencing its superiority. Moreover, MS2Fusion is general and applicable to other multispectral perception tasks. We show that, even without specific design, MS2Fusion achieves state-of-the-art results on RGB-T semantic segmentation and RGBT salient object detection, showing its generality. The source code will be available at https://github.com/61s61min/MS2Fusion.git.
中文:MS2Fusion框架通过状态空间模型中的双路径参数交互机制,解决了多光谱特征融合中的关键限制,在多个基准测试的对象检测及其他多光谱任务中实现了卓越性能。
English: The proposed MS2Fusion framework addresses limitations in multispectral feature fusion by employing a dual-path parametric interaction mechanism within a state space model, achieving superior performance in object detection and other multispectral tasks across multiple benchmarks.
Authors:Guoping Xu, Christopher Kabat, You Zhang
Abstract:
Recent advances in medical image segmentation have been driven by deep learning; however, most existing methods remain limited by modality-specific designs and exhibit poor adaptability to dynamic medical imaging scenarios. The Segment Anything Model 2 (SAM2) and its related variants, which introduce a streaming memory mechanism for real-time video segmentation, present new opportunities for prompt-based, generalizable solutions. Nevertheless, adapting these models to medical video scenarios typically requires large-scale datasets for retraining or transfer learning, leading to high computational costs and the risk of catastrophic forgetting. To address these challenges, we propose DD-SAM2, an efficient adaptation framework for SAM2 that incorporates a Depthwise-Dilated Adapter (DD-Adapter) to enhance multi-scale feature extraction with minimal parameter overhead. This design enables effective fine-tuning of SAM2 on medical videos with limited training data. Unlike existing adapter-based methods focused solely on static images, DD-SAM2 fully exploits SAM2's streaming memory for medical video object tracking and segmentation. Comprehensive evaluations on TrackRad2025 (tumor segmentation) and EchoNet-Dynamic (left ventricle tracking) datasets demonstrate superior performance, achieving Dice scores of 0.93 and 0.97, respectively. To the best of our knowledge, this work provides an initial attempt at systematically exploring adapter-based SAM2 fine-tuning for medical video segmentation and tracking. Code, datasets, and models will be publicly available at https://github.com/apple1986/DD-SAM2.
中文: DD-SAM2框架通过引入深度可分离膨胀适配器,有效优化了Segment Anything Model 2在医学视频分割中的应用,在专业数据集上以最小计算成本实现了卓越性能。
English: The DD-SAM2 framework efficiently adapts the Segment Anything Model 2 for medical video segmentation by incorporating a Depthwise-Dilated Adapter, achieving superior performance with minimal computational overhead on specialized datasets.
Authors:Andrea Moschetto, Lemuel Puglisi, Alec Sargood, Pierluigi Dell'Acqua, Francesco Guarnera, Sebastiano Battiato, Daniele Ravì
Abstract:
Magnetic Resonance Imaging (MRI) enables the acquisition of multiple image contrasts, such as T1-weighted (T1w) and T2-weighted (T2w) scans, each offering distinct diagnostic insights. However, acquiring all desired modalities increases scan time and cost, motivating research into computational methods for cross-modal synthesis. To address this, recent approaches aim to synthesize missing MRI contrasts from those already acquired, reducing acquisition time while preserving diagnostic quality. Image-to-image (I2I) translation provides a promising framework for this task. In this paper, we present a comprehensive benchmark of generative models$\unicode{x2013}$specifically, Generative Adversarial Networks (GANs), diffusion models, and flow matching (FM) techniques$\unicode{x2013}$for T1w-to-T2w 2D MRI I2I translation. All frameworks are implemented with comparable settings and evaluated on three publicly available MRI datasets of healthy adults. Our quantitative and qualitative analyses show that the GAN-based Pix2Pix model outperforms diffusion and FM-based methods in terms of structural fidelity, image quality, and computational efficiency. Consistent with existing literature, these results suggest that flow-based models are prone to overfitting on small datasets and simpler tasks, and may require more data to match or surpass GAN performance. These findings offer practical guidance for deploying I2I translation techniques in real-world MRI workflows and highlight promising directions for future research in cross-modal medical image synthesis. Code and models are publicly available at https://github.com/AndreaMoschetto/medical-I2I-benchmark.
中文: 本文对跨模态磁共振图像合成的生成模型进行基准测试,发现基于GAN的Pix2Pix在图像质量和效率上优于扩散模型与流匹配方法,同时揭示了流模型存在数据依赖性问题。
English: This paper benchmarks generative models for cross-modal MRI synthesis, finding that GAN-based Pix2Pix outperforms diffusion and flow matching methods in image quality and efficiency while highlighting flow models' data dependency issues.
Authors:Sujata Gaihre, Amir Thapa Magar, Prasuna Pokharel, Laxmi Tiwari
Abstract:
This paper describes our approach to Subtask 1 of the ImageCLEFmed MEDVQA 2025 Challenge, which targets visual question answering (VQA) for gastrointestinal endoscopy. We adopt the Florence model-a large-scale multimodal foundation model-as the backbone of our VQA pipeline, pairing a powerful vision encoder with a text encoder to interpret endoscopic images and produce clinically relevant answers. To improve generalization, we apply domain-specific augmentations that preserve medical features while increasing training diversity. Experiments on the KASVIR dataset show that fine-tuning Florence yields accurate responses on the official challenge metrics. Our results highlight the potential of large multimodal models in medical VQA and provide a strong baseline for future work on explainability, robustness, and clinical integration. The code is publicly available at: https://github.com/TiwariLaxuu/VQA-Florence.git
中文: 本文采用Florence多模态基础模型构建胃肠内镜视觉问答系统,通过领域特定数据增强和微调在MEDVQA 2025挑战中取得优异表现,为医疗VQA应用提供了重要基准。
English: This paper presents a Florence-based VQA system for gastrointestinal endoscopy that uses domain-specific augmentation and fine-tuning to achieve strong performance on the MEDVQA 2025 challenge, demonstrating the potential of large multimodal models in medical applications.
Authors:Wenxuan Zeng, Tianshi Xu, Yi Chen, Yifan Zhou, Mingzhe Zhang, Jin Tan, Cheng Hong, Meng Li
Abstract:
Privacy-preserving machine learning (PPML) based on cryptographic protocols has emerged as a promising paradigm to protect user data privacy in cloud-based machine learning services. While it achieves formal privacy protection, PPML often incurs significant efficiency and scalability costs due to orders of magnitude overhead compared to the plaintext counterpart. Therefore, there has been a considerable focus on mitigating the efficiency gap for PPML. In this survey, we provide a comprehensive and systematic review of recent PPML studies with a focus on cross-level optimizations. Specifically, we categorize existing papers into protocol level, model level, and system level, and review progress at each level. We also provide qualitative and quantitative comparisons of existing works with technical insights, based on which we discuss future research directions and highlight the necessity of integrating optimizations across protocol, model, and system levels. We hope this survey can provide an overarching understanding of existing approaches and potentially inspire future breakthroughs in the PPML field. As the field is evolving fast, we also provide a public GitHub repository to continuously track the developments, which is available at https://github.com/PKU-SEC-Lab/Awesome-PPML-Papers.
Chinese: 本综述系统梳理了隐私保护机器学习(PPML)的研究进展,重点关注协议、模型和系统层面的跨层级优化方法,以在保障数据隐私的同时提升计算效率。
English: This survey comprehensively reviews privacy-preserving machine learning (PPML) studies, focusing on cross-level optimizations across protocol, model, and system levels to address efficiency challenges while ensuring data privacy.
Authors:Yitong Lin, Jiaying He, Jiahe Chen, Xinnan Zhu, Jianwei Zheng, Tao Bo
Abstract:
Motivation: Biomedical knowledge graphs (KGs) are crucial for drug discovery and disease understanding, yet their completion and reasoning are challenging. Knowledge Embedding (KE) methods capture global semantics but struggle with dynamic structural integration, while Graph Neural Networks (GNNs) excel locally but often lack semantic understanding. Even ensemble approaches, including those leveraging language models, often fail to achieve a deep, adaptive, and synergistic co-evolution between semantic comprehension and structural learning. Addressing this critical gap in fostering continuous, reciprocal refinement between these two aspects in complex biomedical KGs is paramount.
Results: We introduce BioGraphFusion, a novel framework for deeply synergistic semantic and structural learning. BioGraphFusion establishes a global semantic foundation via tensor decomposition, guiding an LSTM-driven mechanism to dynamically refine relation embeddings during graph propagation. This fosters adaptive interplay between semantic understanding and structural learning, further enhanced by query-guided subgraph construction and a hybrid scoring mechanism. Experiments across three key biomedical tasks demonstrate BioGraphFusion's superior performance over state-of-the-art KE, GNN, and ensemble models. A case study on Cutaneous Malignant Melanoma 1 (CMM1) highlights its ability to unveil biologically meaningful pathways.
Availability and Implementation: Source code and all training data are freely available for download at https://github.com/Y-TARL/BioGraphFusion.
Supplementary information: Supplementary data are available at Bioinformatics online.
中文摘要:BioGraphFusion是一个创新框架,通过在生物医学知识图谱中实现语义理解与结构学习的深度协同,在多项任务中展现出卓越性能,并能揭示具有生物学意义的通路。
English Summary: BioGraphFusion is a novel framework that achieves deep synergy between semantic and structural learning in biomedical knowledge graphs, demonstrating superior performance across multiple tasks and revealing biologically meaningful pathways.
Authors:Chi Wan, Yixin Cui, Jiatong Du, Shuo Yang, Yulong Bai, Peng Yi, Nan Li, Yanjun Huang
Abstract:
End-to-end autonomous driving requires adaptive and robust handling of complex and diverse traffic environments. However, prevalent single-mode planning methods attempt to learn an overall policy while struggling to acquire diversified driving skills to handle diverse scenarios. Therefore, this paper proposes GEMINUS, a Mixture-of-Experts end-to-end autonomous driving framework featuring a Global Expert and a Scene-Adaptive Experts Group, equipped with a Dual-aware Router. Specifically, the Global Expert is trained on the overall dataset, possessing robust performance. The Scene-Adaptive Experts are trained on corresponding scene subsets, achieving adaptive performance. The Dual-aware Router simultaneously considers scenario-level features and routing uncertainty to dynamically activate expert modules. Through the effective coupling of the Global Expert and the Scene-Adaptive Experts Group via the Dual-aware Router, GEMINUS achieves both adaptability and robustness across diverse scenarios. GEMINUS outperforms existing methods in the Bench2Drive closed-loop benchmark and achieves state-of-the-art performance in Driving Score and Success Rate, even with only monocular vision input. The code is available at https://github.com/newbrains1/GEMINUS.
中文: 本文提出GEMINUS框架,通过双感知路由器将全局专家与场景自适应专家相结合,实现了自动驾驶在多样化场景中的卓越适应性和鲁棒性,并在基准测试中达到最优性能。
English: This paper introduces GEMINUS, a Mixture-of-Experts framework that enhances autonomous driving by combining a robust Global Expert with scene-specific experts through a Dual-aware Router, achieving superior adaptability and state-of-the-art performance on benchmarks.
Authors:Chun-Ming Yang, Pranav A. Bhounsule
Abstract:
Time-delay embedding is a technique that uses snapshots of state history over time to build a linear state space model of a nonlinear smooth system. We demonstrate that periodic non-smooth or hybrid system can also be modeled as a linear state space system using this approach as long as its behavior is consistent in modes and timings. We extend time-delay embeddings to generate a linear model of two periodic hybrid systems: the bouncing pendulum and the simplest walker with control inputs. This leads to a state history augmented linear quadratic regulator (LQR) which uses current and past state history for feedback control. Example code can be found at https://github.com/Chun-MingYang/koopman-timeDelay-lqr.git
中文:时滞嵌入技术可将周期性混合系统(如弹跳摆)构建为线性模型,从而形成基于状态历史反馈的线性二次型调节器。
English: Time-delay embedding enables linear modeling of periodic hybrid systems like bouncing pendulums, creating a state history-based linear quadratic regulator for feedback control.
Authors:Weikang Gu, Mingyue Han, Li Xue, Heng Dong, Changcai Yang, Riqing Chen, Lifang Wei
Abstract:
The accurate identification of high-quality correspondences is a prerequisite task in feature-based point cloud registration. However, it is extremely challenging to handle the fusion of local and global features due to feature redundancy and complex spatial relationships. Given that Gestalt principles provide key advantages in analyzing local and global relationships, we propose a novel Gestalt-guided Parallel Interaction Network via orthogonal geometric consistency (GPI-Net) in this paper. It utilizes Gestalt principles to facilitate complementary communication between local and global information. Specifically, we introduce an orthogonal integration strategy to optimally reduce redundant information and generate a more compact global structure for high-quality correspondences. To capture geometric features in correspondences, we leverage a Gestalt Feature Attention (GFA) block through a hybrid utilization of self-attention and cross-attention mechanisms. Furthermore, to facilitate the integration of local detail information into the global structure, we design an innovative Dual-path Multi-Granularity parallel interaction aggregation (DMG) block to promote information exchange across different granularities. Extensive experiments on various challenging tasks demonstrate the superior performance of our proposed GPI-Net in comparison to existing methods. The code will be released at https://github.com/gwk429/GPI-Net.
中文摘要:本文提出GPI-Net,通过格式塔原则指导的并行交互网络,利用正交几何一致性优化局部与全局特征融合,在点云配准任务中展现出优于现有方法的性能。
English Summary: This paper introduces GPI-Net, a Gestalt-guided network that enhances point cloud registration by integrating local and global features through orthogonal geometric consistency and attention mechanisms, demonstrating superior performance in various tasks.
Authors:Praneeth Namburi, Roger Pallarès-López, Jessica Rosendorf, Duarte Folgado, Brian W. Anthony
Abstract:
Ultrasound technology enables safe, non-invasive imaging of dynamic tissue behavior, making it a valuable tool in medicine, biomechanics, and sports science. However, accurately tracking tissue motion in B-mode ultrasound remains challenging due to speckle noise, low edge contrast, and out-of-plane movement. These challenges complicate the task of tracking anatomical landmarks over time, which is essential for quantifying tissue dynamics in many clinical and research applications. This manuscript introduces DUSTrack (Deep learning and optical flow-based toolkit for UltraSound Tracking), a semi-automated framework for tracking arbitrary points in B-mode ultrasound videos. We combine deep learning with optical flow to deliver high-quality and robust tracking across diverse anatomical structures and motion patterns. The toolkit includes a graphical user interface that streamlines the generation of high-quality training data and supports iterative model refinement. It also implements a novel optical-flow-based filtering technique that reduces high-frequency frame-to-frame noise while preserving rapid tissue motion. DUSTrack demonstrates superior accuracy compared to contemporary zero-shot point trackers and performs on par with specialized methods, establishing its potential as a general and foundational tool for clinical and biomechanical research. We demonstrate DUSTrack's versatility through three use cases: cardiac wall motion tracking in echocardiograms, muscle deformation analysis during reaching tasks, and fascicle tracking during ankle plantarflexion. As an open-source solution, DUSTrack offers a powerful, flexible framework for point tracking to quantify tissue motion from ultrasound videos. DUSTrack is available at https://github.com/praneethnamburi/DUSTrack.
中文: DUSTrack是一种结合深度学习与光流的半自动化工具包,能精准追踪B超视频中的任意点位,有效克服散斑噪声等难题,在临床和生物力学应用中性能优于现有方法。
English: DUSTrack is a semi-automated toolkit combining deep learning and optical flow to accurately track arbitrary points in B-mode ultrasound videos, overcoming challenges like speckle noise and outperforming existing methods in clinical and biomechanical applications.
Authors:Hui Yang, Jiaoyan Chen, Yuan He, Yongsheng Gao, Ian Horrocks
Abstract:
OWL (Web Ontology Language) ontologies which are able to formally represent complex knowledge and support semantic reasoning have been widely adopted across various domains such as healthcare and bioinformatics. Recently, ontology embeddings have gained wide attention due to its potential to infer plausible new knowledge and approximate complex reasoning. However, existing methods face notable limitations: geometric model-based embeddings typically overlook valuable textual information, resulting in suboptimal performance, while the approaches that incorporate text, which are often based on language models, fail to preserve the logical structure. In this work, we propose a new ontology embedding method OnT, which tunes a Pretrained Language Model (PLM) via geometric modeling in a hyperbolic space for effectively incorporating textual labels and simultaneously preserving class hierarchies and other logical relationships of Description Logic EL. Extensive experiments on four real-world ontologies show that OnT consistently outperforms the baselines including the state-of-the-art across both tasks of prediction and inference of axioms. OnT also demonstrates strong potential in real-world applications, indicated by its robust transfer learning abilities and effectiveness in real cases of constructing a new ontology from SNOMED CT. Data and code are available at https://github.com/HuiYang1997/OnT.
OWL本体广泛应用于知识表示和推理,但现有嵌入方法要么忽略文本信息,要么无法保持逻辑结构,因此提出了OnT这一新方法,它通过双曲几何建模结合预训练语言模型,有效整合文本并维护逻辑关系,在多个真实本体上的预测和推理任务中均表现出优越性能。
OWL ontologies are widely used for knowledge representation and reasoning, but current embedding methods either neglect textual information or fail to preserve logical structures, leading to the development of OnT, a novel approach that combines pretrained language models with hyperbolic geometric modeling to effectively integrate text and maintain logical relationships, demonstrating superior performance in prediction and inference tasks across multiple real-world ontologies.
Authors:Aryana Hou, Li Lin, Justin Li, Shu Hu
Abstract:
Generative AI models have substantially improved the realism of synthetic media, yet their misuse through sophisticated DeepFakes poses significant risks. Despite recent advances in deepfake detection, fairness remains inadequately addressed, enabling deepfake markers to exploit biases against specific populations. While previous studies have emphasized group-level fairness, individual fairness (i.e., ensuring similar predictions for similar individuals) remains largely unexplored. In this work, we identify for the first time that the original principle of individual fairness fundamentally fails in the context of deepfake detection, revealing a critical gap previously unexplored in the literature. To mitigate it, we propose the first generalizable framework that can be integrated into existing deepfake detectors to enhance individual fairness and generalization. Extensive experiments conducted on leading deepfake datasets demonstrate that our approach significantly improves individual fairness while maintaining robust detection performance, outperforming state-of-the-art methods. The code is available at https://github.com/Purdue-M2/Individual-Fairness-Deepfake-Detection.
中文: 本研究首次提出一个创新框架,以解决深度伪造检测中被忽视的个体公平性问题,在多个主流数据集上显著提升了公平性和检测性能。
English: This study introduces a pioneering framework to address the overlooked issue of individual fairness in deepfake detection, enhancing both fairness and performance across leading datasets.
Authors:Qiyu Xu, Zhanxuan Hu, Yu Duan, Ercheng Pei, Yonghang Tai
Abstract:
Generalized Category Discovery (GCD) aims to classify unlabeled data from both known and unknown categories by leveraging knowledge from labeled known categories. While existing methods have made notable progress, they often overlook a hidden stumbling block in GCD: distracted attention. Specifically, when processing unlabeled data, models tend to focus not only on key objects in the image but also on task-irrelevant background regions, leading to suboptimal feature extraction. To remove this stumbling block, we propose Attention Focusing (AF), an adaptive mechanism designed to sharpen the model's focus by pruning non-informative tokens. AF consists of two simple yet effective components: Token Importance Measurement (TIME) and Token Adaptive Pruning (TAP), working in a cascade. TIME quantifies token importance across multiple scales, while TAP prunes non-informative tokens by utilizing the multi-scale importance scores provided by TIME. AF is a lightweight, plug-and-play module that integrates seamlessly into existing GCD methods with minimal computational overhead. When incorporated into one prominent GCD method, SimGCD, AF achieves up to 15.4% performance improvement over the baseline with minimal computational overhead. The implementation code is provided in https://github.com/Afleve/AFGCD.
中文: 摘要提出注意力聚焦(AF)机制,通过剪除非信息性标记来增强广义类别发现,有效聚焦关键对象并减少背景干扰,以极低计算成本显著提升模型性能。
English: The abstract introduces Attention Focusing (AF), an adaptive mechanism that enhances Generalized Category Discovery by pruning non-informative tokens to sharpen model focus and reduce background distraction, achieving significant performance improvements with minimal computational cost.
Authors:Licheng Liu, Zihan Wang, Linjie Li, Chenwei Xu, Yiping Lu, Han Liu, Avirup Sil, Manling Li
Abstract:
Multi-turn problem solving is critical yet challenging for Large Reasoning Models (LRMs) to reflect on their reasoning and revise from feedback. Existing Reinforcement Learning (RL) methods train large reasoning models on a single-turn paradigm with verifiable rewards. However, we observe that models trained with existing RL paradigms often lose their ability to solve problems across multiple turns and struggle to revise answers based on contextual feedback, leading to repetitive responses. We ask: can LRMs learn to reflect their answers in a multi-turn context? In this work, we find that training models with multi-turn RL using only unary feedback (e.g., "Let's try again") after wrong answers can improve both single-turn performance and multi-turn reasoning. We introduce Unary Feedback as Observation (UFO) for reinforcement learning, which uses minimal yet common unary user feedback during iterative problem solving. It can be easily applied to existing single-turn RL training setups. Experimental results show that RL training with UFO keeps single-turn performance and improves multi-turn reasoning accuracy by up to 14%, enabling language models to better react to feedback in multi-turn problem solving. To further minimize the number of turns needed for a correct answer while encouraging diverse reasoning when mistakes occur, we design reward structures that guide models to produce careful and deliberate answers in each turn. Code: https://github.com/lichengliu03/unary-feedback
Chinese: 本研究提出单点反馈作为观察(UFO)的强化学习方法,通过使用极简的单点反馈来增强大型推理模型在单轮和多轮问题解决中的表现,将多轮推理准确率提升高达14%,同时保持单轮性能。
English: This study introduces Unary Feedback as Observation (UFO), a reinforcement learning approach that uses minimal unary feedback to enhance large reasoning models' performance in both single-turn and multi-turn problem-solving, improving multi-turn reasoning accuracy by up to 14% while maintaining single-turn capabilities.
Authors:Boyuan Zheng, Zeyi Liao, Scott Salisbury, Zeyuan Liu, Michael Lin, Qinyuan Zheng, Zifan Wang, Xiang Deng, Dawn Song, Huan Sun, Yu Su
Abstract:
The rapid development of autonomous web agents powered by Large Language Models (LLMs), while greatly elevating efficiency, exposes the frontier risk of taking unintended or harmful actions. This situation underscores an urgent need for effective safety measures, akin to access controls for human users. To address this critical challenge, we introduce WebGuard, the first comprehensive dataset designed to support the assessment of web agent action risks and facilitate the development of guardrails for real-world online environments. In doing so, WebGuard specifically focuses on predicting the outcome of state-changing actions and contains 4,939 human-annotated actions from 193 websites across 22 diverse domains, including often-overlooked long-tail websites. These actions are categorized using a novel three-tier risk schema: SAFE, LOW, and HIGH. The dataset includes designated training and test splits to support evaluation under diverse generalization settings. Our initial evaluations reveal a concerning deficiency: even frontier LLMs achieve less than 60% accuracy in predicting action outcomes and less than 60% recall in lagging HIGH-risk actions, highlighting the risks of deploying current-generation agents without dedicated safeguards. We therefore investigate fine-tuning specialized guardrail models using WebGuard. We conduct comprehensive evaluations across multiple generalization settings and find that a fine-tuned Qwen2.5VL-7B model yields a substantial improvement in performance, boosting accuracy from 37% to 80% and HIGH-risk action recall from 20% to 76%. Despite these improvements, the performance still falls short of the reliability required for high-stakes deployment, where guardrails must approach near-perfect accuracy and recall.
中文: 大型语言模型驱动的自主网络代理快速发展带来了意外有害行为的高风险,为此开发的WebGuard数据集通过评估行动风险和训练防护模型,揭示了现有模型的安全缺陷,尽管微调后性能显著提升,但仍需近乎完美的保障措施才能满足高风险部署要求。
English: The rapid advancement of LLM-powered autonomous web agents introduces significant risks of unintended harmful actions, prompting the development of WebGuard, a comprehensive dataset for assessing action risks and training guardrail models, which reveals current models' safety deficiencies and the need for near-perfect safeguards despite performance improvements from fine-tuning.
Authors:Ravin Kumar
Abstract:
We propose the APTx Neuron, a novel, unified neural computation unit that integrates non-linear activation and linear transformation into a single trainable expression. The APTx Neuron is derived from the APTx activation function, thereby eliminating the need for separate activation layers and making the architecture both computationally efficient and elegant. The proposed neuron follows the functional form $y = \sum_{i=1}^{n} ((α_i + \tanh(β_i x_i)) \cdot γ_i x_i) + δ$, where all parameters $α_i$, $β_i$, $γ_i$, and $δ$ are trainable. We validate our APTx Neuron-based architecture on the MNIST dataset, achieving up to $96.69\%$ test accuracy within 11 epochs using approximately 332K trainable parameters. The results highlight the superior expressiveness and computational efficiency of the APTx Neuron compared to traditional neurons, pointing toward a new paradigm in unified neuron design and the architectures built upon it. Source code is available at https://github.com/mr-ravin/aptx_neuron.
Chinese: APTx神经元是一种新颖的神经计算单元,将激活和变换功能整合为单一可训练表达式,在MNIST数据集上以约33.2万参数实现96.69%的准确率,显著提升了计算效率。
English: The APTx Neuron is a novel neural computation unit that integrates activation and transformation into a single trainable expression, achieving 96.69% accuracy on MNIST with enhanced computational efficiency.
Authors:Jakub Walczak, Piotr Tomalak, Artur Laskowski
Abstract:
Generative AI is gaining increasing attention in software engineering, where testing remains an indispensable reliability mechanism. According to the widely adopted testing pyramid, unit tests constitute the majority of test cases and are often schematic, requiring minimal domain expertise. Automatically generating such tests under the supervision of software engineers can significantly enhance productivity during the development phase of the software lifecycle.
This paper investigates the impact of code context and prompting strategies on the quality and adequacy of unit tests generated by various large language models (LLMs) across several families. The results show that including docstrings notably improves code adequacy, while further extending context to the full implementation yields definitely smaller gains. Notably, the chain-of-thought prompting strategy -- applied even to 'reasoning' models -- achieves the best results, with up to 96.3\% branch coverage, a 57\% average mutation score, and near-perfect compilation success rate. Among the evaluated models, M5 (Gemini 2.5 Pro) demonstrated superior performance in both mutation score and branch coverage being still in top in terms of compilation success rate.
All the code and resulting test suites are publicly available at https://github.com/peetery/LLM-analysis.
中文摘要:本研究探讨代码上下文和提示策略对大型语言模型生成单元测试质量的影响,发现包含文档字符串可显著提高测试充分性,而链式思维提示策略效果最佳,其中Gemini 2.5 Pro在评估模型中表现最优。
English Summary: This study explores how code context and prompting strategies affect the quality of unit tests generated by large language models, finding that docstrings significantly enhance test adequacy and chain-of-thought prompting achieves the best results, with Gemini 2.5 Pro performing best among evaluated models.
Authors:Dachuan Shi, Yonggan Fu, Xiangchi Yuan, Zhongzhi Yu, Haoran You, Sixu Li, Xin Dong, Jan Kautz, Pavlo Molchanov, Yingyan, Lin
Abstract:
Recent advancements in Large Language Models (LLMs) have spurred interest in numerous applications requiring robust long-range capabilities, essential for processing extensive input contexts and continuously generating extended outputs. As sequence lengths increase, the number of Key-Value (KV) pairs in LLMs escalates, creating a significant efficiency bottleneck. In this paper, we propose a new KV cache optimization paradigm called LaCache, a training-free method for efficient and accurate generative inference of LLMs. LaCache enables LLMs to simultaneously address both of the critical challenges in long-range modeling: robust long-range capabilities and continuous generation without running out-of-memory (OOM). Specifically, LaCache integrates two key innovations: (1) a ladder-shaped KV cache pattern that stores KV pairs not only sequentially (left-to-right within each layer) but also across layers (from shallow to deep), providing an extended span for capturing long-range dependencies under a fixed storage budget, thereby boosting long-range capabilities; and (2) an iterative compaction mechanism that progressively compresses older caches, freeing up space for new tokens within a fixed cache size. This token distance-based dynamic compression enables more effective continuous generation under constrained cache budgets. Experiments across various tasks, benchmarks, and LLM models consistently validate LaCache's effectiveness in enhancing LLMs' long-range capabilities. Our code is available at https://github.com/GATECH-EIC/LaCache.
中文:LaCache是一种无需训练的KV缓存优化方法,通过阶梯式缓存模式和迭代压缩机制,有效提升大语言模型的长程处理能力和持续生成效率,并在多类基准测试中得到验证。
English: LaCache is a training-free KV cache optimization method that enhances LLMs' long-range capabilities and continuous generation efficiency through a ladder-shaped cache pattern and iterative compaction mechanism, validated across diverse benchmarks.
Authors:Shengji Tang, Jianjian Cao, Weihao Lin, Jiale Hong, Bo Zhang, Shuyue Hu, Lei Bai, Tao Chen, Wanli Ouyang, Peng Ye
Abstract:
This paper aims to demonstrate the potential and strengths of open-source collectives. It leads to a promising question: Can we harness multiple open-source LLMs to match or even beat the closed-source LLMs? To answer this, we propose SMACS, a scalable multi-agent collaboration system (MACS) framework with high performance. Specifically, for continuous integration of new LLMs and generalization to diverse questions, we first propose a Retrieval-based Prior Selection (RPS), which assigns a proxy performance score to each LLM to select the Top-k LLMs at the instance level for any given question. Then, we propose an Exploration-Exploitation-Driven Posterior Enhancement (EPE), encouraging the generation of diverse responses through prior dropping and selecting the high-quality response via a hybrid posterior score. Experiments on eight mainstream benchmarks validate the effectiveness of our SMACS: by integrating fifteen open-source LLMs, SMACS outperforms leading closed-source LLMs in 2025, e.g., Claude-3.7-Sonnet (+12.73%), GPT-4.1(+5.36%) and GPT-o3-mini(+5.28%) across multiple tasks. Remarkably, it even exceeds the average of best results of different datasets from both open-source LLMs (+2.86%) and closed-source LLMs (+2.04%), pushing the upper bound of intelligence. Code will be released at https://github.com/magent4aci/SMACS.
中文: 本文提出SMACS可扩展多智能体协作系统,通过基于检索的优先选择和探索-利用驱动的后验增强机制,成功整合多个开源大语言模型,在多项基准测试中超越主流闭源模型性能。
English: This paper introduces SMACS, a scalable multi-agent collaboration system that effectively integrates multiple open-source LLMs through retrieval-based selection and posterior enhancement, outperforming leading closed-source models across multiple benchmarks.
Authors:Julien Pourcel, Cédric Colas, Pierre-Yves Oudeyer
Abstract:
Many program synthesis tasks prove too challenging for even state-of-the-art language models to solve in single attempts. Search-based evolutionary methods offer a promising alternative by exploring solution spaces iteratively, but their effectiveness remain limited by the fixed capabilities of the underlying generative model.
We propose SOAR, a method that learns program synthesis by integrating language models into a self-improving evolutionary loop.
SOAR alternates between (1) an evolutionary search that uses an LLM to sample and refine candidate solutions, and (2) a hindsight learning phase that converts search attempts into valid problem-solution pairs used to fine-tune the LLM's sampling and refinement capabilities\, -- \,enabling increasingly effective search in subsequent iterations.
On the challenging ARC-AGI benchmark, SOAR achieves significant performance gains across model scales and iterations, leveraging positive transfer between the sampling and refinement finetuning tasks. These improvements carry over to test-time adaptation, enabling SOAR to solve 52\% of the public test set. Our code is open-sourced at: https://github.com/flowersteam/SOAR
中文: SOAR是一种自改进的进化方法,通过将语言模型整合到进化搜索与后见学习的迭代循环中,利用搜索尝试生成的问题-解决方案对微调模型,在ARC-AGI基准测试中实现了显著性能提升。
English: SOAR is a self-improving evolutionary method that integrates language models into an iterative loop of evolutionary search and hindsight learning, achieving significant performance gains on the ARC-AGI benchmark by fine-tuning the model with problem-solution pairs from search attempts.
Authors:Renxiang Qiu, Raghavendra Selvan
Abstract:
We present UniPhyNet, a novel neural network architecture to classify cognitive load using multimodal physiological data -- specifically EEG, ECG and EDA signals -- without the explicit need for extracting hand-crafted features. UniPhyNet integrates multiscale parallel convolutional blocks and ResNet-type blocks enhanced with channel block attention module to focus on the informative features while a bidirectional gated recurrent unit is used to capture temporal dependencies. This architecture processes and combines signals in both unimodal and multimodal configurations via intermediate fusion of learned feature maps. On the CL-Drive dataset, UniPhyNet improves raw signal classification accuracy from 70% to 80% (binary) and 62% to 74% (ternary), outperforming feature-based models, demonstrating its effectiveness as an end-to-end solution for real-world cognitive state monitoring.
UniPhyNet是一种新型神经网络,通过多尺度卷积和注意力机制直接从EEG、ECG和EDA原始信号中分类认知负荷,在CL-Drive数据集上实现了显著准确率提升。
UniPhyNet is a novel neural network that classifies cognitive load directly from raw EEG, ECG, and EDA signals using multiscale convolutional and attention mechanisms, achieving significant accuracy improvements on the CL-Drive dataset.
Authors:Kai Yi, Kiarash Jamali, Sjors H. W. Scheres
Abstract:
The recent breakthrough of AlphaFold3 in modeling complex biomolecular interactions, including those between proteins and ligands, nucleotides, or metal ions, creates new opportunities for protein design. In so-called inverse protein folding, the objective is to find a sequence of amino acids that adopts a target protein structure. Many inverse folding methods struggle to predict sequences for complexes that contain non-protein components, and perform poorly with complexes that adopt multiple structural states. To address these challenges, we present ADFLIP (All-atom Discrete FLow matching Inverse Protein folding), a generative model based on discrete flow-matching for designing protein sequences conditioned on all-atom structural contexts. ADFLIP progressively incorporates predicted amino acid side chains as structural context during sequence generation and enables the design of dynamic protein complexes through ensemble sampling across multiple structural states. Furthermore, ADFLIP implements training-free classifier guidance sampling, which allows the incorporation of arbitrary pre-trained models to optimise the designed sequence for desired protein properties. We evaluated the performance of ADFLIP on protein complexes with small-molecule ligands, nucleotides, or metal ions, including dynamic complexes for which structure ensembles were determined by nuclear magnetic resonance (NMR). Our model achieves state-of-the-art performance in single-structure and multi-structure inverse folding tasks, demonstrating excellent potential for all-atom protein design. The code is available at https://github.com/ykiiiiii/ADFLIP.
中文摘要:AlphaFold3在模拟生物分子相互作用方面的突破为蛋白质设计带来新机遇,为此我们开发了ADFLIP生成模型,它通过全原子结构上下文和集成采样,在包含非蛋白质成分的动态复合物逆向折叠任务中实现了最先进的性能。
English Summary: AlphaFold3's advancements in modeling biomolecular interactions have opened new avenues for protein design, leading to the development of ADFLIP, a generative model that excels at inverse protein folding for dynamic complexes with non-protein components by incorporating all-atom structural contexts and ensemble sampling.
Authors:Shashanka Venkataramanan, Valentinos Pariza, Mohammadreza Salehi, Lukas Knobel, Spyros Gidaris, Elias Ramzi, Andrei Bursuc, Yuki M. Asano
Abstract:
We present Franca (pronounced Fran-ka): free one; the first fully open-source (data, code, weights) vision foundation model that matches and in many cases surpasses the performance of state-of-the-art proprietary models, e.g., DINOv2, CLIP, SigLIPv2, etc. Our approach is grounded in a transparent training pipeline inspired by Web-SSL and uses publicly available data: ImageNet-21K and a subset of ReLAION-2B. Beyond model release, we tackle critical limitations in SSL clustering methods. While modern models rely on assigning image features to large codebooks via clustering algorithms like Sinkhorn-Knopp, they fail to account for the inherent ambiguity in clustering semantics. To address this, we introduce a parameter-efficient, multi-head clustering projector based on nested Matryoshka representations. This design progressively refines features into increasingly fine-grained clusters without increasing the model size, enabling both performance and memory efficiency. Additionally, we propose a novel positional disentanglement strategy that explicitly removes positional biases from dense representations, thereby improving the encoding of semantic content. This leads to consistent gains on several downstream benchmarks, demonstrating the utility of cleaner feature spaces. Our contributions establish a new standard for transparent, high-performance vision models and open a path toward more reproducible and generalizable foundation models for the broader AI community. The code and model checkpoints are available at https://github.com/valeoai/Franca.
中文摘要:Franca是首个完全开源的视觉基础模型,通过透明训练流程及多头部聚类、位置解耦等创新技术,在性能上超越主流专有模型,为高性能、可复现的人工智能设立了新标杆。
English Summary: Franca is the first fully open-source vision foundation model that outperforms leading proprietary models through a transparent training pipeline and innovative techniques like multi-head clustering and positional disentanglement, setting a new standard for high-performance, reproducible AI.
Authors:Shravan Venkatraman, Pavan Kumar S, Rakesh Raj Madavan, Chandrakala S
Abstract:
Accurate classification of computed tomography (CT) images is essential for diagnosis and treatment planning, but existing methods often struggle with the subtle and spatially diverse nature of pathological features. Current approaches typically process images uniformly, limiting their ability to detect localized abnormalities that require focused analysis. We introduce UGPL, an uncertainty-guided progressive learning framework that performs a global-to-local analysis by first identifying regions of diagnostic ambiguity and then conducting detailed examination of these critical areas. Our approach employs evidential deep learning to quantify predictive uncertainty, guiding the extraction of informative patches through a non-maximum suppression mechanism that maintains spatial diversity. This progressive refinement strategy, combined with an adaptive fusion mechanism, enables UGPL to integrate both contextual information and fine-grained details. Experiments across three CT datasets demonstrate that UGPL consistently outperforms state-of-the-art methods, achieving improvements of 3.29%, 2.46%, and 8.08% in accuracy for kidney abnormality, lung cancer, and COVID-19 detection, respectively. Our analysis shows that the uncertainty-guided component provides substantial benefits, with performance dramatically increasing when the full progressive learning pipeline is implemented. Our code is available at: https://github.com/shravan-18/UGPL
Chinese: UGPL提出了一种不确定性引导的渐进学习框架,通过识别诊断模糊区域并进行精细局部分析来提升CT图像分类效果,在多个医疗数据集上实现了显著准确率提升。
English: UGPL introduces an uncertainty-guided progressive learning framework that enhances CT image classification by identifying ambiguous regions and conducting detailed local analysis, achieving significant accuracy improvements across multiple medical datasets.
Authors:PaweÅ Budzianowski, Wesley Maa, Matthew Freed, Jingxiang Mo, Winston Hsiao, Aaron Xie, Tomasz MÅoduchowski, Viraj Tipnis, Benjamin Bolte
Abstract:
Vision-Language Models (VLMs) have emerged as a promising approach to address the data scarcity challenge in robotics, enabling the development of generalizable visuomotor control policies. While models like OpenVLA showcase the potential of this paradigm, deploying large-scale VLMs on resource-constrained mobile manipulation systems remains a significant hurdle. This paper introduces Edge VLA (EVLA), a novel approach designed to significantly enhance the inference speed of Vision-Language-Action (VLA) models. EVLA maintains the representational power of these models while enabling real-time performance on edge devices. We achieve this through two key innovations: 1) Eliminating the autoregressive requirement for end-effector position prediction, leading to a 7x speedup in inference, and 2) Leveraging the efficiency of Small Language Models (SLMs), demonstrating comparable training performance to larger models with significantly reduced computational demands. Our early results demonstrate that EVLA achieves comparable training characteristics to OpenVLA while offering substantial gains in inference speed and memory efficiency. We release our model checkpoints and training \href{https://github.com/kscalelabs/evla }{codebase} to foster further research.
中文摘要:Edge VLA(EVLA)通过消除自回归位置预测并采用小型语言模型,在保持性能的同时实现了7倍推理加速,使视觉语言动作模型能在边缘设备上实时运行。
English Summary: Edge VLA (EVLA) enhances Vision-Language-Action models by eliminating autoregressive position prediction and using Small Language Models, achieving 7x faster inference while maintaining performance on edge devices.
Authors:Zhanli Wu, Fabrizio Leisen, F. Javier Rubio
Abstract:
Regression problems with bounded continuous outcomes frequently arise in real-world statistical and machine learning applications, such as the analysis of rates and proportions. A central challenge in this setting is predicting a response associated with a new covariate value. Most of the existing statistical and machine learning literature has focused either on point prediction of bounded outcomes or on interval prediction based on asymptotic approximations. We develop conformal prediction intervals for bounded outcomes based on transformation models and beta regression. We introduce tailored non-conformity measures based on residuals that are aligned with the underlying models, and account for the inherent heteroscedasticity in regression settings with bounded outcomes. We present a theoretical result on asymptotic marginal and conditional validity in the context of full conformal prediction, which remains valid under model misspecification. For split conformal prediction, we provide an empirical coverage analysis based on a comprehensive simulation study. The simulation study demonstrates that both methods provide valid finite-sample predictive coverage, including settings with model misspecification. Finally, we demonstrate the practical performance of the proposed conformal prediction intervals on real data and compare them with bootstrap-based alternatives.
中文: 本研究基于变换模型和贝塔回归开发了有界结果的保形预测区间,引入了针对异方差性设计的非一致性度量,确保即使在模型误设情况下也能提供有效的预测覆盖,并通过模拟和实际数据与自助法对比验证了其性能。
English: This study develops conformal prediction intervals for bounded outcomes using transformation models and beta regression, introducing tailored non-conformity measures to address heteroscedasticity and ensure valid predictive coverage even under model misspecification, as validated through simulations and real data comparisons with bootstrap methods.
Authors:Itay Katav, Aryeh Kontorovich
Abstract:
Modern multivariate time series forecasting primarily relies on two architectures: the Transformer with attention mechanism and Mamba. In natural language processing, an approach has been used that combines local window attention for capturing short-term dependencies and Mamba for capturing long-term dependencies, with their outputs averaged to assign equal weight to both. We find that for time-series forecasting tasks, assigning equal weight to long-term and short-term dependencies is not optimal. To mitigate this, we propose a dynamic weighting mechanism, ParallelTime Weighter, which calculates interdependent weights for long-term and short-term dependencies for each token based on the input and the model's knowledge. Furthermore, we introduce the ParallelTime architecture, which incorporates the ParallelTime Weighter mechanism to deliver state-of-the-art performance across diverse benchmarks. Our architecture demonstrates robustness, achieves lower FLOPs, requires fewer parameters, scales effectively to longer prediction horizons, and significantly outperforms existing methods. These advances highlight a promising path for future developments of parallel Attention-Mamba in time series forecasting. The implementation is readily available at: \href{https://github.com/itay1551/ParallelTime}{GitHub}.
中文: 本研究提出了ParallelTime架构,通过ParallelTime加权器动态调整时间序列预测中的长短期依赖权重,以更少参数和计算量实现了卓越性能。
English: The study introduces ParallelTime, a novel architecture that dynamically weights long-term and short-term dependencies in time series forecasting using a ParallelTime Weighter, achieving superior performance with fewer parameters and computational costs.
Authors:Pablo Marcos-Manchón, LluÃs Fuentemilla
Abstract:
A fundamental question in cognitive neuroscience is what shapes visual perception: the external world's structure or the brain's internal architecture. Although some perceptual variability can be traced to individual differences, brain responses to naturalistic stimuli evoke similar activity patterns across individuals, suggesting a convergent representational principle. Here, we test if this stimulus-driven convergence follows a common trajectory across people and deep neural networks (DNNs) during its transformation from sensory to high-level internal representations. We introduce a unified framework that traces representational flow by combining inter-subject similarity with alignment to model hierarchies. Applying this framework to three independent fMRI datasets of visual scene perception, we reveal a cortex-wide network, conserved across individuals, organized into two pathways: a medial-ventral stream for scene structure and a lateral-dorsal stream tuned for social and biological content. This functional organization is captured by the hierarchies of vision DNNs but not language models, reinforcing the specificity of the visual-to-semantic transformation. These findings show a convergent computational solution for visual encoding in both human and artificial vision, driven by the structure of the external world.
中文摘要:该研究揭示了人类大脑中存在一个保守的全皮层网络,分为处理场景结构的腹内侧通路和感知社会内容的背外侧通路,这一功能结构与深度神经网络的层级处理相吻合,表明外部世界结构驱动了视觉编码的趋同计算方案。
English Summary: This study reveals a conserved cortex-wide network in humans, organized into two visual pathways for scene structure and social content, which aligns with the hierarchical processing in deep neural networks, demonstrating a convergent computational solution for visual encoding driven by external world structure.
Authors:Kobi Hackenburg, Ben M. Tappin, Luke Hewitt, Ed Saunders, Sid Black, Hause Lin, Catherine Fist, Helen Margetts, David G. Rand, Christopher Summerfield
Abstract:
There are widespread fears that conversational AI could soon exert unprecedented influence over human beliefs. Here, in three large-scale experiments (N=76,977), we deployed 19 LLMs-including some post-trained explicitly for persuasion-to evaluate their persuasiveness on 707 political issues. We then checked the factual accuracy of 466,769 resulting LLM claims. Contrary to popular concerns, we show that the persuasive power of current and near-future AI is likely to stem more from post-training and prompting methods-which boosted persuasiveness by as much as 51% and 27% respectively-than from personalization or increasing model scale. We further show that these methods increased persuasion by exploiting LLMs' unique ability to rapidly access and strategically deploy information and that, strikingly, where they increased AI persuasiveness they also systematically decreased factual accuracy.
Chinese: 研究表明,当前对话式AI的说服力主要源于后训练和提示技术,这些方法虽大幅提升了说服效果,却系统性地降低了事实准确性,而非个性化或模型规模扩大所致。
English: Current research reveals that the persuasive power of conversational AI primarily stems from post-training and prompting techniques, which significantly enhance persuasiveness while simultaneously reducing factual accuracy, rather than from personalization or model scaling.
Authors:Lei Xu, Torkel B Brismar
Abstract:
We have developed a novel CT image analysis package named AnatomyArchive, built on top of the recent full body segmentation model TotalSegmentator. It provides automatic target volume selection and deselection capabilities according to user-configured anatomies for volumetric upper- and lower-bounds. It has a knowledge graph-based and time efficient tool for anatomy segmentation mask management and medical image database maintenance. AnatomyArchive enables automatic body volume cropping, as well as automatic arm-detection and exclusion, for more precise body composition analysis in both 2D and 3D formats. It provides robust voxel-based radiomic feature extraction, feature visualization, and an integrated toolchain for statistical tests and analysis. A python-based GPU-accelerated nearly photo-realistic segmentation-integrated composite cinematic rendering is also included. We present here its software architecture design, illustrate its workflow and working principle of algorithms as well provide a few examples on how the software can be used to assist development of modern machine learning models. Open-source codes will be released at https://github.com/lxu-medai/AnatomyArchive for only research and educational purposes.
中文摘要:AnatomyArchive是基于TotalSegmentator开发的新型CT图像分析工具包,提供自动体积选择、身体成分分析、影像组学特征提取和电影级渲染功能,旨在辅助现代机器学习模型的开发。
English Summary: AnatomyArchive is a novel CT image analysis package built on TotalSegmentator, offering automated volume selection, body composition analysis, radiomic feature extraction, and cinematic rendering to support modern machine learning model development.
Authors:Enhao Cheng, Shoujia Zhang, Jianhua Yin, Xuemeng Song, Tian Gan, Liqiang Nie
Abstract:
Short text clustering has become increasingly important with the popularity of social media like Twitter, Google+, and Facebook. Existing methods can be broadly categorized into two paradigms: topic model-based approaches and deep representation learning-based approaches. This task is inherently challenging due to the sparse, large-scale, and high-dimensional characteristics of the short text data. Furthermore, the computational intensity required by representation learning significantly increases the running time. To address these issues, we propose a collapsed Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture model (GSDMM), which effectively handles the sparsity and high dimensionality of short texts while identifying representative words for each cluster. Based on several aspects of GSDMM that warrant further refinement, we propose an improved approach, GSDMM+, designed to further optimize its performance. GSDMM+ reduces initialization noise and adaptively adjusts word weights based on entropy, achieving fine-grained clustering that reveals more topic-related information. Additionally, strategic cluster merging is employed to refine clustering granularity, better aligning the predicted distribution with the true category distribution. We conduct extensive experiments, comparing our methods with both classical and state-of-the-art approaches. The experimental results demonstrate the efficiency and effectiveness of our methods. The source code for our model is publicly available at https://github.com/chehaoa/VEMC.
中文摘要:作者提出改进的GSDMM+短文本聚类算法,通过优化初始化、基于熵的自适应词权重调整和策略性簇合并,在实验中展现出更优的聚类效果与效率。
English Summary: The authors propose an enhanced GSDMM+ algorithm for short text clustering that improves initialization, adaptively weights words using entropy, and refines clusters through strategic merging, demonstrating superior efficiency and effectiveness in experiments.
Authors:Tongtong Su, Chengyu Wang, Bingyan Liu, Jun Huang, Dongming Lu
Abstract:
In recent years, large text-to-video (T2V) synthesis models have garnered considerable attention for their abilities to generate videos from textual descriptions. However, achieving both high imaging quality and effective motion representation remains a significant challenge for these T2V models. Existing approaches often adapt pre-trained text-to-image (T2I) models to refine video frames, leading to issues such as flickering and artifacts due to inconsistencies across frames. In this paper, we introduce EVS, a training-free Encapsulated Video Synthesizer that composes T2I and T2V models to enhance both visual fidelity and motion smoothness of generated videos. Our approach utilizes a well-trained diffusion-based T2I model to refine low-quality video frames by treating them as out-of-distribution samples, effectively optimizing them with noising and denoising steps. Meanwhile, we employ T2V backbones to ensure consistent motion dynamics. By encapsulating the T2V temporal-only prior into the T2I generation process, EVS successfully leverages the strengths of both types of models, resulting in videos of improved imaging and motion quality. Experimental results validate the effectiveness of our approach compared to previous approaches. Our composition process also leads to a significant improvement of 1.6x-4.5x speedup in inference time. Source codes: https://github.com/Tonniia/EVS.
中文: 本文提出EVS,一种无需训练的方法,通过结合文本到图像和文本到视频模型来提升视频画质和运动流畅度,同时大幅加快推理速度。
English: The paper introduces EVS, a training-free method that combines text-to-image and text-to-video models to enhance video quality and motion smoothness while significantly speeding up inference.
Authors:Zizhao Zhang, Tianxiang Zhao, Yu Sun, Liping Sun, Jichuan Kang
Abstract:
To address the challenges posed by cascading reactions caused by component failures in autonomous cargo ships (ACS) and the uncertainties in emergency decision-making, this paper proposes a novel hybrid feature fusion framework for constructing a graph-structured dataset of failure modes. By employing an improved cuckoo search algorithm (HN-CSA), the literature retrieval efficiency is significantly enhanced, achieving improvements of 7.1% and 3.4% compared to the NSGA-II and CSA search algorithms, respectively. A hierarchical feature fusion framework is constructed, using Word2Vec encoding to encode subsystem/component features, BERT-KPCA to process failure modes/reasons, and Sentence-BERT to quantify the semantic association between failure impact and emergency decision-making. The dataset covers 12 systems, 1,262 failure modes, and 6,150 propagation paths. Validation results show that the GATE-GNN model achieves a classification accuracy of 0.735, comparable to existing benchmarks. Additionally, a silhouette coefficient of 0.641 indicates that the features are highly distinguishable. In the label prediction results, the Shore-based Meteorological Service System achieved an F1 score of 0.93, demonstrating high prediction accuracy. This paper not only provides a solid foundation for failure analysis in autonomous cargo ships but also offers reliable support for fault diagnosis, risk assessment, and intelligent decision-making systems. The link to the dataset is https://github.com/wojiufukele/Graph-Structured-about-CSA.
本文提出了一种混合特征融合框架,通过改进的布谷鸟搜索算法优化数据检索,并利用GATE-GNN模型对自主货船故障模式进行分类,在故障预测和决策支持方面实现了高精度。
This paper introduces a hybrid feature fusion framework to analyze failure modes in autonomous cargo ships, using an improved cuckoo search algorithm to enhance data retrieval and a GATE-GNN model for classification, achieving high accuracy in fault prediction and decision support.
Authors:Xiao Wang, Qian Zhu, Shujuan Wu, Bo Jiang, Shiliang Zhang, Yaowei Wang, Yonghong Tian, Bin Luo
Abstract:
Recent researchers have proposed using event cameras for person re-identification (ReID) due to their promising performance and better balance in terms of privacy protection, event camera-based person ReID has attracted significant attention. Currently, mainstream event-based person ReID algorithms primarily focus on fusing visible light and event stream, as well as preserving privacy. Although significant progress has been made, these methods are typically trained and evaluated on small-scale or simulated event camera datasets, making it difficult to assess their real identification performance and generalization ability. To address the issue of data scarcity, this paper introduces a large-scale RGB-event based person ReID dataset, called EvReID. The dataset contains 118,988 image pairs and covers 1200 pedestrian identities, with data collected across multiple seasons, scenes, and lighting conditions. We also evaluate 15 state-of-the-art person ReID algorithms, laying a solid foundation for future research in terms of both data and benchmarking. Based on our newly constructed dataset, this paper further proposes a pedestrian attribute-guided contrastive learning framework to enhance feature learning for person re-identification, termed TriPro-ReID. This framework not only effectively explores the visual features from both RGB frames and event streams, but also fully utilizes pedestrian attributes as mid-level semantic features. Extensive experiments on the EvReID dataset and MARS datasets fully validated the effectiveness of our proposed RGB-Event person ReID framework. The benchmark dataset and source code will be released on https://github.com/Event-AHU/Neuromorphic_ReID
中文摘要:本文针对事件相机行人重识别数据稀缺问题,提出了大规模RGB-事件数据集EvReID,并开发了TriPro-ReID对比学习框架,有效融合RGB帧、事件流和行人属性以提升特征学习能力。
English Summary: This paper introduces EvReID, a large-scale RGB-event person re-identification dataset addressing data scarcity issues, and proposes TriPro-ReID, a contrastive learning framework that effectively integrates RGB frames, event streams, and pedestrian attributes to enhance feature learning.
Authors:Seungjun Moon, Sangjoon Yu, Gyeong-Moon Park
Abstract:
The rapid advancement of neural radiance fields (NeRF) has paved the way to generate animatable human avatars from a monocular video. However, the sole usage of NeRF suffers from a lack of details, which results in the emergence of hybrid representation that utilizes SMPL-based mesh together with NeRF representation. While hybrid-based models show photo-realistic human avatar generation qualities, they suffer from extremely slow inference due to their deformation scheme: to be aligned with the mesh, hybrid-based models use the deformation based on SMPL skinning weights, which needs high computational costs on each sampled point. We observe that since most of the sampled points are located in empty space, they do not affect the generation quality but result in inference latency with deformation. In light of this observation, we propose EPSilon, a hybrid-based 3D avatar generation scheme with novel efficient point sampling strategies that boost both training and inference. In EPSilon, we propose two methods to omit empty points at rendering; empty ray omission (ERO) and empty interval omission (EIO). In ERO, we wipe out rays that progress through the empty space. Then, EIO narrows down the sampling interval on the ray, which wipes out the region not occupied by either clothes or mesh. The delicate sampling scheme of EPSilon enables not only great computational cost reduction during deformation but also the designation of the important regions to be sampled, which enables a single-stage NeRF structure without hierarchical sampling. Compared to existing methods, EPSilon maintains the generation quality while using only 3.9% of sampled points and achieves around 20 times faster inference, together with 4 times faster training convergence. We provide video results on https://github.com/seungjun-moon/epsilon.
Chinese Summary: EPSilon通过创新的高效点采样策略ERO和EIO,在混合三维化身生成中剔除空域采样,仅用3.9%采样点就实现20倍推理加速和4倍训练收敛提速,同时保持生成质量。
English Summary: EPSilon introduces efficient point sampling strategies, ERO and EIO, to eliminate empty space sampling in hybrid 3D avatar generation, achieving 20x faster inference and 4x quicker training while maintaining quality with only 3.9% of sampled points.
Authors:Hanbing Zheng, Chenlei Lv
Abstract:
As an important metric for mesh quality evaluation, the isotropy property holds significant value for applications such as texture UV-mapping, physical simulation, and discrete geometric analysis. Classical isotropy remeshing methods adjust vertices and edge lengths, which exhibit certain limitations in terms of input data sensitivity, geometric consistency control, and convergence speed. In this paper, we propose an improved isotropy remeshing solution with inter-angle optimization during mesh editing to enhance shape control capability and accelerate convergence. The advantage of the solution lies in its ability to predict the impact of edge length adjustments on subsequent optimization by monitoring angle transformations. It avoids inefficient editing that may cause performance fluctuations, thereby improving efficiency. Experiments demonstrate that the proposed method effectively improves the overall efficiency of mesh optimization.
中文: 本文提出的各向同性重网格方法通过在网格编辑中优化夹角来增强形状控制并加速收敛,通过预测性边长度调整有效提升了整体效率。
English: The proposed isotropy remeshing method enhances shape control and accelerates convergence by optimizing inter-angles during mesh editing, effectively improving overall efficiency through predictive edge length adjustments.
Authors:Binxiong Li, Xu Xiang, Xue Li, Binyu Zhao, Heyang Gao, Qinyu Zhao
Abstract:
In recent years, models based on Graph Convolutional Networks (GCN) have made significant strides in the field of graph data analysis. However, challenges such as over-smoothing and over-compression remain when handling large-scale and complex graph datasets, leading to a decline in clustering quality. Although the Graph Transformer architecture has mitigated some of these issues, its performance is still limited when processing heterogeneous graph data. To address these challenges, this study proposes a novel deep clustering framework that comprising GCN, Autoencoder (AE), and Graph Transformer, termed the Tri-Learn Graph Fusion Network (Tri-GFN). This framework enhances the differentiation and consistency of global and local information through a unique tri-learning mechanism and feature fusion enhancement strategy. The framework integrates GCN, AE, and Graph Transformer modules. These components are meticulously fused by a triple-channel enhancement module, which maximizes the use of both node attributes and topological structures, ensuring robust clustering representation. The tri-learning mechanism allows mutual learning among these modules, while the feature fusion strategy enables the model to capture complex relationships, yielding highly discriminative representations for graph clustering. It surpasses many state-of-the-art methods, achieving an accuracy improvement of approximately 0.87% on the ACM dataset, 14.14 % on the Reuters dataset, and 7.58 % on the USPS dataset. Due to its outstanding performance on the Reuters dataset, Tri-GFN can be applied to automatic news classification, topic retrieval, and related fields.
中文: 本研究提出Tri-GFN深度学习框架,通过融合图卷积网络、自编码器和图变换器的三重学习机制增强全局与局部信息表征,在多个数据集上实现显著性能提升,可应用于新闻自动分类等领域。
English: This study introduces Tri-GFN, a deep clustering framework that integrates GCN, Autoencoder, and Graph Transformer with a tri-learning mechanism to enhance global and local information representation, achieving superior performance on multiple datasets and enabling applications like automatic news classification.
Authors:Dmitrii Mikhailov, Aleksey Letunovskiy, Maria Kovaleva, Vladimir Arkhipkin, Vladimir Korviakov, Vladimir Polovnikov, Viacheslav Vasilev, Evelina Sidorova, Denis Dimitrov
Abstract:
Recent progress in transformer-based architectures has demonstrated remarkable success in video generation tasks. However, the quadratic complexity of full attention mechanisms remains a critical bottleneck, particularly for high-resolution and long-duration video sequences. In this paper, we propose NABLA, a novel Neighborhood Adaptive Block-Level Attention mechanism that dynamically adapts to sparsity patterns in video diffusion transformers (DiTs). By leveraging block-wise attention with adaptive sparsity-driven threshold, NABLA reduces computational overhead while preserving generative quality. Our method does not require custom low-level operator design and can be seamlessly integrated with PyTorch's Flex Attention operator. Experiments demonstrate that NABLA achieves up to 2.7x faster training and inference compared to baseline almost without compromising quantitative metrics (CLIP score, VBench score, human evaluation score) and visual quality drop. The code and model weights are available here: https://github.com/gen-ai-team/Wan2.1-NABLA
中文: 本文提出NABLA,一种邻域自适应块级注意力机制,在视频扩散变换器中通过自适应稀疏阈值降低计算复杂度,同时保持生成质量,实现高达2.7倍的训练和推理加速且性能损失极小。
English: The paper introduces NABLA, a neighborhood adaptive block-level attention mechanism that reduces computational complexity in video diffusion transformers while maintaining generative quality, achieving up to 2.7x faster training and inference with minimal performance loss.
Authors:Alexander Kolpakov
Abstract:
We develop a framework for dualizing the Kolmogorov structure function $h_x(α)$, which then allows using computable complexity proxies. We establish a mathematical analogy between information-theoretic constructs and statistical mechanics, introducing a suitable partition function and free energy functional. We explicitly prove the Legendre-Fenchel duality between the structure function and free energy, showing detailed balance of the Metropolis kernel, and interpret acceptance probabilities as information-theoretic scattering amplitudes. A susceptibility-like variance of model complexity is shown to peak precisely at loss-complexity trade-offs interpreted as phase transitions. Practical experiments with linear and tree-based regression models verify these theoretical predictions, explicitly demonstrating the interplay between the model complexity, generalization, and overfitting threshold.
中文: 该研究建立了Kolmogorov结构函数的可计算对偶框架,通过复杂度方差峰值揭示相变现象,并利用回归模型验证了复杂度与泛化能力之间的权衡关系。
English: The study establishes a computable duality framework for the Kolmogorov structure function, revealing phase transitions through complexity variance peaks and validating the theory with regression models to demonstrate complexity-generalization trade-offs.
Authors:Seyyed Saeid Cheshmi, Buyao Lyu, Thomas Lisko, Rajesh Rajamani, Robert A. McGovern, Yogatheesan Varatharajah
Abstract:
Human Activity Recognition (HAR) based on wearable inertial sensors plays a critical role in remote health monitoring. In patients with movement disorders, the ability to detect abnormal patient movements in their home environments can enable continuous optimization of treatments and help alert caretakers as needed. Machine learning approaches have been proposed for HAR tasks using Inertial Measurement Unit (IMU) data; however, most rely on application-specific labels and lack generalizability to data collected in different environments or populations. To address this limitation, we propose a new cross-modal self-supervised pretraining approach to learn representations from large-sale unlabeled IMU-video data and demonstrate improved generalizability in HAR tasks on out of distribution (OOD) IMU datasets, including a dataset collected from patients with Parkinson's disease. Specifically, our results indicate that the proposed cross-modal pretraining approach outperforms the current state-of-the-art IMU-video pretraining approach and IMU-only pretraining under zero-shot and few-shot evaluations. Broadly, our study provides evidence that in highly dynamic data modalities, such as IMU signals, cross-modal pretraining may be a useful tool to learn generalizable data representations. Our software is available at https://github.com/scheshmi/IMU-Video-OOD-HAR.
中文: 本研究提出了一种利用未标记IMU-视频数据进行跨模态自监督预训练的方法,旨在提高人体活动识别的泛化能力,特别是在帕金森病患者等分布外数据集上,该方法的零样本和少样本评估表现优于当前最先进技术。
English: This study introduces a cross-modal self-supervised pretraining method using unlabeled IMU-video data to improve generalizability in human activity recognition, particularly for out-of-distribution datasets like Parkinson's disease patients, outperforming current state-of-the-art approaches in zero-shot and few-shot evaluations.
Authors:Liang Lin, Zhihao Xu, Xuehai Tang, Shi Liu, Biyu Zhou, Fuqing Zhu, Jizhong Han, Songlin Hu
Abstract:
The safety of large language models (LLMs) has garnered significant research attention. In this paper, we argue that previous empirical studies demonstrate LLMs exhibit a propensity to trust information from authoritative sources, such as academic papers, implying new possible vulnerabilities. To verify this possibility, a preliminary analysis is designed to illustrate our two findings. Based on this insight, a novel jailbreaking method, Paper Summary Attack (\llmname{PSA}), is proposed. It systematically synthesizes content from either attack-focused or defense-focused LLM safety paper to construct an adversarial prompt template, while strategically infilling harmful query as adversarial payloads within predefined subsections. Extensive experiments show significant vulnerabilities not only in base LLMs, but also in state-of-the-art reasoning model like Deepseek-R1. PSA achieves a 97\% attack success rate (ASR) on well-aligned models like Claude3.5-Sonnet and an even higher 98\% ASR on Deepseek-R1. More intriguingly, our work has further revealed diametrically opposed vulnerability bias across different base models, and even between different versions of the same model, when exposed to either attack-focused or defense-focused papers. This phenomenon potentially indicates future research clues for both adversarial methodologies and safety alignment.Code is available at https://github.com/233liang/Paper-Summary-Attack
Chinese: 本文提出“论文摘要攻击”(PSA)这一新型越狱方法,通过将恶意查询嵌入合成的论文摘要中,利用大语言模型对学术文献的信任,实现了高达98%的攻击成功率,并揭示了不同模型间甚至同模型不同版本间相反的脆弱性偏向。
English: This paper introduces the Paper Summary Attack (PSA), a novel jailbreaking method that exploits LLMs' trust in academic papers by embedding harmful queries into synthesized paper summaries, achieving up to 98% attack success rate and revealing contrasting vulnerability biases across models.
Authors:Joost Mertens, Joachim Denil
Abstract:
The research topic of digital twins has attracted a large amount of interest over the past decade. However, publicly available exemplars remain scarce. In the interest of open and reproducible science, in this exemplar paper we present a lab-scale gantry crane and its digital twin. The exemplar comprises both the physical and digital side of the twin system. The physical side consists of the physical crane and its controller. The digital side covers the CAD models and kinematic model of the crane, and provides services for optimal control, historical data logging, data visualization and continuous validation. We used this setup as use case in several previous publications where its functionality was validated. It is publicly available and only relies on other freely available and commonly used software, this way we hope it can be used for future research or education on the topic of digital twins.
中文: 本文介绍了一个公开的实验室门式起重机数字孪生范例,包含物理组件和用于控制与验证的数字服务,旨在支持该领域的开放式研究与教育。
English: This paper presents a publicly available digital twin exemplar of a lab-scale gantry crane, comprising both physical components and digital services for control and validation, to support open research and education in the field.
Authors:Aleksey Lapin, Igor Hromov, Stanislav Chumakov, Mile Mitrovic, Dmitry Simakov, Nikolay O. Nikitin, Andrey V. Savchenko
Abstract:
AutoML has advanced in handling complex tasks using the integration of LLMs, yet its efficiency remains limited by dependence on specific underlying tools. In this paper, we introduce LightAutoDS-Tab, a multi-AutoML agentic system for tasks with tabular data, which combines an LLM-based code generation with several AutoML tools. Our approach improves the flexibility and robustness of pipeline design, outperforming state-of-the-art open-source solutions on several data science tasks from Kaggle. The code of LightAutoDS-Tab is available in the open repository https://github.com/sb-ai-lab/LADS
中文摘要:LightAutoDS-Tab是一个多智能体AutoML系统,通过将基于大语言模型的代码生成与多种AutoML工具相结合,在表格数据任务中提升了流程设计的灵活性和鲁棒性,在多个Kaggle数据科学任务上超越了现有最优解决方案。
English Summary: LightAutoDS-Tab is a multi-agent AutoML system that enhances flexibility and robustness in tabular data tasks by integrating LLM-based code generation with multiple AutoML tools, outperforming existing solutions on Kaggle challenges.
Authors:Qingyun Sun, Jiaqi Yuan, Shan He, Xiao Guan, Haonan Yuan, Xingcheng Fu, Jianxin Li, Philip S. Yu
Abstract:
Graph Retrieval-Augmented Generation has emerged as a powerful paradigm for grounding large language models with external structured knowledge. However, existing Graph RAG methods struggle with temporal reasoning, due to their inability to model the evolving structure and order of real-world events. In this work, we introduce DyG-RAG, a novel event-centric dynamic graph retrieval-augmented generation framework designed to capture and reason over temporal knowledge embedded in unstructured text. To eliminate temporal ambiguity in traditional retrieval units, DyG-RAG proposes Dynamic Event Units (DEUs) that explicitly encode both semantic content and precise temporal anchors, enabling accurate and interpretable time-aware retrieval. To capture temporal and causal dependencies across events, DyG-RAG constructs an event graph by linking DEUs that share entities and occur close in time, supporting efficient and meaningful multi-hop reasoning. To ensure temporally consistent generation, DyG-RAG introduces an event timeline retrieval pipeline that retrieves event sequences via time-aware traversal, and proposes a Time Chain-of-Thought strategy for temporally grounded answer generation. This unified pipeline enables DyG-RAG to retrieve coherent, temporally ordered event sequences and to answer complex, time-sensitive queries that standard RAG systems cannot resolve. Extensive experiments on temporal QA benchmarks demonstrate that DyG-RAG significantly improves the accuracy and recall of three typical types of temporal reasoning questions, paving the way for more faithful and temporal-aware generation. DyG-RAG is available at https://github.com/RingBDStack/DyG-RAG.
中文摘要:DyG-RAG提出了一种动态图检索增强生成框架,通过为事件编码精确时间锚点并构建事件图实现时间感知的多跳推理,显著提升了时序问答任务的准确性。
English Summary: DyG-RAG introduces a dynamic graph retrieval-augmented generation framework that enhances temporal reasoning by encoding events with precise time anchors and constructing event graphs for time-aware multi-hop reasoning, significantly improving accuracy on temporal question-answering tasks.
Authors:Chihiro Noguchi, Takaki Yamamoto
Abstract:
Accurate perception of the surrounding environment is essential for safe autonomous driving. 3D occupancy prediction, which estimates detailed 3D structures of roads, buildings, and other objects, is particularly important for vision-centric autonomous driving systems that do not rely on LiDAR sensors. However, in 3D semantic occupancy prediction -- where each voxel is assigned a semantic label -- annotated LiDAR point clouds are required, making data acquisition costly. In contrast, large-scale binary occupancy data, which only indicate occupied or free space without semantic labels, can be collected at a lower cost. Despite their availability, the potential of leveraging such data remains unexplored. In this study, we investigate the utilization of large-scale binary occupancy data from two perspectives: (1) pre-training and (2) learning-based auto-labeling. We propose a novel binary occupancy-based framework that decomposes the prediction process into binary and semantic occupancy modules, enabling effective use of binary occupancy data. Our experimental results demonstrate that the proposed framework outperforms existing methods in both pre-training and auto-labeling tasks, highlighting its effectiveness in enhancing 3D semantic occupancy prediction. The code is available at https://github.com/ToyotaInfoTech/b2s-occupancy
中文: 本研究提出了一种新框架,通过预训练和自动标注利用低成本二元占据数据来提升3D语义占据预测性能,在视觉为中心的自动驾驶系统中优于现有方法。
English: This study introduces a novel framework that leverages low-cost binary occupancy data through pre-training and auto-labeling to enhance 3D semantic occupancy prediction, outperforming existing methods in vision-centric autonomous driving systems.
Authors:Xiaojian Lin, Wenxin Zhang, Yuchu Jiang, Wangyu Wu, Yiran Guo, Kangxu Wang, Zongzheng Zhang, Guijin Wang, Lei Jin, Hao Zhao
Abstract:
Hierarchical feature representations play a pivotal role in computer vision, particularly in object detection for autonomous driving. Multi-level semantic understanding is crucial for accurately identifying pedestrians, vehicles, and traffic signs in dynamic environments. However, existing architectures, such as YOLO and DETR, struggle to maintain feature consistency across different scales while balancing detection precision and computational efficiency. To address these challenges, we propose Butter, a novel object detection framework designed to enhance hierarchical feature representations for improving detection robustness. Specifically, Butter introduces two key innovations: Frequency-Adaptive Feature Consistency Enhancement (FAFCE) Component, which refines multi-scale feature consistency by leveraging adaptive frequency filtering to enhance structural and boundary precision, and Progressive Hierarchical Feature Fusion Network (PHFFNet) Module, which progressively integrates multi-level features to mitigate semantic gaps and strengthen hierarchical feature learning. Through extensive experiments on BDD100K, KITTI, and Cityscapes, Butter demonstrates superior feature representation capabilities, leading to notable improvements in detection accuracy while reducing model complexity. By focusing on hierarchical feature refinement and integration, Butter provides an advanced approach to object detection that achieves a balance between accuracy, deployability, and computational efficiency in real-time autonomous driving scenarios. Our model and implementation are publicly available at https://github.com/Aveiro-Lin/Butter, facilitating further research and validation within the autonomous driving community.
中文: Butter框架通过频率自适应特征一致性增强和渐进式分层特征融合网络,提升了自动驾驶目标检测中的分层特征表示能力,在多个基准测试中实现了更高的精度与效率。
English: The Butter framework enhances hierarchical feature representations in object detection for autonomous driving through its Frequency-Adaptive Feature Consistency Enhancement and Progressive Hierarchical Feature Fusion Network, achieving improved accuracy and efficiency across multiple benchmarks.
Authors:Atharv Goel, Mehar Khurana
Abstract:
Modern 3D object detection datasets are constrained by narrow class taxonomies and costly manual annotations, limiting their ability to scale to open-world settings. In contrast, 2D vision-language models trained on web-scale image-text pairs exhibit rich semantic understanding and support open-vocabulary detection via natural language prompts. In this work, we leverage the maturity and category diversity of 2D foundation models to perform open-vocabulary 3D object detection without any human-annotated 3D labels.
Our pipeline uses a 2D vision-language detector to generate text-conditioned proposals, which are segmented with SAM and back-projected into 3D using camera geometry and either LiDAR or monocular pseudo-depth. We introduce a geometric inflation strategy based on DBSCAN clustering and Rotating Calipers to infer 3D bounding boxes without training. To simulate adverse real-world conditions, we construct Pseudo-nuScenes, a fog-augmented, RGB-only variant of the nuScenes dataset.
Experiments demonstrate that our method achieves competitive localization performance across multiple settings, including LiDAR-based and purely RGB-D inputs, all while remaining training-free and open-vocabulary. Our results highlight the untapped potential of 2D foundation models for scalable 3D perception. We open-source our code and resources at https://github.com/atharv0goel/open-world-3D-det.
中文摘要:本研究提出一种无需训练的开词汇3D物体检测方法,通过利用2D视觉语言模型生成提案并结合几何策略推断3D边界框,在无人标注3D标签的情况下实现了具有竞争力的检测性能。
English Summary: This work introduces a training-free method for open-vocabulary 3D object detection by leveraging 2D vision-language models to generate proposals and geometric strategies to infer 3D bounding boxes, achieving competitive performance without human-annotated 3D labels.
Authors:Binbin Ji, Siddharth Agrawal, Qiance Tang, Yvonne Wu
Abstract:
This study investigates the spatial reasoning capabilities of vision-language models (VLMs) through Chain-of-Thought (CoT) prompting and reinforcement learning. We begin by evaluating the impact of different prompting strategies and find that simple CoT formats, where the model generates a reasoning step before the answer, not only fail to help, but can even harm the model's original performance. In contrast, structured multi-stage prompting based on scene graphs (SceneGraph CoT) significantly improves spatial reasoning accuracy. Furthermore, to improve spatial reasoning ability, we fine-tune models using Group Relative Policy Optimization (GRPO) on the SAT dataset and evaluate their performance on CVBench. Compared to supervised fine-tuning (SFT), GRPO achieves higher accuracy on Pass@1 evaluations and demonstrates superior robustness under out-of-distribution (OOD) conditions. In particular, we find that SFT overfits to surface-level linguistic patterns and may degrade performance when test-time phrasing changes (e.g., from "closer to" to "farther from"). GRPO, on the other hand, generalizes more reliably and maintains stable performance under such shifts. Our findings provide insights into how reinforcement learning and structured prompting improve the spatial reasoning capabilities and generalization behavior of modern VLMs. All code is open source at: https://github.com/Yvonne511/spatial-vlm-investigator
中文摘要:本研究通过场景图结构化提示和GRPO强化学习方法,显著提升了视觉语言模型的空间推理准确性和泛化能力,优于简单思维链方法和监督微调。
English Summary: This research demonstrates that structured scene graph-based prompting and reinforcement learning with GRPO significantly enhance vision-language models' spatial reasoning accuracy and generalization, outperforming simple Chain-of-Thought methods and supervised fine-tuning.
Authors:Le-Anh Tran, Chung Nguyen Tran, Ngoc-Luu Nguyen, Nhan Cach Dang, Jordi Carrabina, David Castells-Rufas, Minh Son Nguyen
Abstract:
This paper introduces a novel deep learning framework for low-light image enhancement, named the Encoder-Decoder Network with Illumination Guidance (EDNIG). Building upon the U-Net architecture, EDNIG integrates an illumination map, derived from Bright Channel Prior (BCP), as a guidance input. This illumination guidance helps the network focus on underexposed regions, effectively steering the enhancement process. To further improve the model's representational power, a Spatial Pyramid Pooling (SPP) module is incorporated to extract multi-scale contextual features, enabling better handling of diverse lighting conditions. Additionally, the Swish activation function is employed to ensure smoother gradient propagation during training. EDNIG is optimized within a Generative Adversarial Network (GAN) framework using a composite loss function that combines adversarial loss, pixel-wise mean squared error (MSE), and perceptual loss. Experimental results show that EDNIG achieves competitive performance compared to state-of-the-art methods in quantitative metrics and visual quality, while maintaining lower model complexity, demonstrating its suitability for real-world applications. The source code for this work is available at https://github.com/tranleanh/ednig.
中文摘要:本文提出EDNIG深度学习框架,通过光照引导和多尺度特征提取在GAN框架中增强低光照图像,以更低复杂度实现了业界领先的增强效果。
English Summary: This paper presents EDNIG, a deep learning framework that enhances low-light images using illumination guidance and multi-scale feature extraction within a GAN framework, achieving state-of-the-art performance with reduced complexity.
Authors:Yang Zhou, Junjie Li, CongYang Ou, Dawei Yan, Haokui Zhang, Xizhe Xue
Abstract:
Due to its extensive applications, aerial image object detection has long been a hot topic in computer vision. In recent years, advancements in Unmanned Aerial Vehicles (UAV) technology have further propelled this field to new heights, giving rise to a broader range of application requirements. However, traditional UAV aerial object detection methods primarily focus on detecting predefined categories, which significantly limits their applicability. The advent of cross-modal text-image alignment (e.g., CLIP) has overcome this limitation, enabling open-vocabulary object detection (OVOD), which can identify previously unseen objects through natural language descriptions. This breakthrough significantly enhances the intelligence and autonomy of UAVs in aerial scene understanding. This paper presents a comprehensive survey of OVOD in the context of UAV aerial scenes. We begin by aligning the core principles of OVOD with the unique characteristics of UAV vision, setting the stage for a specialized discussion. Building on this foundation, we construct a systematic taxonomy that categorizes existing OVOD methods for aerial imagery and provides a comprehensive overview of the relevant datasets. This structured review enables us to critically dissect the key challenges and open problems at the intersection of these fields. Finally, based on this analysis, we outline promising future research directions and application prospects. This survey aims to provide a clear road map and a valuable reference for both newcomers and seasoned researchers, fostering innovation in this rapidly evolving domain. We keep tracing related works at https://github.com/zhouyang2002/OVOD-in-UVA-imagery
中文摘要:本文系统综述了无人机航拍图像中的开放词汇目标检测技术,通过分析现有方法、数据集及核心挑战,为通过自然语言描述实现自主场景理解提供研究路线图。
English Summary: This paper surveys open-vocabulary object detection in UAV aerial imagery, analyzing methods, datasets, and challenges to enhance autonomous scene understanding through natural language descriptions.
Authors:Senqiao Yang, Junyi Li, Xin Lai, Bei Yu, Hengshuang Zhao, Jiaya Jia
Abstract:
Recent advancements in vision-language models (VLMs) have improved performance by increasing the number of visual tokens, which are often significantly longer than text tokens. However, we observe that most real-world scenarios do not require such an extensive number of visual tokens. While the performance drops significantly in a small subset of OCR-related tasks, models still perform accurately in most other general VQA tasks with only 1/4 resolution. Therefore, we propose to dynamically process distinct samples with different resolutions, and present a new paradigm for visual token compression, namely, VisionThink. It starts with a downsampled image and smartly decides whether it is sufficient for problem solving. Otherwise, the model could output a special token to request the higher-resolution image. Compared to existing Efficient VLM methods that compress tokens using fixed pruning ratios or thresholds, VisionThink autonomously decides whether to compress tokens case by case. As a result, it demonstrates strong fine-grained visual understanding capability on OCR-related tasks, and meanwhile saves substantial visual tokens on simpler tasks. We adopt reinforcement learning and propose the LLM-as-Judge strategy to successfully apply RL to general VQA tasks. Moreover, we carefully design a reward function and penalty mechanism to achieve a stable and reasonable image resize call ratio. Extensive experiments demonstrate the superiority, efficiency, and effectiveness of our method. Our code is available at https://github.com/dvlab-research/VisionThink.
中文: VisionThink提出了一种动态视觉令牌压缩方法,通过自适应调整图像分辨率,在保证多数视觉问答任务性能的同时显著提升效率,并强化了OCR任务的细粒度识别能力。
English: VisionThink introduces a dynamic visual token compression method that adaptively processes images at different resolutions, enhancing efficiency while maintaining strong performance across most VQA tasks and improving fine-grained OCR capabilities.
Authors:Jiazheng Li, Hongzhou Lin, Hong Lu, Kaiyue Wen, Zaiwen Yang, Jiaxuan Gao, Yi Wu, Jingzhao Zhang
Abstract:
Reinforcement learning (RL) has emerged as a central paradigm for training large language models (LLMs) in reasoning tasks. Yet recent studies question RL's ability to incentivize reasoning capacity beyond the base model. This raises a key challenge: how can RL be adapted to solve harder reasoning problems more effectively? To address this challenge, we propose a simple yet effective strategy via Question Augmentation: introduce partial solutions during training to reduce problem difficulty and provide more informative learning signals. Our method, QuestA, when applied during RL training on math reasoning tasks, not only improves pass@1 but also pass@k-particularly on problems where standard RL struggles to make progress. This enables continual improvement over strong open-source models such as DeepScaleR and OpenMath Nemotron, further enhancing their reasoning capabilities. We achieve new state-of-the-art results on math benchmarks using 1.5B-parameter models: 72.50% (+10.73%) on AIME24, 62.29% (+12.79%) on AIME25, and 41.67% (+10.11%) on HMMT25. Code, data and model are available at https://github.com/foreverlasting1202/QuestA.
中文: 针对强化学习在提升多步推理能力方面的不足,研究者提出QuestA方法,通过问题增强策略在训练中引入部分解决方案以降低难度并强化学习信号,该方法在数学推理基准测试中取得最优结果,并从理论上解释了其提升样本效率的机制。
English: To address reinforcement learning's limitations in improving multi-step reasoning for challenging problems, the authors propose QuestA, a question augmentation strategy that incorporates partial solutions during training to reduce difficulty and enhance learning signals, achieving state-of-the-art results on math benchmarks and providing theoretical insights into improved sample efficiency.
Authors:Dechen Gao, Boqi Zhao, Andrew Lee, Ian Chuang, Hanchu Zhou, Hang Wang, Zhe Zhao, Junshan Zhang, Iman Soltani
Abstract:
Conventional flow matching and diffusion-based policies sample through iterative denoising from standard noise distributions (e.g., Gaussian), and require conditioning mechanisms to incorporate visual information during the generative process, incurring substantial time and memory overhead. To reduce the complexity, we develop VITA(VIsion-To-Action policy), a noise-free and conditioning-free policy learning framework that directly maps visual representations to latent actions using flow matching. VITA treats latent visual representations as the source of the flow, thus eliminating the need of conditioning. As expected, bridging vision and action is challenging, because actions are lower-dimensional, less structured, and sparser than visual representations; moreover, flow matching requires the source and target to have the same dimensionality. To overcome this, we introduce an action autoencoder that maps raw actions into a structured latent space aligned with visual latents, trained jointly with flow matching. To further prevent latent space collapse, we propose flow latent decoding, which anchors the latent generation process by backpropagating the action reconstruction loss through the flow matching ODE (ordinary differential equations) solving steps. We evaluate VITA on 8 simulation and 2 real-world tasks from ALOHA and Robomimic. VITA outperforms or matches state-of-the-art generative policies, while achieving 1.5-2.3x faster inference compared to conventional methods with conditioning. Project page: https://ucd-dare.github.io/VITA/
中文摘要:VITA是一种创新的视觉到动作策略框架,它通过流匹配直接将视觉表征映射到潜在动作,无需迭代去噪和条件机制,在实现卓越性能的同时显著提升了推理速度。
English Summary: VITA is a novel vision-to-action policy framework that eliminates iterative denoising and conditioning mechanisms by directly mapping visual representations to latent actions through flow matching, achieving superior performance with significantly faster inference speeds.
Authors:Alicia Durrer, Florentin Bieder, Paul Friedrich, Bjoern Menze, Philippe C. Cattin, Florian Kofler
Abstract:
Healthy tissue inpainting has significant applications, including the generation of pseudo-healthy baselines for tumor growth models and the facilitation of image registration. In previous editions of the BraTS Local Synthesis of Healthy Brain Tissue via Inpainting Challenge, denoising diffusion probabilistic models (DDPMs) demonstrated qualitatively convincing results but suffered from low sampling speed. To mitigate this limitation, we adapted a 2D image generation approach, combining DDPMs with generative adversarial networks (GANs) and employing a variance-preserving noise schedule, for the task of 3D inpainting. Our experiments showed that the variance-preserving noise schedule and the selected reconstruction losses can be effectively utilized for high-quality 3D inpainting in a few time steps without requiring adversarial training. We applied our findings to a different architecture, a 3D wavelet diffusion model (WDM3D) that does not include a GAN component. The resulting model, denoted as fastWDM3D, obtained a SSIM of 0.8571, a MSE of 0.0079, and a PSNR of 22.26 on the BraTS inpainting test set. Remarkably, it achieved these scores using only two time steps, completing the 3D inpainting process in 1.81 s per image. When compared to other DDPMs used for healthy brain tissue inpainting, our model is up to 800 x faster while still achieving superior performance metrics. Our proposed method, fastWDM3D, represents a promising approach for fast and accurate healthy tissue inpainting. Our code is available at https://github.com/AliciaDurrer/fastWDM3D.
Chinese: 本研究提出的fastWDM3D模型采用三维小波扩散方法,仅需两个时间步就能完成高质量健康脑组织修复,在保持优异性能指标的同时,比先前方法提速高达800倍。
English: The study introduces fastWDM3D, a 3D wavelet diffusion model that achieves high-quality healthy brain tissue inpainting with superior performance metrics while being up to 800 times faster than previous methods by completing the process in just two time steps.
Authors:Ahmed Bahloul, Simon Malberg
Abstract:
Modern language models address complex questions through chain-of-thought (CoT) reasoning (Wei et al., 2023) and retrieval augmentation (Lewis et al., 2021), yet struggle with error propagation and knowledge integration. Tree-structured reasoning methods, particularly the Probabilistic Tree-of-Thought (ProbTree)(Cao et al., 2023) framework, mitigate these issues by decomposing questions into hierarchical structures and selecting answers through confidence-weighted aggregation of parametric and retrieved knowledge (Yao et al., 2023). However, ProbTree's static implementation introduces two key limitations: (1) the reasoning tree is fixed during the initial construction phase, preventing dynamic adaptation to intermediate results, and (2) each node requires exhaustive evaluation of all possible solution strategies, creating computational inefficiency. We present a dynamic reinforcement learning (Sutton and Barto, 2018) framework that transforms tree-based reasoning into an adaptive process. Our approach incrementally constructs the reasoning tree based on real-time confidence estimates, while learning optimal policies for action selection (decomposition, retrieval, or aggregation). This maintains ProbTree's probabilistic rigor while improving both solution quality and computational efficiency through selective expansion and focused resource allocation. The work establishes a new paradigm for treestructured reasoning that balances the reliability of probabilistic frameworks with the flexibility required for real-world question answering systems. Code available at: https://github.com/ahmedehabb/From-Roots-to-Rewards-Dynamic-Tree-Reasoning-with-RL
中文: 现代语言模型采用树状推理提升复杂问题解答能力,但静态方法存在适应性差与效率低的缺陷,新提出的动态强化学习框架通过实时构建推理树并优化行动选择策略,在保持概率严谨性的同时显著提升了求解质量与计算效率。
English: Modern language models use tree-structured reasoning to improve complex question answering, but static methods face limitations in adaptability and efficiency, which a new dynamic reinforcement learning framework addresses by adaptively constructing reasoning trees and optimizing action selection for better performance and computational economy.
Authors:Ahmed Bahloul, Simon Malberg
Abstract:
Modern language models address complex questions through chain-of-thought (CoT) reasoning (Wei et al., 2023) and retrieval augmentation (Lewis et al., 2021), yet struggle with error propagation and knowledge integration. Tree-structured reasoning methods, particularly the Probabilistic Tree-of-Thought (ProbTree)(Cao et al., 2023) framework, mitigate these issues by decomposing questions into hierarchical structures and selecting answers through confidence-weighted aggregation of parametric and retrieved knowledge (Yao et al., 2023). However, ProbTree's static implementation introduces two key limitations: (1) the reasoning tree is fixed during the initial construction phase, preventing dynamic adaptation to intermediate results, and (2) each node requires exhaustive evaluation of all possible solution strategies, creating computational inefficiency. We present a dynamic reinforcement learning (Sutton and Barto, 2018) framework that transforms tree-based reasoning into an adaptive process. Our approach incrementally constructs the reasoning tree based on real-time confidence estimates, while learning optimal policies for action selection (decomposition, retrieval, or aggregation). This maintains ProbTree's probabilistic rigor while improving both solution quality and computational efficiency through selective expansion and focused resource allocation. The work establishes a new paradigm for treestructured reasoning that balances the reliability of probabilistic frameworks with the flexibility required for real-world question answering systems. Code available at: https://github.com/ahmedehabb/From-Roots-to-Rewards-Dynamic-Tree-Reasoning-with-RL
中文: 现代语言模型采用树状推理提升复杂问题解答能力,但静态方法存在适应性差与效率低的缺陷,新提出的动态强化学习框架通过实时构建推理树并优化行动选择策略,在保持概率严谨性的同时显著提升了求解质量与计算效率。
English: Modern language models use tree-structured reasoning to improve complex question answering, but static methods face limitations in adaptability and efficiency, which a new dynamic reinforcement learning framework addresses by adaptively constructing reasoning trees and optimizing action selection for better performance and computational economy.
Authors:Xiaohan Guo, Yusong Cai, Zejia Liu, Zhengning Wang, Lili Pan, Hongliang Li
Abstract:
Enabling large-scale generative models to continuously learn new visual concepts is essential for personalizing pre-trained models to meet individual user preferences. Existing approaches for continual visual concept learning are constrained by two fundamental challenges: catastrophic forgetting and parameter expansion. In this paper, we propose Redundancy-Removal Mixture of Experts (R^2MoE), a parameter-efficient framework for lifelong visual concept learning that effectively learns new concepts while incurring minimal parameter overhead. Our framework includes three key innovative contributions: First, we propose a mixture-of-experts framework with a routing distillation mechanism that enables experts to acquire concept-specific knowledge while preserving the gating network's routing capability, thereby effectively mitigating catastrophic forgetting. Second, we propose a strategy for eliminating redundant layer-wise experts that reduces the number of expert parameters by fully utilizing previously learned experts. Third, we employ a hierarchical local attention-guided inference approach to mitigate interference between generated visual concepts. Extensive experiments have demonstrated that our method generates images with superior conceptual fidelity compared to the state-of-the-art (SOTA) method, achieving an impressive 87.8\% reduction in forgetting rates and 63.3\% fewer parameters on the CustomConcept 101 dataset. Our code is available at {https://github.com/learninginvision/R2MoE}
中文: R^2MoE框架通过路由蒸馏、冗余专家消除和分层局部注意力机制,以最小参数量实现终身视觉概念学习,在显著降低遗忘率和参数量的同时达到最优性能。
English: The proposed R^2MoE framework enables lifelong visual concept learning with minimal parameter overhead by combining routing distillation, redundant expert elimination, and hierarchical local attention, achieving state-of-the-art performance with significantly reduced forgetting rates and parameters.
Authors:Han Zhang, Xiangde Luo, Yong Chen, Kang Li
Abstract:
Annotation variability remains a substantial challenge in medical image segmentation, stemming from ambiguous imaging boundaries and diverse clinical expertise. Traditional deep learning methods producing single deterministic segmentation predictions often fail to capture these annotator biases. Although recent studies have explored multi-rater segmentation, existing methods typically focus on a single perspective -- either generating a probabilistic ``gold standard'' consensus or preserving expert-specific preferences -- thus struggling to provide a more omni view. In this study, we propose DiffOSeg, a two-stage diffusion-based framework, which aims to simultaneously achieve both consensus-driven (combining all experts' opinions) and preference-driven (reflecting experts' individual assessments) segmentation. Stage I establishes population consensus through a probabilistic consensus strategy, while Stage II captures expert-specific preference via adaptive prompts. Demonstrated on two public datasets (LIDC-IDRI and NPC-170), our model outperforms existing state-of-the-art methods across all evaluated metrics. Source code is available at https://github.com/string-ellipses/DiffOSeg .
中文摘要:DiffOSeg是一种基于扩散模型的双阶段框架,通过建立群体共识和捕捉专家特定偏好,同时实现共识驱动和偏好驱动的医学图像分割,在公开数据集上表现优于现有方法。
English Summary: DiffOSeg is a diffusion-based framework that simultaneously achieves both consensus-driven and preference-driven medical image segmentation by establishing population consensus and capturing expert-specific preferences, outperforming existing methods on public datasets.
Authors:Lefei Shen, Mouxiang Chen, Han Fu, Xiaoxue Ren, Xiaoyun Joy Wang, Jianling Sun, Zhuo Li, Chenghao Liu
Abstract:
Transformer-based models have recently become dominant in Long-term Time Series Forecasting (LTSF), yet the variations in their architecture, such as encoder-only, encoder-decoder, and decoder-only designs, raise a crucial question: What Transformer architecture works best for LTSF tasks? However, existing models are often tightly coupled with various time-series-specific designs, making it difficult to isolate the impact of the architecture itself. To address this, we propose a novel taxonomy that disentangles these designs, enabling clearer and more unified comparisons of Transformer architectures. Our taxonomy considers key aspects such as attention mechanisms, forecasting aggregations, forecasting paradigms, and normalization layers. Through extensive experiments, we uncover several key insights: bi-directional attention with joint-attention is most effective; more complete forecasting aggregation improves performance; and the direct-mapping paradigm outperforms autoregressive approaches. Furthermore, our combined model, utilizing optimal architectural choices, consistently outperforms several existing models, reinforcing the validity of our conclusions. We hope these findings offer valuable guidance for future research on Transformer architectural designs in LTSF. Our code is available at https://github.com/HALF111/TSF_architecture.
中文: 本研究提出了一种分类法来解析长期时间序列预测中Transformer的架构设计,发现采用双向联合注意力、完整预测聚合和直接映射范式的组合模型性能最优,超越了现有方法。
English: This study introduces a taxonomy to disentangle Transformer architectural designs in Long-term Time Series Forecasting, revealing that bi-directional attention with joint-attention, complete forecasting aggregation, and direct-mapping paradigms yield superior performance, with the combined optimal model outperforming existing approaches.
Authors:Yi Xin, Le Zhuo, Qi Qin, Siqi Luo, Yuewen Cao, Bin Fu, Yangfan He, Hongsheng Li, Guangtao Zhai, Xiaohong Liu, Peng Gao
Abstract:
AutoRegressive (AR) models have made notable progress in image generation, with Masked AutoRegressive (MAR) models gaining attention for their efficient parallel decoding. However, MAR models have traditionally underperformed when compared to standard AR models. This study refines the MAR architecture to improve image generation quality. We begin by evaluating various image tokenizers to identify the most effective one. Subsequently, we introduce an improved Bidirectional LLaMA architecture by replacing causal attention with bidirectional attention and incorporating 2D RoPE, which together form our advanced model, MaskGIL. Scaled from 111M to 1.4B parameters, MaskGIL achieves a FID score of 3.71, matching state-of-the-art AR models in the ImageNet 256x256 benchmark, while requiring only 8 inference steps compared to the 256 steps of AR models. Furthermore, we develop a text-driven MaskGIL model with 775M parameters for generating images from text at various resolutions. Beyond image generation, MaskGIL extends to accelerate AR-based generation and enable real-time speech-to-image conversion. Our codes and models are available at https://github.com/synbol/MaskGIL.
中文摘要:本研究提出改进的掩码自回归模型MaskGIL,仅需8次推理步骤即可达到与自回归模型相当的图像生成质量,并扩展至文本到图像和语音到图像的实时转换应用。
English Summary: This study introduces MaskGIL, an enhanced Masked AutoRegressive model that achieves state-of-the-art image generation quality with only 8 inference steps while matching AR models' performance, and extends to text-to-image and speech-to-image applications.
Authors:Zihua Zhao, Feng Hong, Mengxi Chen, Pengyi Chen, Benyuan Liu, Jiangchao Yao, Ya Zhang, Yanfeng Wang
Abstract:
The remarkable success of contrastive-learning-based multimodal models has been greatly driven by training on ever-larger datasets with expensive compute consumption. Sample selection as an alternative efficient paradigm plays an important direction to accelerate the training process. However, recent advances on sample selection either mostly rely on an oracle model to offline select a high-quality coreset, which is limited in the cold-start scenarios, or focus on online selection based on real-time model predictions, which has not sufficiently or efficiently considered the noisy correspondence. To address this dilemma, we propose a novel Differential-Informed Sample Selection (DISSect) method, which accurately and efficiently discriminates the noisy correspondence for training acceleration. Specifically, we rethink the impact of noisy correspondence on contrastive learning and propose that the differential between the predicted correlation of the current model and that of a historical model is more informative to characterize sample quality. Based on this, we construct a robust differential-based sample selection and analyze its theoretical insights. Extensive experiments on three benchmark datasets and various downstream tasks demonstrate the consistent superiority of DISSect over current state-of-the-art methods. Source code is available at: https://github.com/MediaBrain-SJTU/DISSect.
中文: DISSect方法通过比较当前模型与历史模型的预测差异来精准识别噪声样本,在多个基准测试和下游任务中均展现出优于现有技术的训练加速效果。
English: The proposed DISSect method addresses noisy correspondence in contrastive learning by using the differential between current and historical model predictions to efficiently select high-quality samples, demonstrating superior performance across benchmarks and tasks.
Authors:Youssef Tawfilis, Hossam Amer, Minar El-Aasser, Tallal Elshabrawy
Abstract:
Federated Learning has gained increasing attention for its ability to enable multiple nodes to collaboratively train machine learning models without sharing their raw data. At the same time, Generative AI -- particularly Generative Adversarial Networks (GANs) -- have achieved remarkable success across a wide range of domains, such as healthcare, security, and Image Generation. However, training generative models typically requires large datasets and significant computational resources, which are often unavailable in real-world settings. Acquiring such resources can be costly and inefficient, especially when many underutilized devices -- such as IoT devices and edge devices -- with varying capabilities remain idle. Moreover, obtaining large datasets is challenging due to privacy concerns and copyright restrictions, as most devices are unwilling to share their data. To address these challenges, we propose a novel approach for decentralized GAN training that enables the utilization of distributed data and underutilized, low-capability devices while not sharing data in its raw form. Our approach is designed to tackle key challenges in decentralized environments, combining KLD-weighted Clustered Federated Learning to address the issues of data heterogeneity and multi-domain datasets, with Heterogeneous U-Shaped split learning to tackle the challenge of device heterogeneity under strict data sharing constraints -- ensuring that no labels or raw data, whether real or synthetic, are ever shared between nodes. Experimental results shows that our approach demonstrates consistent and significant improvements across key performance metrics, where it achieves 1.1x -- 2.2x higher image generation scores, an average 10% boost in classification metrics (up to 50% in multi-domain non-IID settings), in much lower latency compared to several benchmarks. Find our code at https://github.com/youssefga28/HuSCF-GAN.
中文: 该研究提出了一种去中心化的GAN训练方法,通过聚类联邦学习和异构分割学习技术,在不共享原始数据的情况下利用分布式数据和低性能设备,显著提升了图像生成与分类的性能指标。
English: The proposed decentralized GAN training method leverages clustered federated learning and split learning to utilize distributed data and low-capacity devices without sharing raw data, achieving significant performance improvements in image generation and classification.
Authors:Yucheng Tang, Yunguan Fu, Weixi Yi, Yipei Wang, Daniel C. Alexander, Rhodri Davies, Yipeng Hu
Abstract:
Multimodal large language models (MLLMs) can process and integrate information from multimodality sources, such as text and images. However, interrelationship among input modalities, uncertainties due to individual uni-modal data and potential clinical applications following such an uncertainty decomposition are yet fully understood in the context of large-scale MLLMs. In this work, we propose a multimodal uncertainty propagation model (MUPM) based on uncertainty propagation, to characterise the relationship among the uncertainties arising from image-only, text-only, and joint image-text variations in MLLM inputs. Using real clinical data consisting of cardiac MR scans and digital health records, we describe that MUPMs can be optimised robustly with a few samples. We then show that the fitted MUPMs are generalisable across different input data distributions and, perhaps surprisingly, across different downstream tasks. Such a transferability may be explained by the shared pretraining, comparatively light MLLM fine-tuning, along with the low-dimensional nature of the MUPMs. More importantly, this learned transferability, quantifying the relationship between these uncertainties, led to direct clinical applications in which uncertainties may be estimated and thus analysed robustly for varying data or even a novel set of cardiac disease prediction tasks. In addition, we show experimentally the efficiency in multimodal data required for estimating the overall uncertainty and its ability to identify redundant factors, both of which are considered practical yet clinically useful applications with the proposed MUPMs. Codes are available at https://github.com/yucheng722/MUPM.
中文: 本研究提出了一种多模态不确定性传播模型(MUPM),能有效表征并量化多模态大语言模型中图像与文本输入的不确定性,在数据分布和临床任务间展现出强大泛化能力,并成功应用于心脏疾病预测等实际场景。
English: This study introduces a multimodal uncertainty propagation model (MUPM) that effectively characterizes and quantifies uncertainties from individual and combined image-text inputs in MLLMs, demonstrating robust generalizability across data distributions and clinical tasks with practical applications in cardiac disease prediction.
Authors:Caixia Dong, Duwei Dai, Xinyi Han, Fan Liu, Xu Yang, Zongfang Li, Songhua Xu
Abstract:
Accurate coronary artery segmentation is critical for computeraided diagnosis of coronary artery disease (CAD), yet it remains challenging due to the small size, complex morphology, and low contrast with surrounding tissues. To address these challenges, we propose a novel segmentation framework that leverages the power of vision foundation models (VFMs) through a parallel encoding architecture. Specifically, a vision transformer (ViT) encoder within the VFM captures global structural features, enhanced by the activation of the final two ViT blocks and the integration of an attention-guided enhancement (AGE) module, while a convolutional neural network (CNN) encoder extracts local details. These complementary features are adaptively fused using a cross-branch variational fusion (CVF) module, which models latent distributions and applies variational attention to assign modality-specific weights. Additionally, we introduce an evidential-learning uncertainty refinement (EUR) module, which quantifies uncertainty using evidence theory and refines uncertain regions by incorporating multi-scale feature aggregation and attention mechanisms, further enhancing segmentation accuracy. Extensive evaluations on one in-house and two public datasets demonstrate that the proposed framework significantly outperforms state-of-the-art methods, achieving superior performance in accurate coronary artery segmentation and showcasing strong generalization across multiple datasets. The code is available at https://github.com/d1c2x3/CAseg.
中文摘要:本文提出了一种新颖的冠状动脉分割框架,通过并行编码架构融合视觉基础模型,并引入不确定性优化模块,在多个数据集上实现了卓越的分割精度和泛化能力。
English Summary: This paper introduces a novel coronary artery segmentation framework that combines vision foundation models with parallel encoding and uncertainty refinement, achieving superior accuracy and generalization across multiple datasets.
Authors:Dongyeun Lee, Jiwan Hur, Hyounguk Shon, Jae Young Lee, Junmo Kim
Abstract:
Diffusion models have achieved remarkable success in image generation but come with significant computational costs, posing challenges for deployment in resource-constrained environments. Recent post-training quantization (PTQ) methods have attempted to mitigate this issue by focusing on the iterative nature of diffusion models. However, these approaches often overlook outliers, leading to degraded performance at low bit-widths. In this paper, we propose a DMQ which combines Learned Equivalent Scaling (LES) and channel-wise Power-of-Two Scaling (PTS) to effectively address these challenges. Learned Equivalent Scaling optimizes channel-wise scaling factors to redistribute quantization difficulty between weights and activations, reducing overall quantization error. Recognizing that early denoising steps, despite having small quantization errors, crucially impact the final output due to error accumulation, we incorporate an adaptive timestep weighting scheme to prioritize these critical steps during learning. Furthermore, identifying that layers such as skip connections exhibit high inter-channel variance, we introduce channel-wise Power-of-Two Scaling for activations. To ensure robust selection of PTS factors even with small calibration set, we introduce a voting algorithm that enhances reliability. Extensive experiments demonstrate that our method significantly outperforms existing works, especially at low bit-widths such as W4A6 (4-bit weight, 6-bit activation) and W4A8, maintaining high image generation quality and model stability. The code is available at https://github.com/LeeDongYeun/dmq.
中文: 本文提出DMQ方法,结合学习等效缩放和通道级二次幂缩放技术,通过自适应时间步加权有效处理扩散模型中的异常值和误差累积问题,在低比特量化下显著优于现有方法并保持图像生成质量。
English: This paper introduces DMQ, a novel quantization method combining Learned Equivalent Scaling and channel-wise Power-of-Two Scaling with adaptive timestep weighting to effectively handle outliers and error accumulation in diffusion models, achieving superior performance at low bit-widths while maintaining generation quality.
Authors:Tomohiro Suzuki, Ryota Tanaka, Calvin Yeung, Keisuke Fujii
Abstract:
Monocular 3D pose estimation is a promising, flexible alternative to costly motion capture systems for sports analysis. However, its practical application is hindered by two factors: a lack of realistic sports datasets and unclear reliability for sports tasks. To address these challenges, we introduce the AthleticsPose dataset, a new public dataset featuring ``real'' motions captured from 23 athletes performing various athletics events on an athletic field. Using this dataset, we trained a representative 3D pose estimation model and performed a comprehensive evaluation. Our results show that the model trained on AthleticsPose significantly outperforms a baseline model trained on an imitated sports motion dataset, reducing MPJPE by approximately 75 %. These results show the importance of training on authentic sports motion data, as models based on imitated motions do not effectively transfer to real-world motions. Further analysis reveals that estimation accuracy is sensitive to camera view and subject scale. In case studies of kinematic indicators, the model demonstrated the potential to capture individual differences in knee angles but struggled with higher-speed metrics, such as knee-drive velocity, due to prediction biases. This work provides the research community with a valuable dataset and clarifies the potential and practical limitations of using monocular 3D pose estimation for sports motion analysis. Our dataset, code, and checkpoints are available at https://github.com/SZucchini/AthleticsPose.
中文: 单目三维姿态估计为运动分析提供了一种经济灵活的替代方案,但受限于缺乏真实数据集和可靠性不明确的问题;AthleticsPose数据集通过提供真实运动数据显著提升了模型性能,同时揭示了该方法在捕捉运动特征方面的潜力与局限。
English: Monocular 3D pose estimation offers a cost-effective alternative for sports analysis but faces limitations due to the scarcity of realistic datasets and unclear reliability, which the AthleticsPose dataset addresses by providing authentic motion data that significantly improves model performance and highlights its potential and limitations in capturing sports movements.
Authors:Pavel Snopov, Oleg R. Musin
Abstract:
This study explores novel activation functions that enhance the ability of neural networks to manipulate data topology during training. Building on the limitations of traditional activation functions like $\mathrm{ReLU}$, we propose $\mathrm{SmoothSplit}$ and $\mathrm{ParametricSplit}$, which introduce topology "cutting" capabilities. These functions enable networks to transform complex data manifolds effectively, improving performance in scenarios with low-dimensional layers. Through experiments on synthetic and real-world datasets, we demonstrate that $\mathrm{ParametricSplit}$ outperforms traditional activations in low-dimensional settings while maintaining competitive performance in higher-dimensional ones. Our findings highlight the potential of topology-aware activation functions in advancing neural network architectures. The code is available via https://github.com/Snopoff/Topology-Aware-Activations.
中文摘要:本研究提出具有拓扑感知能力的激活函数SmoothSplit和ParametricSplit,使神经网络能够在训练中有效处理数据拓扑结构,在低维场景中表现优异,同时在高维环境中保持竞争力。
English Summary: This research introduces topology-aware activation functions, SmoothSplit and ParametricSplit, which enable neural networks to effectively manipulate data topology during training, demonstrating superior performance in low-dimensional settings while maintaining competitiveness in higher dimensions.
Authors:Shiqi Huang, Shuting He, Huaiyuan Qin, Bihan Wen
Abstract:
Most existing remote sensing instance segmentation approaches are designed for close-vocabulary prediction, limiting their ability to recognize novel categories or generalize across datasets. This restricts their applicability in diverse Earth observation scenarios. To address this, we introduce open-vocabulary (OV) learning for remote sensing instance segmentation. While current OV segmentation models perform well on natural image datasets, their direct application to remote sensing faces challenges such as diverse landscapes, seasonal variations, and the presence of small or ambiguous objects in aerial imagery. To overcome these challenges, we propose $\textbf{SCORE}$ ($\textbf{S}$cene $\textbf{C}$ontext matters in $\textbf{O}$pen-vocabulary $\textbf{RE}$mote sensing instance segmentation), a framework that integrates multi-granularity scene context, i.e., regional context and global context, to enhance both visual and textual representations. Specifically, we introduce Region-Aware Integration, which refines class embeddings with regional context to improve object distinguishability. Additionally, we propose Global Context Adaptation, which enriches naive text embeddings with remote sensing global context, creating a more adaptable and expressive linguistic latent space for the classifier. We establish new benchmarks for OV remote sensing instance segmentation across diverse datasets. Experimental results demonstrate that, our proposed method achieves SOTA performance, which provides a robust solution for large-scale, real-world geospatial analysis. Our code is available at https://github.com/HuangShiqi128/SCORE.
中文摘要:提出的SCORE框架通过整合多粒度场景上下文来增强视觉和文本表示,解决了现有遥感实例分割方法的局限性,在面向多样化地球观测场景的开放词汇学习中实现了最先进的性能。
English Summary: The proposed SCORE framework addresses the limitations of existing remote sensing instance segmentation methods by integrating multi-granularity scene context to enhance both visual and textual representations, achieving state-of-the-art performance in open-vocabulary learning for diverse Earth observation scenarios.
Authors:Ziyi Wang, Zhi Gao, Jin Chen, Qingjie Zhao, Xinxiao Wu, Jiebo Luo
Abstract:
Domain generalization (DG) aims to learn a model from source domains and apply it to unseen target domains with out-of-distribution data. Owing to CLIP's strong ability to encode semantic concepts, it has attracted increasing interest in domain generalization. However, CLIP often struggles to focus on task-relevant regions across domains, i.e., domain-invariant regions, resulting in suboptimal performance on unseen target domains. To address this challenge, we propose an attention-refocusing scheme, called Simulate, Refocus and Ensemble (SRE), which learns to reduce the domain shift by aligning the attention maps in CLIP via attention refocusing. SRE first simulates domain shifts by performing augmentation on the source data to generate simulated target domains. SRE then learns to reduce the domain shifts by refocusing the attention in CLIP between the source and simulated target domains. Finally, SRE utilizes ensemble learning to enhance the ability to capture domain-invariant attention maps between the source data and the simulated target data. Extensive experimental results on several datasets demonstrate that SRE generally achieves better results than state-of-the-art methods. The code is available at: https://github.com/bitPrincy/SRE-DG.
Chinese: 针对CLIP在领域泛化中难以聚焦跨领域不变区域的问题,本文提出的SRE方法通过模拟领域偏移和注意力重聚焦机制,结合集成学习显著提升了模型在未知目标领域的性能表现。
English: To address CLIP's limitation in focusing on domain-invariant regions for domain generalization, the proposed SRE method employs attention refocusing through domain shift simulation and ensemble learning, demonstrating superior performance across multiple datasets.
Authors:Zhiwei Liu, Jielin Qiu, Shiyu Wang, Jianguo Zhang, Zuxin Liu, Roshan Ram, Haolin Chen, Weiran Yao, Shelby Heinecke, Silvio Savarese, Huan Wang, Caiming Xiong
Abstract:
The rapid rise of Large Language Models (LLMs)-based intelligent agents underscores the need for robust, scalable evaluation frameworks. Existing methods rely on static benchmarks and labor-intensive data collection, limiting practical assessment. We introduce MCPEval, an open-source Model Context Protocol (MCP)-based framework that automates end-to-end task generation and deep evaluation of LLM agents across diverse domains. MCPEval standardizes metrics, seamlessly integrates with native agent tools, and eliminates manual effort in building evaluation pipelines. Empirical results across five real-world domains show its effectiveness in revealing nuanced, domain-specific performance. We publicly release MCPEval https://github.com/SalesforceAIResearch/MCPEval to promote reproducible and standardized LLM agent evaluation.
中文摘要:MCPEval是一个基于模型上下文协议的开源框架,能自动化评估多领域大语言模型智能代理,通过标准化指标和消除人工操作,有效揭示领域特异性表现。
English Summary: MCPEval is an open-source framework that automates comprehensive evaluation of LLM agents across multiple domains, standardizing metrics and eliminating manual effort while demonstrating effectiveness in revealing domain-specific performance.
Authors:Hoang-Son Vo, Quang-Vinh Nguyen, Seungwon Kim, Hyung-Jeong Yang, Soonja Yeom, Soo-Hyung Kim
Abstract:
Audio-driven talking head generation requires precise synchronization between facial animations and audio signals. This paper introduces ATL-Diff, a novel approach addressing synchronization limitations while reducing noise and computational costs. Our framework features three key components: a Landmark Generation Module converting audio to facial landmarks, a Landmarks-Guide Noise approach that decouples audio by distributing noise according to landmarks, and a 3D Identity Diffusion network preserving identity characteristics. Experiments on MEAD and CREMA-D datasets demonstrate that ATL-Diff outperforms state-of-the-art methods across all metrics. Our approach achieves near real-time processing with high-quality animations, computational efficiency, and exceptional preservation of facial nuances. This advancement offers promising applications for virtual assistants, education, medical communication, and digital platforms. The source code is available at: \href{https://github.com/sonvth/ATL-Diff}{https://github.com/sonvth/ATL-Diff}
中文: 本文提出ATL-Diff这一创新方法,通过关键点引导噪声分布和3D身份扩散网络,在提升音频驱动说话头部生成的同步性、降低噪声和计算成本的同时,有效保留面部特征,实现了卓越性能与近实时处理。
English: This paper presents ATL-Diff, an innovative audio-driven talking head generation method that enhances synchronization, reduces noise and computational costs, and preserves facial identity through landmark-guided noise distribution and a 3D identity diffusion network, achieving superior performance and near real-time processing.
Authors:Qianru Zhang, Chenglei Yu, Haixin Wang, Yudong Yan, Yuansheng Cao, Siu-Ming Yiu, Tailin Wu, Hongzhi Yin
Abstract:
Time series prediction, a crucial task across various domains, faces significant challenges due to the inherent complexities of time series data, including non-stationarity, multi-scale periodicity, and transient dynamics, particularly when tackling long-term predictions. While Transformer-based architectures have shown promise, their quadratic complexity with sequence length hinders their efficiency for long-term predictions. Recent advancements in State-Space Models, such as Mamba, offer a more efficient alternative for long-term modeling, but they cannot capture multi-scale periodicity and transient dynamics effectively. Meanwhile, they are susceptible to data noise issues in time series. This paper proposes a novel framework, FLDmamba (Fourier and Laplace Transform Decomposition Mamba), addressing these limitations. FLDmamba leverages the strengths of both Fourier and Laplace transforms to effectively capture both multi-scale periodicity, transient dynamics within time series data, and improve the robustness of the model to the data noise issue. Our extensive experiments demonstrate that FLDmamba achieves superior performance on time series prediction benchmarks, outperforming both Transformer-based and other Mamba-based architectures. To promote the reproducibility of our method, we have made both the code and data accessible via the following URL:{\href{https://github.com/AI4Science-WestlakeU/FLDmamba}{https://github.com/AI4Science-WestlakeU/\model}.
中文: 本文提出FLDmamba框架,结合傅里叶和拉普拉斯变换有效捕捉时间序列中的多尺度周期性和瞬态动态,增强了对数据噪声的鲁棒性,在基准测试中超越了现有的Transformer和Mamba模型。
English: This paper introduces FLDmamba, a novel framework that combines Fourier and Laplace transforms to effectively capture multi-scale periodicity and transient dynamics in time series data, enhancing robustness against noise and outperforming existing Transformer and Mamba models in benchmarks.
Authors:Jikai Wang, Yunqi Cheng, Zonghai Chen
Abstract:
Though visual and repeat navigation is a convenient solution for mobile robot self-navigation, achieving balance between efficiency and robustness in task environment still remains challenges. In this paper, we propose a novel visual and repeat robotic autonomous navigation method that requires no accurate localization and dense reconstruction modules, which makes our system featured by lightweight and robustness. Firstly, feature flow is introduced and we develop a qualitative mapping between feature flow and robot's motion, in which feature flow is defined as pixel location bias between matched features. Based on the mapping model, the map outputted by the teaching phase is represented as a keyframe graph, in which the feature flow on the edge encodes the relative motion between adjacent keyframes. Secondly, the visual repeating navigation is essentially modeled as a feature flow minimization problem between current observation and the map keyframe. To drive the robot to consistently reduce the feature flow between current frame and map keyframes without accurate localization, a probabilistic motion planning is developed based on our qualitative feature flow-motion mapping indicator. Extensive experiments using our mobile platform demonstrates that our proposed method is lightweight, robust, and superior to baselines. The source code has been made public at https://github.com/wangjks/FFI-VTR to benefit the community.
中文摘要:本文提出了一种轻量级且鲁棒的视觉导航方法,通过特征流映射和概率运动规划,使移动机器人在无需精确定位和密集重建的情况下实现高效导航。
English Summary: This paper introduces a lightweight and robust visual navigation method for mobile robots that uses feature flow mapping and probabilistic motion planning to enable efficient navigation without precise localization or dense reconstruction.
Authors:Junjie Gao, Runze Liu, Yingzhe Peng, Shujian Yang, Jin Zhang, Kai Yang, Zhiyuan You
Abstract:
Document quality assessment is critical for a wide range of applications including document digitization, OCR, and archival. However, existing approaches often struggle to provide accurate and robust quality scores, limiting their applicability in practical scenarios. With the rapid progress in Multi-modal Large Language Models (MLLMs), recent MLLM-based methods have achieved remarkable performance in image quality assessment. In this work, we extend this success to the document domain by adapting DeQA-Score, a state-of-the-art MLLM-based image quality scorer, for document quality assessment. We propose DeQA-Doc, a framework that leverages the visual language capabilities of MLLMs and a soft label strategy to regress continuous document quality scores. To adapt DeQA-Score to DeQA-Doc, we adopt two complementary solutions to construct soft labels without the variance information. Also, we relax the resolution constrains to support the large resolution of document images. Finally, we introduce ensemble methods to further enhance the performance. Extensive experiments demonstrate that DeQA-Doc significantly outperforms existing baselines, offering accurate and generalizable document quality assessment across diverse degradation types. Codes and model weights are available in https://github.com/Junjie-Gao19/DeQA-Doc.
中文摘要:本文提出了DeQA-Doc框架,通过采用多模态大语言模型的视觉语言能力和软标签策略来评估文档质量,在多种退化类型上显著超越了现有基线方法。
English Summary: This paper introduces DeQA-Doc, a framework that adapts a state-of-the-art multimodal large language model for document quality assessment using soft labels and ensemble methods, significantly outperforming existing baselines across various degradation types.
Authors:Peijun Wang, Jinhua Zhao
Abstract:
Small object detection remains a challenging problem in the field of object detection. To address this challenge, we propose an enhanced YOLOv8-based model, SOD-YOLO. This model integrates an ASF mechanism in the neck to enhance multi-scale feature fusion, adds a Small Object Detection Layer (named P2) to provide higher-resolution feature maps for better small object detection, and employs Soft-NMS to refine confidence scores and retain true positives. Experimental results demonstrate that SOD-YOLO significantly improves detection performance, achieving a 36.1% increase in mAP$_{50:95}$ and 20.6% increase in mAP$_{50}$ on the VisDrone2019-DET dataset compared to the baseline model. These enhancements make SOD-YOLO a practical and efficient solution for small object detection in UAV imagery. Our source code, hyper-parameters, and model weights are available at https://github.com/iamwangxiaobai/SOD-YOLO.
Chinese: 提出的SOD-YOLO模型通过融合多尺度特征、增加高分辨率检测层和使用Soft-NMS,显著提升了小目标检测性能,在VisDrone2019-DET数据集上mAP$_{50:95}$和mAP$_{50}$分别提高了36.1%和20.6%。
English: The proposed SOD-YOLO model enhances small object detection by integrating multi-scale feature fusion, adding a high-resolution detection layer, and employing Soft-NMS, achieving significant performance improvements of 36.1% in mAP$_{50:95}$ and 20.6% in mAP$_{50}$ on the VisDrone2019-DET dataset.
Authors:Abraham Toluase Owodunni, Orevaoghene Ahia, Sachin Kumar
Abstract:
Language models (LMs) are challenging to adapt to new data distributions by simple finetuning. This is due to the rigidity of their subword tokenizers, which typically remain unchanged during adaptation. This inflexibility often leads to inefficient tokenization, causing overfragmentation of out-of-distribution domains, unseen languages, or scripts. In this work, we develop byte-level LMs with learnable tokenizers to make tokenization adaptive. Our models include a submodule that learns to predict boundaries between the input byte sequence, encoding it into variable-length segments. Existing tokenizer-free methods train this boundary predictor using an auxiliary loss that enforces a fixed compression rate across the training corpus, introducing a new kind of rigidity. We propose FLEXITOKENS, a simplified training objective that enables significantly greater flexibility during adaptation. Evaluating across multiple multilingual benchmarks, morphologically diverse tasks, and domains, we demonstrate that FLEXITOKENS consistently reduces token over-fragmentation and achieves up to 10% improvements on downstream task performance compared to subword and other gradient-based tokenizers. Code and data for our experiments will be released at https://github.com/owos/flexitokens
中文: 本研究提出FLEXITOKENS方法,通过可学习的分词器实现字节级语言模型的自适应分词,有效减少过度碎片化,相比传统子词分词器在多语言和多样化任务中性能提升高达10%。
English: The study introduces FLEXITOKENS, a method that enables adaptive tokenization in byte-level language models to reduce overfragmentation, achieving up to 10% performance gains on multilingual and diverse tasks compared to rigid subword tokenizers.
Authors:Abraham Toluwase Owodunni, Orevaoghene Ahia, Sachin Kumar
Abstract:
Language models (LMs) are challenging to adapt to new data distributions by simple finetuning. This is due to the rigidity of their subword tokenizers, which typically remain unchanged during adaptation. This inflexibility often leads to inefficient tokenization, causing overfragmentation of out-of-distribution domains, unseen languages, or scripts. In this work, we develop byte-level LMs with learnable tokenizers to make tokenization adaptive. Our models include a submodule that learns to predict boundaries between the input byte sequence, encoding it into variable-length segments. Existing tokenizer-free methods train this boundary predictor using an auxiliary loss that enforces a fixed compression rate across the training corpus, introducing a new kind of rigidity. We propose FLEXITOKENS, a simplified training objective that enables significantly greater flexibility during adaptation. Evaluating across multiple multilingual benchmarks, morphologically diverse tasks, and domains, we demonstrate that FLEXITOKENS consistently reduces token over-fragmentation and achieves up to 10% improvements on downstream task performance compared to subword and other gradient-based tokenizers. Code and data for our experiments will be released at https://github.com/owos/flexitokens
中文: 本研究提出FLEXITOKENS方法,通过可学习的分词器实现字节级语言模型的自适应分词,有效减少过度碎片化,相比传统子词分词器在多语言和多样化任务中性能提升高达10%。
English: The study introduces FLEXITOKENS, a method that enables adaptive tokenization in byte-level language models to reduce overfragmentation, achieving up to 10% performance gains on multilingual and diverse tasks compared to rigid subword tokenizers.
Authors:Rajesh Sureddi, Saman Zadtootaghaj, Nabajeet Barman, Alan C. Bovik
Abstract:
Image Quality Assessment (IQA) models aim to predict perceptual image quality in alignment with human judgments. No-Reference (NR) IQA remains particularly challenging due to the absence of a reference image. While deep learning has significantly advanced this field, a major hurdle in developing NR-IQA models is the limited availability of subjectively labeled data. Most existing deep learning-based NR-IQA approaches rely on pre-training on large-scale datasets before fine-tuning for IQA tasks. To further advance progress in this area, we propose a novel approach that constructs a custom dataset using a limited number of reference content images and introduces a no-reference IQA model that incorporates both content and quality features for perceptual quality prediction. Specifically, we train a quality-aware model using contrastive triplet-based learning, enabling efficient training with fewer samples while achieving strong generalization performance across publicly available datasets. Our repository is available at https://github.com/rajeshsureddi/triqa.
Chinese: 本研究提出了一种新的无参考图像质量评估模型,通过对比三元组学习和定制数据集,利用有限样本有效预测感知质量,并在公开数据集上展现出强大的泛化性能。
English: This study introduces a novel no-reference image quality assessment (NR-IQA) model that leverages contrastive triplet-based learning and a custom dataset to effectively predict perceptual quality with limited training samples, demonstrating strong generalization across public datasets.
Authors:Christina Thrainer, Md Meftahul Ferdaus, Mahdi Abdelguerfi, Christian Guetl, Steven Sloan, Kendall N. Niles, Ken Pathak
Abstract:
Automated structural defect segmentation in civil infrastructure faces a critical challenge: achieving high accuracy while maintaining computational efficiency for real-time deployment. This paper presents FORTRESS (Function-composition Optimized Real-Time Resilient Structural Segmentation), a new architecture that balances accuracy and speed by using a special method that combines depthwise separable convolutions with adaptive Kolmogorov-Arnold Network integration. FORTRESS incorporates three key innovations: a systematic depthwise separable convolution framework achieving a 3.6x parameter reduction per layer, adaptive TiKAN integration that selectively applies function composition transformations only when computationally beneficial, and multi-scale attention fusion combining spatial, channel, and KAN-enhanced features across decoder levels. The architecture achieves remarkable efficiency gains with 91% parameter reduction (31M to 2.9M), 91% computational complexity reduction (13.7 to 1.17 GFLOPs), and 3x inference speed improvement while delivering superior segmentation performance. Evaluation on benchmark infrastructure datasets demonstrates state-of-the-art results with an F1- score of 0.771 and a mean IoU of 0.677, significantly outperforming existing methods including U-Net, SA-UNet, and U- KAN. The dual optimization strategy proves essential for optimal performance, establishing FORTRESS as a robust solution for practical structural defect segmentation in resource-constrained environments where both accuracy and computational efficiency are paramount. Comprehensive architectural specifications are provided in the Supplemental Material. Source code is available at URL: https://github.com/faeyelab/fortress-paper-code.
中文: 本文提出FORTRESS架构,通过深度可分离卷积与自适应KAN集成,在保持高精度的同时大幅降低参数与计算量,实现了实时结构缺陷分割的最优性能。
English: This paper introduces FORTRESS, an efficient architecture for structural defect segmentation that significantly reduces parameters and computational complexity while achieving superior accuracy and real-time performance.
Authors:Mihran Miroyan, Rose Niousha, Joseph E. Gonzalez, Gireeja Ranade, Narges Norouzi
Abstract:
Large Language Models (LLMs) have shown strong performance on programming tasks, but can they generate student-like code like real students - imperfect, iterative, and stylistically diverse? We present ParaStudent, a systematic study of LLM-based "student-like" code generation in an introductory programming course setting. Using a dataset of timestamped student submissions across multiple semesters, we design low- and high-resolution experiments to model student progress and evaluate code outputs along semantic, functional, and stylistic dimensions. Our results show that fine-tuning significantly improves alignment with real student trajectories and captures error patterns, incremental improvements, and stylistic variations more faithfully. This study shows that modeling realistic student code requires capturing learning dynamics through context-aware generation, temporal modeling, and multi-dimensional evaluation. Code for experiments and evaluation is available at https://github.com/mmiroyan/ParaStudent.
中文摘要:本研究提出ParaStudent,证明经过微调的大语言模型能通过时序建模和多维评估,有效模拟学生编程中的学习动态、错误模式和风格变化,生成更贴近真实学生的代码。
English Summary: This study introduces ParaStudent, demonstrating that fine-tuned Large Language Models can generate student-like code by capturing learning dynamics, error patterns, and stylistic variations through temporal modeling and multi-dimensional evaluation.
Authors:Athanasios Papastathopoulos-Katsaros, Alexandra Stavrianidi, Zhandong Liu
Abstract:
Physics-Informed Neural Networks (PINNs) are deep learning models that incorporate the governing physical laws of a system into the learning process, making them well-suited for solving complex scientific and engineering problems. Recently, PINNs have gained widespread attention as a powerful framework for combining physical principles with data-driven modeling to improve prediction accuracy. Despite their successes, however, PINNs often exhibit poor extrapolation performance outside the training domain and are highly sensitive to the choice of activation functions (AFs). In this paper, we introduce a transfer learning (TL) method to improve the extrapolation capability of PINNs. Our approach applies transfer learning (TL) within an extended training domain, using only a small number of carefully selected collocation points. Additionally, we propose an adaptive AF that takes the form of a linear combination of standard AFs, which improves both the robustness and accuracy of the model. Through a series of experiments, we demonstrate that our method achieves an average of 40% reduction in relative L2 error and an average of 50% reduction in mean absolute error in the extrapolation domain, all without a significant increase in computational cost. The code is available at https://github.com/LiuzLab/PINN-extrapolation .
中文: 本文提出一种迁移学习方法与自适应激活函数,有效提升了物理信息神经网络的泛化能力,在未显著增加计算成本的情况下大幅降低了外推误差。
English: This paper introduces a transfer learning method and an adaptive activation function to enhance the extrapolation performance of Physics-Informed Neural Networks, achieving significant error reductions without substantially increasing computational costs.
Authors:Said Ohamouddou, Abdellatif El Afia, Hanaa El Afia, Raddouane Chiheb
Abstract:
Tree species classification from terrestrial LiDAR point clouds is challenging because of the complex multi-scale geometric structures in forest environments. Existing approaches using multi-scale dynamic graph convolutional neural networks (MS-DGCNN) employ parallel multi-scale processing, which fails to capture the semantic relationships between the hierarchical levels of the tree architecture. We present MS-DGCNN++, a hierarchical multiscale fusion dynamic graph convolutional network that uses semantically meaningful feature extraction at local, branch, and canopy scales with cross-scale information propagation. Our method employs scale-specific feature engineering, including standard geometric features for the local scale, normalized relative vectors for the branch scale, and distance information for the canopy scale. This hierarchical approach replaces uniform parallel processing with semantically differentiated representations that are aligned with the natural tree structure. Under the same proposed tree species data augmentation strategy for all experiments, MS-DGCNN++ achieved an accuracy of 94.96 \% on STPCTLS, outperforming DGCNN, MS-DGCNN, and the state-of-the-art model PPT. On FOR-species20K, it achieves 67.25\% accuracy (6.1\% improvement compared to MS-DGCNN). For standard 3D object recognition, our method outperformed DGCNN and MS-DGCNN with overall accuracies of 93.15\% on ModelNet40 and 94.05\% on ModelNet10. With lower parameters and reduced complexity compared to state-of-the-art transformer approaches, our method is suitable for resource-constrained applications while maintaining a competitive accuracy. Beyond tree classification, the method generalizes to standard 3D object recognition, establishing it as a versatile solution for diverse point cloud processing applications. The implementation code is publicly available at https://github.com/said-ohamouddou/MS-DGCNN2.
Chinese: MS-DGCNN++ 提出了一种层次化多尺度融合网络,通过捕捉局部、枝干和冠层尺度的语义关系,提升了从LiDAR数据中进行树种分类的准确性,并以更低的复杂度实现了卓越性能。
English: MS-DGCNN++ introduces a hierarchical multiscale fusion network that enhances tree species classification from LiDAR data by capturing semantic relationships across local, branch, and canopy scales, achieving superior accuracy with reduced complexity.
Authors:Gen Luo, Wenhan Dou, Wenhao Li, Zhaokai Wang, Xue Yang, Changyao Tian, Hao Li, Weiyun Wang, Wenhai Wang, Xizhou Zhu, Yu Qiao, Jifeng Dai
Abstract:
This paper focuses on monolithic Multimodal Large Language Models (MLLMs), which integrate visual encoding and language decoding into a single model. Existing structures and pre-training strategies for monolithic MLLMs often suffer from unstable optimization and catastrophic forgetting. To address these challenges, our key idea is to embed a new visual parameter space into a pre-trained LLM, enabling stable learning of visual knowledge from noisy data via delta tuning. Based on this principle, we first introduce Mono-InternVL, an advanced monolithic MLLM that incorporates a set of visual experts through a multimodal mixture-of-experts architecture. In addition, we design an innovative Endogenous Visual Pre-training (EViP) for Mono-InternVL to maximize its visual capabilities via progressive learning. Mono-InternVL achieves competitive performance against existing MLLMs but also leads to relatively expensive data cost. Therefore, we further present Mono-InternVL-1.5, a cheaper and stronger monolithic MLLM equipped with an improved EViP (EViP++). EViP++ introduces additional visual attention experts to Mono-InternVL-1.5 and re-organizes the pre-training process in an efficient manner. During inference, it includes a fused CUDA kernel to speed up its MoE operations. With these designs, Mono-InternVL-1.5 significantly reduces training and inference costs, while still maintaining competitive performance with Mono-InternVL. To evaluate our approach, we conduct extensive experiments across 15 benchmarks. Results demonstrate that Mono-InternVL outperforms existing monolithic MLLMs on 12 out of 15 benchmarks, e.g., +114-point improvement over Emu3 on OCRBench. Compared to its modular counterpart, i.e., InternVL-1.5, Mono-InternVL-1.5 achieves similar multimodal performance while reducing first-token latency by up to 69%. Code and models are released at https://github.com/OpenGVLab/Mono-InternVL.
中文: 本文提出Mono-InternVL这一单模态多模态大语言模型,通过创新的视觉参数空间和内生视觉预训练方法解决优化不稳定与灾难性遗忘问题,在多个基准测试中表现优异并显著降低延迟。
English: This paper introduces Mono-InternVL, a monolithic Multimodal Large Language Model that addresses optimization instability and catastrophic forgetting through a novel visual parameter space and Endogenous Visual Pre-training, achieving superior performance on multiple benchmarks while reducing latency.
Authors:Richard Marcus, Marc Stamminger
Abstract:
Methods for Novel View Synthesis (NVS) have recently found traction in the field of LiDAR simulation and large-scale 3D scene reconstruction. While solutions for faster rendering or handling dynamic scenes have been proposed, LiDAR specific effects remain insufficiently addressed. By explicitly modeling sensor characteristics such as rolling shutter, laser power variations, and intensity falloff, our method achieves more accurate LiDAR simulation compared to existing techniques. We demonstrate the effectiveness of our approach through quantitative and qualitative comparisons with state-of-the-art methods, as well as ablation studies that highlight the importance of each sensor model component. Beyond that, we show that our approach exhibits advanced resimulation capabilities, such as generating high resolution LiDAR scans in the camera perspective.
Our code and the resulting dataset are available at https://github.com/richardmarcus/PBNLiDAR.
Chinese: 我们的方法通过模拟滚动快门和激光功率等传感器特性,提高了LiDAR仿真的准确性,在定量比较和生成相机视角高分辨率扫描等高级应用中均优于现有技术。
English: Our method enhances LiDAR simulation accuracy by modeling sensor characteristics like rolling shutter and laser power, outperforming existing techniques in both quantitative comparisons and advanced applications such as generating high-resolution scans from camera perspectives.
Authors:Dong Wang, Hanmo You, Lingwei Zhu, Kaiwei Lin, Zheng Chen, Chen Yang, Junji Yu, Zan Wang, Junjie Chen
Abstract:
Reinforcement Learning (RL) has emerged as a powerful paradigm for sequential decision-making and has attracted growing interest across various domains, particularly following the advent of Deep Reinforcement Learning (DRL) in 2015. Simultaneously, the rapid advancement of Large Language Models (LLMs) has further fueled interest in integrating RL with LLMs to enable more adaptive and intelligent systems. In the field of software engineering (SE), the increasing complexity of systems and the rising demand for automation have motivated researchers to apply RL to a broad range of tasks, from software design and development to quality assurance and maintenance. Despite growing research in RL-for-SE, there remains a lack of a comprehensive and systematic survey of this evolving field. To address this gap, we reviewed 115 peer-reviewed studies published across 22 premier SE venues since the introduction of DRL. We conducted a comprehensive analysis of publication trends, categorized SE topics and RL algorithms, and examined key factors such as dataset usage, model design and optimization, and evaluation practices. Furthermore, we identified open challenges and proposed future research directions to guide and inspire ongoing work in this evolving area. To summarize, this survey offers the first systematic mapping of RL applications in software engineering, aiming to support both researchers and practitioners in navigating the current landscape and advancing the field. Our artifacts are publicly available: https://github.com/KaiWei-Lin-lanina/RL4SE.
中文: 本综述首次系统梳理了强化学习在软件工程中的应用,通过分析115项研究对主题分类、算法使用和评估实践进行全面总结,并提出了该领域的未来研究方向。
English: This survey provides the first systematic mapping of reinforcement learning applications in software engineering, analyzing 115 studies to categorize topics, algorithms, and evaluation practices while identifying future research directions.
Authors:Ishraq Khan, Assad Chowdary, Sharoz Haseeb, Urvish Patel, Yousuf Zaii
Abstract:
Large Language Models (LLMs) have improved code generation and software automation, but remain limited by inference-time context and lack structured reasoning over code. Debugging remains unsolved despite these advances. While Claude Opus 4 and GPT-4.1 achieve >70% on code synthesis benchmarks, they perform <15% on real debugging tasks. We introduce Kodezi Chronos, a language model built specifically for debugging. Chronos combines Adaptive Graph-Guided Retrieval to navigate codebases up to 10 million lines using multi-hop traversal (92% precision, 85% recall), Persistent Debug Memory trained on 15M+ sessions, and a 7-layer architecture for iterative fix-test-refine loops. On 5,000 real-world scenarios, Chronos achieves 67.3% fix accuracy, compared to 14.2% and 13.8% for Claude and GPT-4.1 respectively. Chronos reduces debugging time by 40% and iteration count by 65%. It resolves complex multi-file bugs involving cross-repository context and temporal reasoning. Key limitations include 23.4% success on hardware-dependent issues and 41.2% on dynamic language errors. Theoretical analysis shows O(k log d) retrieval complexity with convergence guarantees. In a human evaluation (N=50), 89% of participants preferred Chronos over baseline models. Chronos will be available in Kodezi OS in Q4 2025 and via API in Q1 2026.
中文: Kodezi Chronos 作为专用于调试的语言模型,通过自适应代码库导航和持久调试记忆实现了67.3%的修复准确率,在5000个实际场景中显著优于Claude和GPT-4.1等通用模型,并将调试时间减少40%。
English: Kodezi Chronos is a specialized debugging language model that achieves 67.3% fix accuracy through adaptive codebase navigation and persistent debug memory, significantly outperforming general models like Claude and GPT-4.1 while reducing debugging time by 40%.
Authors:Muhammed Furkan Dasdelen, Hyesu Lim, Michele Buck, Katharina S. Götze, Carsten Marr, Steffen Schneider
Abstract:
Sparse autoencoders (SAEs) emerged as a promising tool for mechanistic interpretability of transformer-based foundation models. Very recently, SAEs were also adopted for the visual domain, enabling the discovery of visual concepts and their patch-wise attribution to tokens in the transformer model. While a growing number of foundation models emerged for medical imaging, tools for explaining their inferences are still lacking. In this work, we show the applicability of SAEs for hematology. We propose CytoSAE, a sparse autoencoder which is trained on over 40,000 peripheral blood single-cell images. CytoSAE generalizes to diverse and out-of-domain datasets, including bone marrow cytology, where it identifies morphologically relevant concepts which we validated with medical experts. Furthermore, we demonstrate scenarios in which CytoSAE can generate patient-specific and disease-specific concepts, enabling the detection of pathognomonic cells and localized cellular abnormalities at the patch level. We quantified the effect of concepts on a patient-level AML subtype classification task and show that CytoSAE concepts reach performance comparable to the state-of-the-art, while offering explainability on the sub-cellular level. Source code and model weights are available at https://github.com/dynamical-inference/cytosae.
中文: 本研究提出CytoSAE稀疏自编码器,基于4万余张血细胞图像训练,可识别并验证血液学中具有临床意义的视觉概念,实现可解释的疾病分类与细胞异常检测。
English: This study introduces CytoSAE, a sparse autoencoder trained on over 40,000 blood cell images, which identifies and validates clinically relevant visual concepts for hematology, enabling explainable disease classification and cellular abnormality detection.
Authors:Yuxi Xiao, Jianyuan Wang, Nan Xue, Nikita Karaev, Yuri Makarov, Bingyi Kang, Xing Zhu, Hujun Bao, Yujun Shen, Xiaowei Zhou
Abstract:
We present SpatialTrackerV2, a feed-forward 3D point tracking method for monocular videos. Going beyond modular pipelines built on off-the-shelf components for 3D tracking, our approach unifies the intrinsic connections between point tracking, monocular depth, and camera pose estimation into a high-performing and feedforward 3D point tracker. It decomposes world-space 3D motion into scene geometry, camera ego-motion, and pixel-wise object motion, with a fully differentiable and end-to-end architecture, allowing scalable training across a wide range of datasets, including synthetic sequences, posed RGB-D videos, and unlabeled in-the-wild footage. By learning geometry and motion jointly from such heterogeneous data, SpatialTrackerV2 outperforms existing 3D tracking methods by 30%, and matches the accuracy of leading dynamic 3D reconstruction approaches while running 50$\times$ faster.
中文:SpatialTrackerV2是一种前馈式三维点跟踪方法,通过端到端架构统一场景几何、相机运动和物体位移,性能比现有方法提升30%,在保持顶尖动态重建精度的同时处理速度加快50倍。
English: SpatialTrackerV2 is a unified, feed-forward 3D point tracking method that integrates scene geometry, camera motion, and object movement into an end-to-end architecture, achieving 30% higher performance than existing methods and matching top dynamic reconstruction accuracy with 50x faster processing.
Authors:Shangpin Peng, Senqiao Yang, Li Jiang, Zhuotao Tian
Abstract:
Multimodal large language models (MLLMs) have revolutionized cross-modal understanding but continue to struggle with hallucinations - fabricated content contradicting visual inputs. Existing hallucination mitigation methods either incur prohibitive computational costs or introduce distribution mismatches between training data and model outputs. We identify a critical insight: hallucinations predominantly emerge at the early stages of text generation and propagate through subsequent outputs. To address this, we propose SENTINEL (Sentence-level Early iNtervention Through IN-domain prEference Learning), a framework that eliminates dependency on human annotations. Specifically, we first bootstrap high-quality in-domain preference pairs by iteratively sampling model outputs, validating object existence through cross-checking with two open-vocabulary detectors, and classifying sentences into hallucinated/non-hallucinated categories. Subsequently, we use context-coherent positive samples and hallucinated negative samples to build context-aware preference data iteratively. Finally, we train models using a context-aware preference loss (C-DPO) that emphasizes discriminative learning at the sentence level where hallucinations initially manifest. Experimental results show that SENTINEL can reduce hallucinations by over 90% compared to the original model and outperforms the previous state-of-the-art method on both hallucination benchmarks and general capabilities benchmarks, demonstrating its superiority and generalization ability. The models, datasets, and code are available at https://github.com/pspdada/SENTINEL.
中文摘要:SENTINEL框架通过在文本生成的早期阶段进行句子级干预,利用自举构建的领域内偏好数据训练模型,有效减少了多模态大语言模型中的幻觉现象,在无需人工标注的情况下将幻觉降低超过90%,并在各项基准测试中保持卓越性能。
English Summary: The SENTINEL framework effectively reduces hallucinations in multimodal language models by detecting and correcting fabricated content at the sentence level during early text generation, achieving over 90% reduction without human annotations while maintaining superior performance on benchmark tests.
Authors:Yen-Linh Vu, Dinh-Thang Duong, Truong-Binh Duong, Anh-Khoi Nguyen, Thanh-Huy Nguyen, Le Thien Phuc Nguyen, Jianhua Xing, Xingjian Li, Tianyang Wang, Ulas Bagci, Min Xu
Abstract:
Recent progress has been made in region-aware vision-language modeling, particularly with the emergence of the Describe Anything Model (DAM). DAM is capable of generating detailed descriptions of any specific image areas or objects without the need for additional localized image-text alignment supervision. We hypothesize that such region-level descriptive capability is beneficial for the task of Visual Question Answering (VQA), especially in challenging scenarios involving images with dense text. In such settings, the fine-grained extraction of textual information is crucial to producing correct answers. Motivated by this, we introduce DAM-QA, a framework with a tailored evaluation protocol, developed to investigate and harness the region-aware capabilities from DAM for the text-rich VQA problem that requires reasoning over text-based information within images. DAM-QA incorporates a mechanism that aggregates answers from multiple regional views of image content, enabling more effective identification of evidence that may be tied to text-related elements. Experiments on six VQA benchmarks show that our approach consistently outperforms the baseline DAM, with a notable 7+ point gain on DocVQA. DAM-QA also achieves the best overall performance among region-aware models with fewer parameters, significantly narrowing the gap with strong generalist VLMs. These results highlight the potential of DAM-like models for text-rich and broader VQA tasks when paired with efficient usage and integration strategies. Our code is publicly available at https://github.com/Linvyl/DAM-QA.git.
中文摘要:DAM-QA框架利用描述任意模型(DAM)的区域感知能力,通过聚合图像多个区域的答案来增强视觉问答,尤其在文本密集图像上表现卓越,以更少参数在多项基准测试中取得领先性能。
English Summary: The DAM-QA framework leverages the region-aware capabilities of the Describe Anything Model to enhance Visual Question Answering, particularly for text-rich images, by aggregating answers from multiple regional views and achieving superior performance on benchmarks with fewer parameters.
Authors:Chandana Cheerla
Abstract:
Organizations increasingly rely on proprietary enterprise data, including HR records, structured reports, and tabular documents, for critical decision-making. While Large Language Models (LLMs) have strong generative capabilities, they are limited by static pretraining, short context windows, and challenges in processing heterogeneous data formats. Conventional Retrieval-Augmented Generation (RAG) frameworks address some of these gaps but often struggle with structured and semi-structured data.
This work proposes an advanced RAG framework that combines hybrid retrieval strategies using dense embeddings (all-mpnet-base-v2) and BM25, enhanced by metadata-aware filtering with SpaCy NER and cross-encoder reranking. The framework applies semantic chunking to maintain textual coherence and retains tabular data structures to preserve row-column integrity. Quantized indexing optimizes retrieval efficiency, while human-in-the-loop feedback and conversation memory improve adaptability.
Experiments on enterprise datasets show notable improvements: Precision@5 increased by 15 percent (90 versus 75), Recall@5 by 13 percent (87 versus 74), and Mean Reciprocal Rank by 16 percent (0.85 versus 0.69). Qualitative evaluations show higher scores in Faithfulness (4.6 versus 3.0), Completeness (4.2 versus 2.5), and Relevance (4.5 versus 3.2) on a 5-point Likert scale. These results demonstrate the framework's effectiveness in delivering accurate, comprehensive, and contextually relevant responses for enterprise tasks. Future work includes extending to multimodal data and integrating agent-based retrieval. The source code will be released at https://github.com/CheerlaChandana/Enterprise-Chatbot
中文: 该先进RAG框架通过混合检索与元数据过滤及语义分块相结合,显著提升了企业数据处理中的精确率、召回率和响应质量,在评估中表现优异。
English: This advanced RAG framework enhances enterprise data processing by combining hybrid retrieval with metadata filtering and semantic chunking, significantly improving precision, recall, and response quality in evaluations.
Authors:Andrea Perin, Giacomo Lagomarsini, Claudio Gallicchio, Giuseppe Nuti
Abstract:
We introduce a Mixture of Raytraced Experts, a stacked Mixture of Experts (MoE) architecture which can dynamically select sequences of experts, producing computational graphs of variable width and depth. Existing MoE architectures generally require a fixed amount of computation for a given sample. Our approach, in contrast, yields predictions with increasing accuracy as the computation cycles through the experts' sequence. We train our model by iteratively sampling from a set of candidate experts, unfolding the sequence akin to how Recurrent Neural Networks are trained. Our method does not require load-balancing mechanisms, and preliminary experiments show a reduction in training epochs of 10\% to 40\% with a comparable/higher accuracy. These results point to new research directions in the field of MoEs, allowing the design of potentially faster and more expressive models. The code is available at https://github.com/nutig/RayTracing
中文摘要:混合光线追踪专家是一种动态MoE架构,通过自适应选择专家序列实现计算量与精度同步提升,无需负载平衡即可加速训练并提高模型性能。
English Summary: The Mixture of Raytraced Experts is a dynamic MoE architecture that adaptively sequences experts to enhance accuracy with variable computation, achieving faster training and higher performance without load-balancing.
Authors:Jaehyun Kwak, Ramahdani Muhammad Izaaz Inhar, Se-Young Yun, Sung-Ju Lee
Abstract:
Composed Image Retrieval (CIR) retrieves relevant images based on a reference image and accompanying text describing desired modifications. However, existing CIR methods only focus on retrieving the target image and disregard the relevance of other images. This limitation arises because most methods employing contrastive learning-which treats the target image as positive and all other images in the batch as negatives-can inadvertently include false negatives. This may result in retrieving irrelevant images, reducing user satisfaction even when the target image is retrieved. To address this issue, we propose Query-Relevant Retrieval through Hard Negative Sampling (QuRe), which optimizes a reward model objective to reduce false negatives. Additionally, we introduce a hard negative sampling strategy that selects images positioned between two steep drops in relevance scores following the target image, to effectively filter false negatives. In order to evaluate CIR models on their alignment with human satisfaction, we create Human-Preference FashionIQ (HP-FashionIQ), a new dataset that explicitly captures user preferences beyond target retrieval. Extensive experiments demonstrate that QuRe achieves state-of-the-art performance on FashionIQ and CIRR datasets while exhibiting the strongest alignment with human preferences on the HP-FashionIQ dataset. The source code is available at https://github.com/jackwaky/QuRe.
Chinese: 提出的通过硬负样本采样的查询相关检索方法QuRe,通过优化奖励模型和采用硬负样本采样策略,解决了组合图像检索中的假阴性问题,在新旧数据集上均实现了最优性能,并与人类偏好更佳对齐。
English: The proposed Query-Relevant Retrieval through Hard Negative Sampling (QuRe) method addresses the issue of false negatives in composed image retrieval by optimizing a reward model and employing a hard negative sampling strategy, achieving state-of-the-art performance and better alignment with human preferences on new and existing datasets.
Authors:Kaiwen Huang, Yi Zhou, Huazhu Fu, Yizhe Zhang, Chen Gong, Tao Zhou
Abstract:
Semi-supervised medical image segmentation is a crucial technique for alleviating the high cost of data annotation. When labeled data is limited, textual information can provide additional context to enhance visual semantic understanding. However, research exploring the use of textual data to enhance visual semantic embeddings in 3D medical imaging tasks remains scarce. In this paper, we propose a novel text-driven multiplanar visual interaction framework for semi-supervised medical image segmentation (termed Text-SemiSeg), which consists of three main modules: Text-enhanced Multiplanar Representation (TMR), Category-aware Semantic Alignment (CSA), and Dynamic Cognitive Augmentation (DCA). Specifically, TMR facilitates text-visual interaction through planar mapping, thereby enhancing the category awareness of visual features. CSA performs cross-modal semantic alignment between the text features with introduced learnable variables and the intermediate layer of visual features. DCA reduces the distribution discrepancy between labeled and unlabeled data through their interaction, thus improving the model's robustness. Finally, experiments on three public datasets demonstrate that our model effectively enhances visual features with textual information and outperforms other methods. Our code is available at https://github.com/taozh2017/Text-SemiSeg.
中文摘要:本文提出Text-SemiSeg框架,通过文本驱动的多平面视觉交互和语义对齐技术提升半监督医学图像分割效果,在三个公开数据集上验证了其优越性。
English Summary: This paper introduces Text-SemiSeg, a novel text-driven framework that enhances 3D medical image segmentation through multiplanar visual-text interaction and semantic alignment, demonstrating superior performance over existing methods.
Authors:Diganta Misra, Nizar Islah, Victor May, Brice Rauby, Zihan Wang, Justine Gehring, Antonio Orvieto, Muawiz Chaudhary, Eilif B. Muller, Irina Rish, Samira Ebrahimi Kahou, Massimo Caccia
Abstract:
The rapid evolution of software libraries poses a considerable hurdle for code generation, necessitating continuous adaptation to frequent version updates while preserving backward compatibility. While existing code evolution benchmarks provide valuable insights, they typically lack execution-based evaluation for generating code compliant with specific library versions. To address this, we introduce GitChameleon 2.0, a novel, meticulously curated dataset comprising 328 Python code completion problems, each conditioned on specific library versions and accompanied by executable unit tests. GitChameleon 2.0 rigorously evaluates the capacity of contemporary large language models (LLMs), LLM-powered agents, code assistants, and RAG systems to perform version-conditioned code generation that demonstrates functional accuracy through execution. Our extensive evaluations indicate that state-of-the-art systems encounter significant challenges with this task; enterprise models achieving baseline success rates in the 48-51% range, underscoring the intricacy of the problem. By offering an execution-based benchmark emphasizing the dynamic nature of code libraries, GitChameleon 2.0 enables a clearer understanding of this challenge and helps guide the development of more adaptable and dependable AI code generation methods. We make the dataset and evaluation code publicly available at https://github.com/mrcabbage972/GitChameleonBenchmark.
Chinese: GitChameleon 2.0数据集通过提供可执行的单元测试来评估AI系统在特定版本代码生成方面的能力,揭示了当前模型在此任务上面临显著挑战。
English: The GitChameleon 2.0 dataset addresses the challenge of version-specific code generation by providing executable unit tests to evaluate AI systems, revealing that current models struggle significantly with this task.
Authors:M. Anwar Ma'sum, Mahardhika Pratama, Savitha Ramasamy, Lin Liu, Habibullah Habibullah, Ryszard Kowalczyk
Abstract:
The data privacy constraint in online continual learning (OCL), where the data can be seen only once, complicates the catastrophic forgetting problem in streaming data. A common approach applied by the current SOTAs in OCL is with the use of memory saving exemplars or features from previous classes to be replayed in the current task. On the other hand, the prompt-based approach performs excellently in continual learning but with the cost of a growing number of trainable parameters. The first approach may not be applicable in practice due to data openness policy, while the second approach has the issue of throughput associated with the streaming data. In this study, we propose a novel prompt-based method for online continual learning that includes 4 main components: (1) single light-weight prompt generator as a general knowledge, (2) trainable scaler-and-shifter as specific knowledge, (3) pre-trained model (PTM) generalization preserving, and (4) hard-soft updates mechanism. Our proposed method achieves significantly higher performance than the current SOTAs in CIFAR100, ImageNet-R, ImageNet-A, and CUB dataset. Our complexity analysis shows that our method requires a relatively smaller number of parameters and achieves moderate training time, inference time, and throughput. For further study, the source code of our method is available at https://github.com/anwarmaxsum/PROL.
Chinese: 本研究提出了一种新颖的在线持续学习提示方法,结合轻量级提示生成器、可训练的缩放移位器、预训练模型保持和硬软更新机制,以较少参数和适中计算成本实现了卓越性能。
English: This study introduces a novel prompt-based method for online continual learning that integrates a lightweight prompt generator, trainable scaler-shifter, pre-trained model preservation, and a hard-soft update mechanism, achieving superior performance with fewer parameters and moderate computational demands.
Authors:Feng Xiao, Jicong Fan
Abstract:
Text anomaly detection is a critical task in natural language processing (NLP), with applications spanning fraud detection, misinformation identification, spam detection and content moderation, etc. Despite significant advances in large language models (LLMs) and anomaly detection algorithms, the absence of standardized and comprehensive benchmarks for evaluating the existing anomaly detection methods on text data limits rigorous comparison and development of innovative approaches. This work performs a comprehensive empirical study and introduces a benchmark for text anomaly detection, leveraging embeddings from diverse pre-trained language models across a wide array of text datasets. Our work systematically evaluates the effectiveness of embedding-based text anomaly detection by incorporating (1) early language models (GloVe, BERT); (2) multiple LLMs (LLaMa-2, LLama-3, Mistral, OpenAI (small, ada, large)); (3) multi-domain text datasets (news, social media, scientific publications); (4) comprehensive evaluation metrics (AUROC, AUPRC). Our experiments reveal a critical empirical insight: embedding quality significantly governs anomaly detection efficacy, and deep learning-based approaches demonstrate no performance advantage over conventional shallow algorithms (e.g., KNN, Isolation Forest) when leveraging LLM-derived embeddings.In addition, we observe strongly low-rank characteristics in cross-model performance matrices, which enables an efficient strategy for rapid model evaluation (or embedding evaluation) and selection in practical applications. Furthermore, by open-sourcing our benchmark toolkit that includes all embeddings from different models and code at https://github.com/jicongfan/Text-Anomaly-Detection-Benchmark, this work provides a foundation for future research in robust and scalable text anomaly detection systems.
中文: 本研究构建了文本异常检测的综合基准,发现嵌入质量对性能至关重要且使用大语言模型嵌入时深度学习方法相比传统算法并无优势,同时提供了开源工具包以支持未来研究。
English: This study establishes a comprehensive benchmark for text anomaly detection, revealing that embedding quality is crucial for performance and deep learning models offer no advantage over traditional methods when using LLM embeddings, while also providing an open-source toolkit for future research.
Authors:Johann Frei, Nils Feldhus, Lisa Raithel, Roland Roller, Alexander Meyer, Frank Kramer
Abstract:
For clinical data integration and healthcare services, the HL7 FHIR standard has established itself as a desirable format for interoperability between complex health data. Previous attempts at automating the translation from free-form clinical notes into structured FHIR resources rely on modular, rule-based systems or LLMs with instruction tuning and constrained decoding. Since they frequently suffer from limited generalizability and structural inconformity, we propose an end-to-end framework powered by LLM agents, code execution, and healthcare terminology database tools to address these issues. Our solution, called Infherno, is designed to adhere to the FHIR document schema and competes well with a human baseline in predicting FHIR resources from unstructured text. The implementation features a front end for custom and synthetic data and both local and proprietary models, supporting clinical data integration processes and interoperability across institutions.
中文:Infherno框架通过结合LLM智能体、代码执行和医学术语库工具,能够将非结构化临床笔记准确转换为结构化FHIR资源,在保持标准兼容性的同时达到了接近人工基准的转换效果。
English: The proposed Infherno framework utilizes LLM agents, code execution, and healthcare terminology tools to accurately translate unstructured clinical notes into structured FHIR resources, outperforming previous methods and approaching human-level performance in ensuring interoperability.
Authors:Shilin Zhou, Zhenghua Li
Abstract:
While end-to-end Automatic Speech Recognition (ASR) models have shown impressive performance in transcribing general speech, they often struggle to accurately recognize contextually relevant keywords, such as proper nouns or user-specific entities.
Previous approaches have explored leveraging keyword dictionaries in the textual modality to improve keyword recognition, either through token-level fusion that guides token-by-token generation or phrase-level fusion that enables direct copying of keyword phrases.
However, these methods operate at different granularities and have their own limitations.
In this paper, we propose a novel multi-grained fusion approach that jointly leverages the strengths of both token-level and phrase-level fusion with Large Language Models (LLMs).
Our approach incorporates a late-fusion strategy that elegantly combines ASR's acoustic information with LLM's rich contextual knowledge, balancing fine-grained token precision with holistic phrase-level understanding.
Experiments on Chinese and English datasets demonstrate that our approach achieves state-of-the-art performance on keyword-related metrics while preserving high accuracy on non-keyword text.
Ablation studies further confirm that the token-level and phrase-level components both contribute significantly to the performance gains, complementing each other in our joint multi-grained framework.
The code and models will be publicly available at https://github.com/.
中文: 本文提出了一种新颖的多粒度融合方法,结合了词级和短语级策略与大型语言模型,显著提升了自动语音识别中的关键词识别性能,在中英文数据集上均取得了最优结果,同时保持了非关键词文本的高准确率。
English: This paper introduces a novel multi-grained fusion approach that combines token-level and phrase-level strategies with Large Language Models to enhance keyword recognition in Automatic Speech Recognition, achieving state-of-the-art performance on both Chinese and English datasets while maintaining general transcription accuracy.
Authors:Felix Nützel, Mischa Dombrowski, Bernhard Kainz
Abstract:
Phrase grounding, i.e., mapping natural language phrases to specific image regions, holds significant potential for disease localization in medical imaging through clinical reports. While current state-of-the-art methods rely on discriminative, self-supervised contrastive models, we demonstrate that generative text-to-image diffusion models, leveraging cross-attention maps, can achieve superior zero-shot phrase grounding performance. Contrary to prior assumptions, we show that fine-tuning diffusion models with a frozen, domain-specific language model, such as CXR-BERT, substantially outperforms domain-agnostic counterparts. This setup achieves remarkable improvements, with mIoU scores doubling those of current discriminative methods. These findings highlight the underexplored potential of generative models for phrase grounding tasks. To further enhance performance, we introduce Bimodal Bias Merging (BBM), a novel post-processing technique that aligns text and image biases to identify regions of high certainty. BBM refines cross-attention maps, achieving even greater localization accuracy. Our results establish generative approaches as a more effective paradigm for phrase grounding in the medical imaging domain, paving the way for more robust and interpretable applications in clinical practice. The source code and model weights are available at https://github.com/Felix-012/generate_to_ground.
中文: 基于生成式文本到图像扩散模型,结合跨注意力机制和领域特定语言模型微调,在医学影像短语定位任务中实现了卓越的零样本性能,其mIoU分数较现有判别方法提升一倍,并通过新型双模态偏置融合技术进一步提升了定位精度。
English: Generative text-to-image diffusion models, enhanced with cross-attention maps and fine-tuned using domain-specific language models, achieve superior zero-shot phrase grounding performance in medical imaging, doubling mIoU scores over current discriminative methods and introducing Bimodal Bias Merging for further accuracy improvements.
Authors:Azhar Ikhtiarudin, Aditi Das, Param Thakkar, Akash Kundu
Abstract:
We introduce BenchRL-QAS, a unified benchmarking framework for systematically evaluating reinforcement learning (RL) algorithms in quantum architecture search (QAS) across diverse variational quantum algorithm tasks and system sizes ranging from 2- to 8-qubit. Our study benchmarks nine RL agents including both value-based and policy-gradient methods on representative quantum problems such as variational quantum eigensolver, variational quantum state diagonalization, quantum classification, and state preparation, spanning both noiseless and realistic noisy regimes. We propose a weighted ranking metric that balances accuracy, circuit depth, gate count, and computational efficiency, enabling fair and comprehensive comparison. Our results first reveal that RL-based quantum classifier outperforms baseline variational classifiers. Then we conclude that no single RL algorithm is universally optimal when considering a set of QAS tasks; algorithmic performance is highly context-dependent, varying with task structure, qubit count, and noise. This empirical finding provides strong evidence for the "no free lunch" principle in RL-based quantum circuit design and highlights the necessity of tailored algorithm selection and systematic benchmarking for advancing quantum circuit synthesis. This work represents the most comprehensive RL-QAS benchmarking effort to date, and BenchRL-QAS along with all experimental data are made publicly available to support reproducibility and future research https://github.com/azhar-ikhtiarudin/bench-rlqas.
中文摘要:BenchRL-QAS是一个统一的量子架构搜索强化学习基准框架,通过系统评估九种不同智能体在多种量子任务中的表现,证明不存在通用最优方法,且性能受任务类型和噪声条件影响。
English Summary: BenchRL-QAS is a comprehensive benchmarking framework that evaluates nine RL agents across various quantum tasks, revealing no universally superior method and demonstrating task-dependent performance under different conditions.
Authors:Azhar Ikhtiarudin, Aditi Das, Param Thakkar, Akash Kundu
Abstract:
We present BenchRL-QAS, a unified benchmarking framework for reinforcement learning (RL) in quantum architecture search (QAS) across a spectrum of variational quantum algorithm tasks on 2- to 8-qubit systems. Our study systematically evaluates 9 different RL agents, including both value-based and policy-gradient methods, on quantum problems such as variational eigensolver, quantum state diagonalization, variational quantum classification (VQC), and state preparation, under both noiseless and noisy execution settings. To ensure fair comparison, we propose a weighted ranking metric that integrates accuracy, circuit depth, gate count, and training time. Results demonstrate that no single RL method dominates universally, the performance dependents on task type, qubit count, and noise conditions providing strong evidence of no free lunch principle in RL-QAS. As a byproduct we observe that a carefully chosen RL algorithm in RL-based VQC outperforms baseline VQCs. BenchRL-QAS establishes the most extensive benchmark for RL-based QAS to date, codes and experimental made publicly available for reproducibility and future advances.
中文摘要:BenchRL-QAS是一个统一的量子架构搜索强化学习基准框架,通过系统评估九种不同智能体在多种量子任务中的表现,证明不存在通用最优方法,且性能受任务类型和噪声条件影响。
English Summary: BenchRL-QAS is a comprehensive benchmarking framework that evaluates nine RL agents across various quantum tasks, revealing no universally superior method and demonstrating task-dependent performance under different conditions.
Authors:Shuangli Du, Siming Yan, Zhenghao Shi, Zhenzhen You, Lu Sun
Abstract:
Low-light images suffer from complex degradation, and existing enhancement methods often encode all degradation factors within a single latent space. This leads to highly entangled features and strong black-box characteristics, making the model prone to shortcut learning. To mitigate the above issues, this paper proposes a wavelet-based low-light stereo image enhancement method with feature space decoupling. Our insight comes from the following findings: (1) Wavelet transform enables the independent processing of low-frequency and high-frequency information. (2) Illumination adjustment can be achieved by adjusting the low-frequency component of a low-light image, extracted through multi-level wavelet decomposition. Thus, by using wavelet transform the feature space is decomposed into a low-frequency branch for illumination adjustment and multiple high-frequency branches for texture enhancement. Additionally, stereo low-light image enhancement can extract useful cues from another view to improve enhancement. To this end, we propose a novel high-frequency guided cross-view interaction module (HF-CIM) that operates within high-frequency branches rather than across the entire feature space, effectively extracting valuable image details from the other view. Furthermore, to enhance the high-frequency information, a detail and texture enhancement module (DTEM) is proposed based on cross-attention mechanism. The model is trained on a dataset consisting of images with uniform illumination and images with non-uniform illumination. Experimental results on both real and synthetic images indicate that our algorithm offers significant advantages in light adjustment while effectively recovering high-frequency information. The code and dataset are publicly available at: https://github.com/Cherisherr/WDCI-Net.git.
中文摘要:本文提出了一种基于小波变换的低光立体图像增强方法,通过将特征空间解耦为低频和高频分支,并利用跨视角交互与纹理增强模块,有效改善了光照调整与细节恢复效果。
English Summary: This paper introduces a wavelet-based method that decouples feature space into low-frequency and high-frequency branches for stereo low-light image enhancement, utilizing cross-view interaction and texture enhancement modules to improve illumination adjustment and detail recovery.
Authors:Xiucheng Wang, Qiming Zhang, Nan Cheng, Junting Chen, Zezhong Zhang, Zan Li, Shuguang Cui, Xuemin Shen
Abstract:
Radio maps (RMs) serve as a critical foundation for enabling environment-aware wireless communication, as they provide the spatial distribution of wireless channel characteristics. Despite recent progress in RM construction using data-driven approaches, most existing methods focus solely on pathloss prediction in a fixed 2D plane, neglecting key parameters such as direction of arrival (DoA), time of arrival (ToA), and vertical spatial variations. Such a limitation is primarily due to the reliance on static learning paradigms, which hinder generalization beyond the training data distribution. To address these challenges, we propose UrbanRadio3D, a large-scale, high-resolution 3D RM dataset constructed via ray tracing in realistic urban environments. UrbanRadio3D is over 37$\times$3 larger than previous datasets across a 3D space with 3 metrics as pathloss, DoA, and ToA, forming a novel 3D$\times$33D dataset with 7$\times$3 more height layers than prior state-of-the-art (SOTA) dataset. To benchmark 3D RM construction, a UNet with 3D convolutional operators is proposed. Moreover, we further introduce RadioDiff-3D, a diffusion-model-based generative framework utilizing the 3D convolutional architecture. RadioDiff-3D supports both radiation-aware scenarios with known transmitter locations and radiation-unaware settings based on sparse spatial observations. Extensive evaluations on UrbanRadio3D validate that RadioDiff-3D achieves superior performance in constructing rich, high-dimensional radio maps under diverse environmental dynamics. This work provides a foundational dataset and benchmark for future research in 3D environment-aware communication. The dataset is available at https://github.com/UNIC-Lab/UrbanRadio3D.
中文: 本文提出UrbanRadio3D这一大规模高分辨率三维无线电地图数据集,并开发了基于扩散模型的RadioDiff-3D框架,在复杂环境动态下实现了高维无线电地图构建的卓越性能,为环境感知通信研究提供了基础数据集和基准。
English: This paper introduces UrbanRadio3D, a large-scale 3D radio map dataset with enhanced resolution and metrics, and proposes RadioDiff-3D, a diffusion-based model that achieves superior performance in constructing high-dimensional radio maps for environment-aware communication.
Authors:Sergey Linok, Gleb Naumov
Abstract:
We propose OVIGo-3DHSG method - Open-Vocabulary Indoor Grounding of objects using 3D Hierarchical Scene Graph. OVIGo-3DHSG represents an extensive indoor environment over a Hierarchical Scene Graph derived from sequences of RGB-D frames utilizing a set of open-vocabulary foundation models and sensor data processing. The hierarchical representation explicitly models spatial relations across floors, rooms, locations, and objects. To effectively address complex queries involving spatial reference to other objects, we integrate the hierarchical scene graph with a Large Language Model for multistep reasoning. This integration leverages inter-layer (e.g., room-to-object) and intra-layer (e.g., object-to-object) connections, enhancing spatial contextual understanding. We investigate the semantic and geometry accuracy of hierarchical representation on Habitat Matterport 3D Semantic multi-floor scenes. Our approach demonstrates efficient scene comprehension and robust object grounding compared to existing methods. Overall OVIGo-3DHSG demonstrates strong potential for applications requiring spatial reasoning and understanding of indoor environments. Related materials can be found at https://github.com/linukc/OVIGo-3DHSG.
中文: OVIGo-3DHSG方法通过三维分层场景图与大语言模型相结合,实现了开放词汇的室内物体定位与空间推理,在场景理解能力上显著优于现有方法。
English: The OVIGo-3DHSG method utilizes a 3D hierarchical scene graph and large language models to enable open-vocabulary object grounding with enhanced spatial reasoning in indoor environments, demonstrating superior scene comprehension compared to existing approaches.
Authors:Ye Han, Lijun Zhang, Dejian Meng, Zhuang Zhang
Abstract:
The exploration-exploitation trade-off constitutes one of the fundamental challenges in reinforcement learning (RL), which is exacerbated in multi-agent reinforcement learning (MARL) due to the exponential growth of joint state-action spaces. This paper proposes a topology-enhanced MARL (TPE-MARL) method for optimizing cooperative decision-making of connected and autonomous vehicles (CAVs) in mixed traffic. This work presents two primary contributions: First, we construct a game topology tensor for dynamic traffic flow, effectively compressing high-dimensional traffic state information and decrease the search space for MARL algorithms. Second, building upon the designed game topology tensor and using QMIX as the backbone RL algorithm, we establish a topology-enhanced MARL framework incorporating visit counts and agent mutual information. Extensive simulations across varying traffic densities and CAV penetration rates demonstrate the effectiveness of TPE-MARL. Evaluations encompassing training dynamics, exploration patterns, macroscopic traffic performance metrics, and microscopic vehicle behaviors reveal that TPE-MARL successfully balances exploration and exploitation. Consequently, it exhibits superior performance in terms of traffic efficiency, safety, decision smoothness, and task completion. Furthermore, the algorithm demonstrates decision-making rationality comparable to or exceeding that of human drivers in both mixed-autonomy and fully autonomous traffic scenarios. Code of our work is available at \href{https://github.com/leoPub/tpemarl}{https://github.com/leoPub/tpemarl}.
中文: 本文提出了一种拓扑增强多智能体强化学习方法,通过构建博弈拓扑张量和结合访问计数与智能体互信息,有效解决了混合交通中协同决策的探索-利用权衡问题,在多种交通场景下展现出卓越的交通效率与安全性。
English: This paper introduces a topology-enhanced multi-agent reinforcement learning (TPE-MARL) method that effectively balances exploration and exploitation for cooperative decision-making in connected autonomous vehicles, demonstrating superior performance in traffic efficiency and safety across various scenarios.
Authors:Yiquan Gao, Duohui Xu
Abstract:
Biomedical segmentation networks easily suffer from the unexpected misclassification between foreground and background objects when learning on limited and imperfect medical datasets. Inspired by the strong power of Out-of-Distribution (OoD) data on other visual tasks, we propose a data-centric framework, Med-OoD to address this issue by introducing OoD data supervision into fully-supervised biomedical segmentation with none of the following needs: (i) external data sources, (ii) feature regularization objectives, (iii) additional annotations. Our method can be seamlessly integrated into segmentation networks without any modification on the architectures. Extensive experiments show that Med-OoD largely prevents various segmentation networks from the pixel misclassification on medical images and achieves considerable performance improvements on Lizard dataset. We also present an emerging learning paradigm of training a medical segmentation network completely using OoD data devoid of foreground class labels, surprisingly turning out 76.1% mIoU as test result. We hope this learning paradigm will attract people to rethink the roles of OoD data. Code is made available at https://github.com/StudioYG/Med-OoD.
中文:Med-OoD框架利用分布外数据提升生物医学分割的准确性,无需外部数据或网络结构调整即可显著减少误分类,并在医学数据集上实现性能提升。
English: The Med-OoD framework leverages Out-of-Distribution data to enhance biomedical segmentation accuracy without requiring external data or network modifications, significantly reducing misclassification and improving performance on medical datasets.
Authors:Nataliia Molchanova, Alessandro Cagol, Mario Ocampo-Pineda, Po-Jui Lu, Matthias Weigel, Xinjie Chen, Erin Beck, Charidimos Tsagkas, Daniel Reich, Colin Vanden Bulcke, Anna Stolting, Serena Borrelli, Pietro Maggi, Adrien Depeursinge, Cristina Granziera, Henning Mueller, Pedro M. Gordaliza, Meritxell Bach Cuadra
Abstract:
Cortical lesions (CLs) have emerged as valuable biomarkers in multiple sclerosis (MS), offering high diagnostic specificity and prognostic relevance. However, their routine clinical integration remains limited due to subtle magnetic resonance imaging (MRI) appearance, challenges in expert annotation, and a lack of standardized automated methods. We propose a comprehensive multi-centric benchmark of CL detection and segmentation in MRI. A total of 656 MRI scans, including clinical trial and research data from four institutions, were acquired at 3T and 7T using MP2RAGE and MPRAGE sequences with expert-consensus annotations. We rely on the self-configuring nnU-Net framework, designed for medical imaging segmentation, and propose adaptations tailored to the improved CL detection. We evaluated model generalization through out-of-distribution testing, demonstrating strong lesion detection capabilities with an F1-score of 0.64 and 0.5 in and out of the domain, respectively. We also analyze internal model features and model errors for a better understanding of AI decision-making. Our study examines how data variability, lesion ambiguity, and protocol differences impact model performance, offering future recommendations to address these barriers to clinical adoption. To reinforce the reproducibility, the implementation and models will be publicly accessible and ready to use at https://github.com/Medical-Image-Analysis-Laboratory/ and https://doi.org/10.5281/zenodo.15911797.
中文: 本研究提出一个基于MRI数据的皮质病变检测与分割基准,采用改进的nnU-Net模型展现出稳定性能,并公开工具以促进临床应用。
English: This study introduces a benchmark for detecting and segmenting cortical lesions in multiple sclerosis using MRI data, employing an adapted nnU-Net model that demonstrates robust performance and provides public access to tools to enhance clinical adoption.
Authors:Xiang Yu, Xinyao Liu, Guang Liang
Abstract:
Tracking small, agile multi-objects (SMOT), such as birds, from an Unmanned Aerial Vehicle (UAV) perspective is a highly challenging computer vision task. The difficulty stems from three main sources: the extreme scarcity of target appearance features, the complex motion entanglement caused by the combined dynamics of the camera and the targets themselves, and the frequent occlusions and identity ambiguity arising from dense flocking behavior. This paper details our championship-winning solution in the MVA 2025 "Finding Birds" Small Multi-Object Tracking Challenge (SMOT4SB), which adopts the tracking-by-detection paradigm with targeted innovations at both the detection and association levels. On the detection side, we propose a systematic training enhancement framework named \textbf{SliceTrain}. This framework, through the synergy of 'deterministic full-coverage slicing' and 'slice-level stochastic augmentation, effectively addresses the problem of insufficient learning for small objects in high-resolution image training. On the tracking side, we designed a robust tracker that is completely independent of appearance information. By integrating a \textbf{motion direction maintenance (EMA)} mechanism and an \textbf{adaptive similarity metric} combining \textbf{bounding box expansion and distance penalty} into the OC-SORT framework, our tracker can stably handle irregular motion and maintain target identities. Our method achieves state-of-the-art performance on the SMOT4SB public test set, reaching an SO-HOTA score of \textbf{55.205}, which fully validates the effectiveness and advancement of our framework in solving complex real-world SMOT problems. The source code will be made available at https://github.com/Salvatore-Love/YOLOv8-SMOT.
中文摘要:本文提出了一种针对无人机视角下小型敏捷物体跟踪的冠军解决方案,通过系统性训练增强框架提升检测能力,并设计了不依赖外观特征的鲁棒跟踪器来应对复杂运动模式。
English Summary: This paper presents a championship-winning solution for tracking small, agile objects from UAVs by enhancing detection through a systematic training framework and developing a robust tracker that handles motion complexities without relying on appearance features.
Authors:Giuliano Martinelli, Tommaso Bonomo, Pere-LluÃs Huguet Cabot, Roberto Navigli
Abstract:
Coreference Resolution systems are typically evaluated on benchmarks containing small- to medium-scale documents. When it comes to evaluating long texts, however, existing benchmarks, such as LitBank, remain limited in length and do not adequately assess system capabilities at the book scale, i.e., when co-referring mentions span hundreds of thousands of tokens. To fill this gap, we first put forward a novel automatic pipeline that produces high-quality Coreference Resolution annotations on full narrative texts. Then, we adopt this pipeline to create the first book-scale coreference benchmark, BOOKCOREF, with an average document length of more than 200,000 tokens. We carry out a series of experiments showing the robustness of our automatic procedure and demonstrating the value of our resource, which enables current long-document coreference systems to gain up to +20 CoNLL-F1 points when evaluated on full books. Moreover, we report on the new challenges introduced by this unprecedented book-scale setting, highlighting that current models fail to deliver the same performance they achieve on smaller documents. We release our data and code to encourage research and development of new book-scale Coreference Resolution systems at https://github.com/sapienzanlp/bookcoref.
中文摘要:本文提出了首个书籍规模的指代消解基准BOOKCOREF,通过创新的自动标注流程构建,揭示了现有系统在处理长文本时的性能局限,并为提升指代消解模型在全书尺度上的表现提供了重要资源。
English Summary: This paper introduces BOOKCOREF, the first book-scale coreference resolution benchmark created through a novel automatic annotation pipeline, which reveals significant performance gaps in current systems when handling long texts and enables substantial improvements in evaluation metrics.
Authors:Hongxu Ma, Guanshuo Wang, Fufu Yu, Qiong Jia, Shouhong Ding
Abstract:
Video Moment Retrieval (MR) and Highlight Detection (HD) aim to pinpoint specific moments and assess clip-wise relevance based on the text query. While DETR-based joint frameworks have made significant strides, there remains untapped potential in harnessing the intricate relationships between temporal motion and spatial semantics within video content. In this paper, we propose the Motion-Semantics DETR (MS-DETR), a framework that captures rich motion-semantics features through unified learning for MR/HD tasks. The encoder first explicitly models disentangled intra-modal correlations within motion and semantics dimensions, guided by the given text queries. Subsequently, the decoder utilizes the task-wise correlation across temporal motion and spatial semantics dimensions to enable precise query-guided localization for MR and refined highlight boundary delineation for HD. Furthermore, we observe the inherent sparsity dilemma within the motion and semantics dimensions of MR/HD datasets. To address this issue, we enrich the corpus from both dimensions by generation strategies and propose contrastive denoising learning to ensure the above components learn robustly and effectively. Extensive experiments on four MR/HD benchmarks demonstrate that our method outperforms existing state-of-the-art models by a margin. Our code is available at https://github.com/snailma0229/MS-DETR.git.
中文: 本文提出的MS-DETR框架通过统一学习视频中的运动与语义特征,结合对比去噪和数据增强策略,在时刻检索与高光检测任务上实现了超越现有最优模型的性能表现。
English: The paper introduces MS-DETR, a framework that leverages unified learning of motion and semantics features to enhance Video Moment Retrieval and Highlight Detection, achieving superior performance on benchmarks through contrastive denoising and data enrichment strategies.
Authors:Beining Xu, Siting Zhu, Hesheng Wang
Abstract:
We propose SGLoc, a novel localization system that directly regresses camera poses from 3D Gaussian Splatting (3DGS) representation by leveraging semantic information. Our method utilizes the semantic relationship between 2D image and 3D scene representation to estimate the 6DoF pose without prior pose information. In this system, we introduce a multi-level pose regression strategy that progressively estimates and refines the pose of query image from the global 3DGS map, without requiring initial pose priors. Moreover, we introduce a semantic-based global retrieval algorithm that establishes correspondences between 2D (image) and 3D (3DGS map). By matching the extracted scene semantic descriptors of 2D query image and 3DGS semantic representation, we align the image with the local region of the global 3DGS map, thereby obtaining a coarse pose estimation. Subsequently, we refine the coarse pose by iteratively optimizing the difference between the query image and the rendered image from 3DGS. Our SGLoc demonstrates superior performance over baselines on 12scenes and 7scenes datasets, showing excellent capabilities in global localization without initial pose prior. Code will be available at https://github.com/IRMVLab/SGLoc.
中文: SGLoc是一种新颖的定位系统,通过利用语义信息直接从3D高斯泼溅表示中回归相机位姿,在无需初始位姿先验的情况下,在基准数据集上展现出卓越性能。
English: SGLoc is a novel localization system that directly regresses camera poses from 3D Gaussian Splatting representation using semantic information, achieving superior performance on benchmark datasets without requiring initial pose priors.
Authors:Yuechen Xie, Jie Song, Yicheng Shan, Xiaoyan Zhang, Yuanyu Wan, Shengxuming Zhang, Jiarui Duan, Mingli Song
Abstract:
High-quality open-source datasets have emerged as a pivotal catalyst driving the swift advancement of deep learning, while facing the looming threat of potential exploitation. Protecting these datasets is of paramount importance for the interests of their owners. The verification of dataset ownership has evolved into a crucial approach in this domain; however, existing verification techniques are predominantly tailored to supervised models and contrastive pre-trained models, rendering them ill-suited for direct application to the increasingly prevalent masked models. In this work, we introduce the inaugural methodology addressing this critical, yet unresolved challenge, termed Dataset Ownership Verification for Masked Modeling (DOV4MM). The central objective is to ascertain whether a suspicious black-box model has been pre-trained on a particular unlabeled dataset, thereby assisting dataset owners in safeguarding their rights. DOV4MM is grounded in our empirical observation that when a model is pre-trained on the target dataset, the difficulty of reconstructing masked information within the embedding space exhibits a marked contrast to models not pre-trained on that dataset. We validated the efficacy of DOV4MM through ten masked image models on ImageNet-1K and four masked language models on WikiText-103. The results demonstrate that DOV4MM rejects the null hypothesis, with a $p$-value considerably below 0.05, surpassing all prior approaches. Code is available at https://github.com/xieyc99/DOV4MM.
Chinese: 本研究提出的DOV4MM方法通过利用在目标数据集上预训练的模型中对掩码信息重建难度的显著差异,解决了掩码模型数据集所有权验证这一关键挑战,其性能显著优于现有方法。
English: The proposed DOV4MM method addresses the critical challenge of verifying dataset ownership for masked models by exploiting the distinct reconstruction difficulty of masked information in models pre-trained on the target dataset, demonstrating superior performance over existing techniques.
Authors:Linwei Chen, Lin Gu, Ying Fu
Abstract:
Vision Transformers (ViTs) have significantly advanced computer vision, demonstrating strong performance across various tasks. However, the attention mechanism in ViTs makes each layer function as a low-pass filter, and the stacked-layer architecture in existing transformers suffers from frequency vanishing. This leads to the loss of critical details and textures. We propose a novel, circuit-theory-inspired strategy called Frequency-Dynamic Attention Modulation (FDAM), which can be easily plugged into ViTs. FDAM directly modulates the overall frequency response of ViTs and consists of two techniques: Attention Inversion (AttInv) and Frequency Dynamic Scaling (FreqScale). Since circuit theory uses low-pass filters as fundamental elements, we introduce AttInv, a method that generates complementary high-pass filtering by inverting the low-pass filter in the attention matrix, and dynamically combining the two. We further design FreqScale to weight different frequency components for fine-grained adjustments to the target response function. Through feature similarity analysis and effective rank evaluation, we demonstrate that our approach avoids representation collapse, leading to consistent performance improvements across various models, including SegFormer, DeiT, and MaskDINO. These improvements are evident in tasks such as semantic segmentation, object detection, and instance segmentation. Additionally, we apply our method to remote sensing detection, achieving state-of-the-art results in single-scale settings. The code is available at https://github.com/Linwei-Chen/FDAM.
中文: 提出的频率动态注意力调制(FDAM)通过互补高通滤波和动态频率缩放解决视觉Transformer中的频率消失问题,在多种视觉任务中实现性能提升且避免表示崩溃。
English: The proposed Frequency-Dynamic Attention Modulation (FDAM) enhances Vision Transformers by addressing frequency vanishing through complementary high-pass filtering and dynamic frequency scaling, resulting in improved performance across multiple vision tasks without representation collapse.
Authors:Hao Li, Ju Dai, Feng Zhou, Kaida Ning, Lei Li, Junjun Pan
Abstract:
While 3D facial animation has made impressive progress, challenges still exist in realizing fine-grained stylized 3D facial expression manipulation due to the lack of appropriate datasets. In this paper, we introduce the AUBlendSet, a 3D facial dataset based on AU-Blendshape representation for fine-grained facial expression manipulation across identities. AUBlendSet is a blendshape data collection based on 32 standard facial action units (AUs) across 500 identities, along with an additional set of facial postures annotated with detailed AUs. Based on AUBlendSet, we propose AUBlendNet to learn AU-Blendshape basis vectors for different character styles. AUBlendNet predicts, in parallel, the AU-Blendshape basis vectors of the corresponding style for a given identity mesh, thereby achieving stylized 3D emotional facial manipulation. We comprehensively validate the effectiveness of AUBlendSet and AUBlendNet through tasks such as stylized facial expression manipulation, speech-driven emotional facial animation, and emotion recognition data augmentation. Through a series of qualitative and quantitative experiments, we demonstrate the potential and importance of AUBlendSet and AUBlendNet in 3D facial animation tasks. To the best of our knowledge, AUBlendSet is the first dataset, and AUBlendNet is the first network for continuous 3D facial expression manipulation for any identity through facial AUs. Our source code is available at https://github.com/wslh852/AUBlendNet.git.
中文: 本文提出了首个基于动作单元混合形状的3D人脸数据集AUBlendSet,并开发了AUBlendNet网络,通过学习不同风格的动作单元基向量,实现了跨身份的风格化3D面部表情操控。
English: This paper introduces AUBlendSet, the first 3D facial dataset using AU-Blendshape representation, and AUBlendNet, a novel network that enables stylized 3D facial expression manipulation across identities by learning action unit basis vectors.
Authors:Jiahao Xia, Yike Wu, Wenjian Huang, Jianguo Zhang, Jian Zhang
Abstract:
Part-level features are crucial for image understanding, but few studies focus on them because of the lack of fine-grained labels. Although unsupervised part discovery can eliminate the reliance on labels, most of them cannot maintain robustness across various categories and scenarios, which restricts their application range. To overcome this limitation, we present a more effective paradigm for unsupervised part discovery, named Masked Part Autoencoder (MPAE). It first learns part descriptors as well as a feature map from the inputs and produces patch features from a masked version of the original images. Then, the masked regions are filled with the learned part descriptors based on the similarity between the local features and descriptors. By restoring these masked patches using the part descriptors, they become better aligned with their part shapes, guided by appearance features from unmasked patches. Finally, MPAE robustly discovers meaningful parts that closely match the actual object shapes, even in complex scenarios. Moreover, several looser yet more effective constraints are proposed to enable MPAE to identify the presence of parts across various scenarios and categories in an unsupervised manner. This provides the foundation for addressing challenges posed by occlusion and for exploring part similarity across multiple categories. Extensive experiments demonstrate that our method robustly discovers meaningful parts across various categories and scenarios. The code is available at the project https://github.com/Jiahao-UTS/MPAE.
Chinese: 掩码部件自编码器(MPAE)提出了一种无监督方法,通过学习部件描述符并重构掩码图像块,能在各种类别和场景中稳健地发现有意义部件,克服了以往方法的局限性。
English: The Masked Part Autoencoder (MPAE) introduces an unsupervised method that robustly discovers meaningful parts across diverse categories and scenarios by learning part descriptors and reconstructing masked image patches, overcoming limitations of previous approaches.
Authors:Artem Alekseev, Mikhail Chaichuk, Miron Butko, Alexander Panchenko, Elena Tutubalina, Oleg Somov
Abstract:
Large language models excel in question-answering (QA) yet still struggle with multi-hop reasoning and temporal questions. Query-based knowledge graph QA (KGQA) offers a modular alternative by generating executable queries instead of direct answers. We explore multi-stage query-based framework for WikiData QA, proposing multi-stage approach that enhances performance on challenging multi-hop and temporal benchmarks. Through generalization and rejection studies, we evaluate robustness across multi-hop and temporal QA datasets. Additionally, we introduce a novel entity linking and predicate matching method using CoT reasoning. Our results demonstrate the potential of query-based multi-stage KGQA framework for improving multi-hop and temporal QA with small language models. Code and data: https://github.com/ar2max/NLDB-KGQA-System
中文摘要:该研究提出了一种基于查询的多阶段知识图谱问答框架,通过结合新颖的实体链接和谓词匹配方法,利用小型语言模型有效提升了多跳推理和时间推理问题的处理能力。
English Summary: The study presents a multi-stage query-based knowledge graph QA framework that improves multi-hop and temporal reasoning using small language models, incorporating novel entity linking and predicate matching methods.
Authors:Jianzhe Ma, Wenxuan Wang, Qin Jin
Abstract:
Geometry problem solving, a crucial aspect of mathematical reasoning, is vital across various domains, including education, the assessment of AI's mathematical abilities, and multimodal capability evaluation. The recent surge in deep learning technologies, particularly the emergence of multimodal large language models, has significantly accelerated research in this area. This paper provides a survey of the applications of deep learning in geometry problem solving, including (i) a comprehensive summary of the relevant tasks in geometry problem solving; (ii) a thorough review of related deep learning methods; (iii) a detailed analysis of evaluation metrics and methods; and (iv) a critical discussion of the current challenges and future directions that can be explored. Our objective is to offer a comprehensive and practical reference of deep learning for geometry problem solving, thereby fostering further advancements in this field. We create a continuously updated list of papers on GitHub: https://github.com/majianz/dl4gps.
中文: 本文综述了深度学习在几何解题中的应用,涵盖任务、方法、评估指标及未来挑战,旨在推动该领域发展。
English: This paper surveys deep learning applications in geometry problem solving, covering tasks, methods, evaluation metrics, and future challenges to advance the field.
Authors:Shuichiro Nishigori, Koichi Saito, Naoki Murata, Masato Hirano, Shusuke Takahashi, Yuki Mitsufuji
Abstract:
Speech enhancement (SE) utilizing diffusion models is a promising technology that improves speech quality in noisy speech data. Furthermore, the Schrödinger bridge (SB) has recently been used in diffusion-based SE to improve speech quality by resolving a mismatch between the endpoint of the forward process and the starting point of the reverse process. However, the SB still exhibits slow inference owing to the necessity of a large number of function evaluations (NFE) for inference to obtain high-quality results. While Consistency Models (CMs) address this issue by employing consistency training that uses distillation from pretrained models in the field of image generation, it does not improve generation quality when the number of steps increases. As a solution to this problem, Consistency Trajectory Models (CTMs) not only accelerate inference speed but also maintain a favorable trade-off between quality and speed. Furthermore, SoundCTM demonstrates the applicability of CTM techniques to the field of sound generation. In this paper, we present Schrödinger bridge Consistency Trajectory Models (SBCTM) by applying the CTM's technique to the Schrödinger bridge for SE. Additionally, we introduce a novel auxiliary loss, including a perceptual loss, into the original CTM's training framework. As a result, SBCTM achieves an approximately 16x improvement in the real-time factor (RTF) compared to the conventional Schrödinger bridge for SE. Furthermore, the favorable trade-off between quality and speed in SBCTM allows for time-efficient inference by limiting multi-step refinement to cases where 1-step inference is insufficient. Our code, pretrained models, and audio samples are available at https://github.com/sony/sbctm/.
中文: 本文提出薛定谔桥一致性轨迹模型(SBCTM),通过将一致性轨迹技术与感知损失相结合,在提升语音质量的同时显著加速推理速度,相比传统方法实现了16倍的实时因子提升。
English: The paper introduces Schrödinger bridge Consistency Trajectory Models (SBCTM), which enhance speech quality and accelerate inference speed by integrating consistency trajectory techniques with perceptual loss, achieving a 16x improvement in real-time factor over traditional methods.
Authors:Juscimara G. Avelino, George D. C. Cavalcanti, Rafael M. O. Cruz
Abstract:
Imbalanced problems can arise in different real-world situations, and to address this, certain strategies in the form of resampling or balancing algorithms are proposed. This issue has largely been studied in the context of classification, and yet, the same problem features in regression tasks, where target values are continuous. This work presents an extensive experimental study comprising various balancing and predictive models, and wich uses metrics to capture important elements for the user and to evaluate the predictive model in an imbalanced regression data context. It also proposes a taxonomy for imbalanced regression approaches based on three crucial criteria: regression model, learning process, and evaluation metrics. The study offers new insights into the use of such strategies, highlighting the advantages they bring to each model's learning process, and indicating directions for further studies. The code, data and further information related to the experiments performed herein can be found on GitHub: https://github.com/JusciAvelino/imbalancedRegression.
中文摘要:本研究对不平衡回归问题中的平衡策略与预测模型进行了全面实验分析,提出了基于关键标准的分类体系,并为其应用及未来研究方向提供了新的见解。
English Summary: This study conducts a comprehensive experimental analysis of balancing strategies and predictive models for imbalanced regression tasks, proposing a taxonomy based on key criteria and providing insights into their application and future research directions.
Authors:Juscimara G. Avelino, George D. C. Cavalcanti, Rafael M. O. Cruz
Abstract:
Imbalanced problems are prevalent in various real-world scenarios and are extensively explored in classification tasks. However, they also present challenges for regression tasks due to the rarity of certain target values. A common alternative is to employ balancing algorithms in preprocessing to address dataset imbalance. However, due to the variety of resampling methods and learning models, determining the optimal solution requires testing many combinations. Furthermore, the learning model, dataset, and evaluation metric affect the best strategies. This work proposes the Meta-learning for Imbalanced Regression (Meta-IR) framework, which diverges from existing literature by training meta-classifiers to recommend the best pipeline composed of the resampling strategy and learning model per task in a zero-shot fashion. The meta-classifiers are trained using a set of meta-features to learn how to map the meta-features to the classes indicating the best pipeline. We propose two formulations: Independent and Chained. Independent trains the meta-classifiers to separately indicate the best learning algorithm and resampling strategy. Chained involves a sequential procedure where the output of one meta-classifier is used as input for another to model intrinsic relationship factors. The Chained scenario showed superior performance, suggesting a relationship between the learning algorithm and the resampling strategy per task. Compared with AutoML frameworks, Meta-IR obtained better results. Moreover, compared with baselines of six learning algorithms and six resampling algorithms plus no resampling, totaling 42 (6 X 7) configurations, Meta-IR outperformed all of them. The code, data, and further information of the experiments can be found on GitHub: https://github.com/JusciAvelino/Meta-IR.
Chinese: Meta-IR框架提出了一种新颖的元学习方法,通过训练元分类器以零样本方式为每个不平衡回归任务推荐最佳的重采样策略和学习模型组合,其性能优于传统AutoML框架和所有基线配置。
English: The Meta-IR framework introduces a novel meta-learning approach that uses meta-classifiers to recommend the optimal combination of resampling strategies and learning models for imbalanced regression tasks in a zero-shot manner, outperforming traditional AutoML frameworks and baseline configurations.
Authors:Wei Sun, Linhan Cao, Kang Fu, Dandan Zhu, Jun Jia, Menghan Hu, Xiongkuo Min, Guangtao Zhai
Abstract:
Video compression is a standard procedure applied to all videos to minimize storage and transmission demands while preserving visual quality as much as possible. Therefore, evaluating the visual quality of compressed videos is crucial for guiding the practical usage and further development of video compression algorithms. Although numerous compressed video quality assessment (VQA) methods have been proposed, they often lack the generalization capability needed to handle the increasing diversity of video types, particularly high dynamic range (HDR) content. In this paper, we introduce CompressedVQA-HDR, an effective VQA framework designed to address the challenges of HDR video quality assessment. Specifically, we adopt the Swin Transformer and SigLip 2 as the backbone networks for the proposed full-reference (FR) and no-reference (NR) VQA models, respectively. For the FR model, we compute deep structural and textural similarities between reference and distorted frames using intermediate-layer features extracted from the Swin Transformer as its quality-aware feature representation. For the NR model, we extract the global mean of the final-layer feature maps from SigLip 2 as its quality-aware representation. To mitigate the issue of limited HDR training data, we pre-train the FR model on a large-scale standard dynamic range (SDR) VQA dataset and fine-tune it on the HDRSDR-VQA dataset. For the NR model, we employ an iterative mixed-dataset training strategy across multiple compressed VQA datasets, followed by fine-tuning on the HDRSDR-VQA dataset. Experimental results show that our models achieve state-of-the-art performance compared to existing FR and NR VQA models. Moreover, CompressedVQA-HDR-FR won first place in the FR track of the Generalizable HDR & SDR Video Quality Measurement Grand Challenge at IEEE ICME 2025. The code is available at https://github.com/sunwei925/CompressedVQA-HDR.
中文: 本文提出CompressedVQA-HDR框架,分别采用Swin Transformer和SigLip 2架构构建全参考和无参考视频质量评估模型,通过创新的训练策略在压缩HDR视频质量评估中实现了最优性能。
English: This paper introduces CompressedVQA-HDR, a novel video quality assessment framework that employs Swin Transformer and SigLip 2 architectures for full-reference and no-reference models respectively, achieving state-of-the-art performance in evaluating compressed HDR videos through innovative training strategies.
Authors:Linwei Chen, Ying Fu, Lin Gu, Dezhi Zheng, Jifeng Dai
Abstract:
High spatial frequency information, including fine details like textures, significantly contributes to the accuracy of semantic segmentation. However, according to the Nyquist-Shannon Sampling Theorem, high-frequency components are vulnerable to aliasing or distortion when propagating through downsampling layers such as strided-convolution. Here, we propose a novel Spatial Frequency Modulation (SFM) that modulates high-frequency features to a lower frequency before downsampling and then demodulates them back during upsampling. Specifically, we implement modulation through adaptive resampling (ARS) and design a lightweight add-on that can densely sample the high-frequency areas to scale up the signal, thereby lowering its frequency in accordance with the Frequency Scaling Property. We also propose Multi-Scale Adaptive Upsampling (MSAU) to demodulate the modulated feature and recover high-frequency information through non-uniform upsampling This module further improves segmentation by explicitly exploiting information interaction between densely and sparsely resampled areas at multiple scales. Both modules can seamlessly integrate with various architectures, extending from convolutional neural networks to transformers. Feature visualization and analysis confirm that our method effectively alleviates aliasing while successfully retaining details after demodulation. Finally, we validate the broad applicability and effectiveness of SFM by extending it to image classification, adversarial robustness, instance segmentation, and panoptic segmentation tasks. The code is available at https://github.com/Linwei-Chen/SFM.
Chinese Summary: 提出的空间频率调制(SFM)方法通过在下采样前调制高频特征并在上采样时解调,有效保留语义分割中的细节信息,并在多种计算机视觉任务中验证了其广泛适用性。
English Summary: The proposed Spatial Frequency Modulation (SFM) method effectively preserves high-frequency details in semantic segmentation by modulating features before downsampling and demodulating them during upsampling, with demonstrated success across multiple computer vision tasks.
Authors:Bo Zeng, Chenyang Lyu, Sinuo Liu, Mingyan Zeng, Minghao Wu, Xuanfan Ni, Tianqi Shi, Yu Zhao, Yefeng Liu, Chenyu Zhu, Ruizhe Li, Jiahui Geng, Qing Li, Yu Tong, Longyue Wang, Weihua Luo, Kaifu Zhang
Abstract:
Instruction-following capability has become a major ability to be evaluated for Large Language Models (LLMs). However, existing datasets, such as IFEval, are either predominantly monolingual and centered on English or simply machine translated to other languages, limiting their applicability in multilingual contexts. In this paper, we present an carefully-curated extension of IFEval to a localized multilingual version named Marco-Bench-MIF, covering 30 languages with varying levels of localization. Our benchmark addresses linguistic constraints (e.g., modifying capitalization requirements for Chinese) and cultural references (e.g., substituting region-specific company names in prompts) via a hybrid pipeline combining translation with verification. Through comprehensive evaluation of 20+ LLMs on our Marco-Bench-MIF, we found that: (1) 25-35% accuracy gap between high/low-resource languages, (2) model scales largely impact performance by 45-60% yet persists script-specific challenges, and (3) machine-translated data underestimates accuracy by7-22% versus localized data. Our analysis identifies challenges in multilingual instruction following, including keyword consistency preservation and compositional constraint adherence across languages. Our Marco-Bench-MIF is available at https://github.com/AIDC-AI/Marco-Bench-MIF.
中文: 本文提出了Marco-Bench-MIF多语言基准,通过混合翻译验证流程将IFeval扩展至30种语言,研究发现高低资源语言间存在25-35%的性能差距,且机器翻译数据会低估模型7-22%的准确率。
English: This paper introduces Marco-Bench-MIF, a localized multilingual benchmark extending IFEval to 30 languages through a hybrid translation-verification pipeline, revealing significant performance gaps between high- and low-resource languages and demonstrating that machine-translated data underestimates model accuracy by 7-22% compared to culturally adapted data.
Authors:Peiwen Xia, Tangfei Liao, Wei Zhu, Danhuai Zhao, Jianjun Ke, Kaihao Zhang, Tong Lu, Tao Wang
Abstract:
Establishing reliable correspondences between image pairs is a fundamental task in computer vision, underpinning applications such as 3D reconstruction and visual localization. Although recent methods have made progress in pruning outliers from dense correspondence sets, they often hypothesize consistent visual domains and overlook the challenges posed by diverse scene structures. In this paper, we propose CorrMoE, a novel correspondence pruning framework that enhances robustness under cross-domain and cross-scene variations. To address domain shift, we introduce a De-stylization Dual Branch, performing style mixing on both implicit and explicit graph features to mitigate the adverse influence of domain-specific representations. For scene diversity, we design a Bi-Fusion Mixture of Experts module that adaptively integrates multi-perspective features through linear-complexity attention and dynamic expert routing. Extensive experiments on benchmark datasets demonstrate that CorrMoE achieves superior accuracy and generalization compared to state-of-the-art methods. The code and pre-trained models are available at https://github.com/peiwenxia/CorrMoE.
中文摘要:CorrMoE提出了一种新颖的对应关系筛选框架,通过去风格化双分支和双融合专家混合模块解决跨域跨场景挑战,在基准测试中展现出卓越的准确性和泛化能力。
English Summary: CorrMoE is a robust correspondence pruning framework that addresses cross-domain and cross-scene challenges through a de-stylization dual branch and bi-fusion mixture of experts module, achieving superior performance on benchmark datasets.
Authors:Haoxuan Zhang, Ruochi Li, Yang Zhang, Ting Xiao, Jiangping Chen, Junhua Ding, Haihua Chen
Abstract:
Scientific innovation is undergoing a paradigm shift driven by the rapid advancement of Large Language Models (LLMs). As science faces mounting challenges including information overload, disciplinary silos, and diminishing returns on conventional research methods, LLMs are emerging as powerful agents capable not only of enhancing scientific workflows but also of participating in and potentially leading the innovation process. Existing surveys mainly focus on different perspectives, phrases, and tasks in scientific research and discovery, while they have limitations in understanding the transformative potential and role differentiation of LLM. This survey proposes a comprehensive framework to categorize the evolving roles of LLMs in scientific innovation across three hierarchical levels: Evaluator, Collaborator, and Scientist. We distinguish between LLMs' contributions to structured scientific research processes and open-ended scientific discovery, thereby offering a unified taxonomy that clarifies capability boundaries, evaluation criteria, and human-AI interaction patterns at each level. Through an extensive analysis of current methodologies, benchmarks, systems, and evaluation metrics, this survey delivers an in-depth and systematic synthesis on LLM-driven scientific innovation. We present LLMs not only as tools for automating existing processes, but also as catalysts capable of reshaping the epistemological foundations of science itself. This survey offers conceptual clarity, practical guidance, and theoretical foundations for future research, while also highlighting open challenges and ethical considerations in the pursuit of increasingly autonomous AI-driven science. Resources related to this survey can be accessed on GitHub at: https://github.com/haoxuan-unt2024/llm4innovation.
中文: 本综述提出将大语言模型在科学创新中的角色划分为评估者、协作者和科学家三个层次,强调其不仅是自动化工具更是重塑科学认识论的催化剂,同时探讨了相关挑战与伦理问题。
English: This survey introduces a hierarchical framework categorizing LLMs' roles in scientific innovation as Evaluator, Collaborator, and Scientist, highlighting their transformative potential beyond automation to reshape scientific epistemology while addressing challenges and ethical considerations.
Authors:Ruofan Hu, Dongyu Zhang, Huayi Zhang, Elke Rundensteiner
Abstract:
Learning with noisy labels (LNL) is essential for training deep neural networks with imperfect data. Meta-learning approaches have achieved success by using a clean unbiased labeled set to train a robust model. However, this approach heavily depends on the availability of a clean labeled meta-dataset, which is difficult to obtain in practice. In this work, we thus tackle the challenge of meta-learning for noisy label scenarios without relying on a clean labeled dataset. Our approach leverages the data itself while bypassing the need for labels. Building on the insight that clean samples effectively preserve the consistency of related data structures across the last hidden and the final layer, whereas noisy samples disrupt this consistency, we design the Cross-layer Information Divergence-based Meta Update Strategy (CLID-MU). CLID-MU leverages the alignment of data structures across these diverse feature spaces to evaluate model performance and use this alignment to guide training. Experiments on benchmark datasets with varying amounts of labels under both synthetic and real-world noise demonstrate that CLID-MU outperforms state-of-the-art methods. The code is released at https://github.com/ruofanhu/CLID-MU.
Chinese: 本文提出CLID-MU方法,通过利用跨层数据结构一致性在无需干净标注数据的情况下训练噪声标签的鲁棒模型,在基准数据集上超越了现有最优方法。
English: This paper introduces CLID-MU, a meta-learning method that trains robust models on noisy labels by leveraging cross-layer data structure consistency without requiring clean labeled data, and it outperforms existing techniques on benchmark datasets.
Authors:Hendrik KraÃ, Ju Huang, Seyed Mohamad Moosavi
Abstract:
Universal machine learning interatomic potentials (uMLIPs) have emerged as powerful tools for accelerating atomistic simulations, offering scalable and efficient modeling with accuracy close to quantum calculations. However, their reliability and effectiveness in practical, real-world applications remain an open question. Metal-organic frameworks (MOFs) and related nanoporous materials are highly porous crystals with critical relevance in carbon capture, energy storage, and catalysis applications. Modeling nanoporous materials presents distinct challenges for uMLIPs due to their diverse chemistry, structural complexity, including porosity and coordination bonds, and the absence from existing training datasets. Here, we introduce MOFSimBench, a benchmark to evaluate uMLIPs on key materials modeling tasks for nanoporous materials, including structural optimization, molecular dynamics (MD) stability, the prediction of bulk properties, such as bulk modulus and heat capacity, and guest-host interactions. Evaluating over 20 models from various architectures on a chemically and structurally diverse materials set, we find that top-performing uMLIPs consistently outperform classical force fields and fine-tuned machine learning potentials across all tasks, demonstrating their readiness for deployment in nanoporous materials modeling. Our analysis highlights that data quality, particularly the diversity of training sets and inclusion of out-of-equilibrium conformations, plays a more critical role than model architecture in determining performance across all evaluated uMLIPs. We release our modular and extendable benchmarking framework at https://github.com/AI4ChemS/mofsim-bench, providing an open resource to guide the adoption for nanoporous materials modeling and further development of uMLIPs.
中文: 通用机器学习原子间势在纳米多孔材料建模中表现出优于传统力场和微调模型的性能,其中数据质量比模型架构对可靠性更为关键。
English: Universal machine learning interatomic potentials (uMLIPs) demonstrate superior performance over classical force fields and fine-tuned models across key nanoporous materials modeling tasks, with data quality proving more crucial than model architecture for reliability.
Authors:Nak-Jun Sung, Jun Ma, TaeHeon Kim, Yoo-joo Choi, Min-Hyung Choi, Min Hong
Abstract:
This study explores the capabilities of WebGPU, an emerging web graphics paradigm, for real-time cloth simulation. Traditional WebGL-based methods have been in handling complex physical simulations due to their emphasis on graphics rendering rather than general-purpose GPU (GPGPU) operations. WebGPU, designed to provide modern 3D graphics and computational capabilities, offers significant improvements through parallel processing and support for computational shaders. In this work, we implemented a cloth simulation system using the Mass-Spring Method within the WebGPU framework, integrating collision detection and response handling with the 3D surface model. First, comparative performance evaluations demonstrate that WebGPU substantially outperforms WebGL, particularly in high-resolution simulations, maintaining 60 frames per second (fps) even with up to 640K nodes. The second experiment aimed to determine the real-time limitations of WebGPU and confirmed that WebGPU can handle real-time collisions between 4K and 100k cloth node models and a 100K triangle surface model in real-time. These experiments also highlight the importance of balancing real-time performance with realistic rendering when handling collisions between cloth models and complex 3D objects. Our source code is available at https://github.com/nakjun/Cloth-Simulation-WebGPU
中文: 本研究证明WebGPU在实时布料模拟中显著优于WebGL,可在64万节点下保持60帧/秒,并能实时处理复杂布料与表面模型之间的碰撞。
English: This study demonstrates that WebGPU significantly outperforms WebGL in real-time cloth simulation, maintaining 60 fps with 640K nodes and handling real-time collisions between complex cloth and surface models.
Authors:Ivan Viakhirev, Daniil Sirota, Aleksandr Smirnov, Kirill Borodin
Abstract:
Advances in voice conversion and text-to-speech synthesis have made automatic speaker verification (ASV) systems more susceptible to spoofing attacks. This work explores modest refinements to the AASIST anti-spoofing architecture. It incorporates a frozen Wav2Vec 2.0 encoder to retain self-supervised speech representations in limited-data settings, substitutes the original graph attention block with a standardized multi-head attention module using heterogeneous query projections, and replaces heuristic frame-segment fusion with a trainable, context-aware integration layer. When evaluated on the ASVspoof 5 corpus, the proposed system reaches a 7.6\% equal error rate (EER), improving on a re-implemented AASIST baseline under the same training conditions. Ablation experiments suggest that each architectural change contributes to the overall performance, indicating that targeted adjustments to established models may help strengthen speech deepfake detection in practical scenarios. The code is publicly available at https://github.com/KORALLLL/AASIST_SCALING.
Chinese: 本研究通过引入冻结的Wav2Vec 2.0编码器、用异质查询的多头注意力替换图注意力机制,并采用可训练的融合层来改进AASIST反欺骗模型,在ASVspoof 5语料库上实现7.6%等错误率,验证了各模块优化对提升语音深度伪造检测效果的有效性。
English: This study enhances the AASIST anti-spoofing model by integrating a frozen Wav2Vec 2.0 encoder, replacing graph attention with multi-head attention using heterogeneous queries, and implementing a trainable fusion layer, achieving a 7.6% EER on the ASVspoof 5 corpus and demonstrating that each modification contributes to improved speech deepfake detection.
Authors:Jay Revolinsky, Harry Shomer, Jiliang Tang
Abstract:
Graphs Neural Networks (GNNs) demonstrate high-performance on the link prediction (LP) task. However, these models often rely on all dataset samples being drawn from the same distribution. In addition, graph generative models (GGMs) show a pronounced ability to generate novel output graphs. Despite this, GGM applications remain largely limited to domain-specific tasks. To bridge this gap, we propose FLEX as a GGM framework which leverages two mechanism: (1) structurally-conditioned graph generation, and (2) adversarial co-training between an auto-encoder and GNN. As such, FLEX ensures structural-alignment between sample distributions to enhance link-prediction performance in out-of-distribution (OOD) scenarios. Notably, FLEX does not require expert knowledge to function in different OOD scenarios. Numerous experiments are conducted in synthetic and real-world OOD settings to demonstrate FLEX's performance-enhancing ability, with further analysis for understanding the effects of graph data augmentation on link structures. The source code is available here: https://github.com/revolins/FlexOOD.
中文:提出的FLEX框架通过结构条件图生成与对抗协同训练相结合,无需专家知识即可提升分布外场景下的链接预测性能。
English: The proposed FLEX framework enhances link prediction in out-of-distribution scenarios by combining structurally-conditioned graph generation with adversarial co-training, eliminating the need for expert knowledge while improving performance.
Authors:Stylianos Savva
Abstract:
We present a norm-stabilized imaginary-time evolution (ITE) scheme for the one-dimensional nonlinear Schrodinger equation (NLSE). Traditional ITE solvers often require explicit renormalization of the wavefunction after each step to preserve norm, which can be disruptive and algorithmically inflexible. We propose an alternative approach in which the evolution is continuously stabilized using an adaptive feedback term mu(tau), proportional to the time derivative of the wavefunction norm. This results in a self-regulating flow that requires no external normalization while preserving convergence toward soliton solutions. We demonstrate the method's effectiveness by comparing the final wavefunction profiles and L2 errors against analytical solutions and baseline methods without feedback. Although this work focuses on the 1D case, the framework is designed to extend naturally to higher dimensions. Future work will explore the behavior of the feedback mechanism in 2D and 3D systems, multi-soliton scenarios, and external potentials.
中文: 本文针对一维非线性薛定谔方程提出了一种范数稳定虚时演化方法,通过自适应反馈项无需外部重归一化即可保持波函数范数,同时确保收敛至孤子解。
English: This paper introduces a norm-stabilized imaginary-time evolution method for the 1D nonlinear Schrödinger equation, using an adaptive feedback term to maintain wavefunction norm without external renormalization while ensuring convergence to soliton solutions.
Authors:Moises Andrade, Joonhyuk Cha, Brandon Ho, Vriksha Srihari, Karmesh Yadav, Zsolt Kira
Abstract:
Verifiers -- functions assigning rewards to agent behavior -- have been key for AI progress in domains like math and board games. However, extending these gains to domains without clear-cut success criteria (e.g.,computer use) remains a challenge: while humans can recognize suitable outcomes, translating this intuition into scalable rules is non-trivial. Multimodal Large Language Models(MLLMs) emerge as a promising solution, given their world knowledge, human-preference alignment, and reasoning skills. We evaluate MLLMs as verifiers of agent trajectories across web navigation, computer use, and robotic manipulation, and identify a critical limitation: agreement bias, a strong tendency for MLLMs to favor information in their context window, often generating chains of thought to rationalize flawed behavior. This bias is pervasive across models, resilient to test-time scaling, and can impact several methods using MLLMs as evaluators (e.g.,data filtering). Notably, it occurs despite MLLMs showing strong, human-aligned priors on desired behavior. To address this, we propose Self-Grounded Verification (SGV), a lightweight method that enables more effective use of MLLMs' knowledge and reasoning by harnessing their own sampling mechanisms via unconditional and conditional generation. SGV operates in two steps: first, the MLLM is elicited to retrieve broad priors about task completion, independent of the data under evaluation. Then, conditioned on self-generated priors, it reasons over and evaluates a candidate trajectory. Enhanced with SGV, MLLM verifiers show gains of up to 20 points in accuracy and failure detection rates, and can perform real-time supervision of heterogeneous agents, boosting task completion of a GUI specialist in OSWorld, a diffusion policy in robomimic, and a ReAct agent in VisualWebArena -- setting a new state of the art on the benchmark, surpassing the previous best by 48%.
Chinese Summary: 多模态大语言模型在验证智能体行为方面潜力显著,但存在认同偏差问题;通过提出的自基础验证方法,该问题得到有效解决,大幅提升了模型在多项任务中的准确性和表现。
English Summary: Multimodal Large Language Models (MLLMs) show promise as verifiers for agent behavior but suffer from agreement bias, which is addressed by the proposed Self-Grounded Verification method that significantly improves their accuracy and performance across various tasks.
Authors:Benjamin Keel, Aaron Quyn, David Jayne, Maryam Mohsin, Samuel D. Relton
Abstract:
Effective treatment for rectal cancer relies on accurate lymph node metastasis (LNM) staging. However, radiological criteria based on lymph node (LN) size, shape and texture morphology have limited diagnostic accuracy. In this work, we investigate applying a Variational Autoencoder (VAE) as a feature encoder model to replace the large pre-trained Convolutional Neural Network (CNN) used in existing approaches. The motivation for using a VAE is that the generative model aims to reconstruct the images, so it directly encodes visual features and meaningful patterns across the data. This leads to a disentangled and structured latent space which can be more interpretable than a CNN. Models are deployed on an in-house MRI dataset with 168 patients who did not undergo neo-adjuvant treatment. The post-operative pathological N stage was used as the ground truth to evaluate model predictions. Our proposed model 'VAE-MLP' achieved state-of-the-art performance on the MRI dataset, with cross-validated metrics of AUC 0.86 +/- 0.05, Sensitivity 0.79 +/- 0.06, and Specificity 0.85 +/- 0.05. Code is available at: https://github.com/benkeel/Lymph_Node_Classification_MIUA.
中文: 本研究提出了一种VAE-MLP模型,利用变分自编码器进行特征编码以改进直肠癌淋巴结转移分期,在MRI数据集上取得了最佳性能,AUC达到0.86。
English: This study introduces a VAE-MLP model that uses a variational autoencoder for feature encoding to improve lymph node metastasis staging in rectal cancer, achieving state-of-the-art performance with an AUC of 0.86 on an MRI dataset.
Authors:Steven Dillmann, Juan Rafael MartÃnez-Galarza
Abstract:
Event time series are sequences of discrete events occurring at irregular time intervals, each associated with a domain-specific observational modality. They are common in domains such as high-energy astrophysics, computational social science, cybersecurity, finance, healthcare, neuroscience, and seismology. Their unstructured and irregular structure poses significant challenges for extracting meaningful patterns and identifying salient phenomena using conventional techniques. We propose novel two- and three-dimensional tensor representations for event time series, coupled with sparse autoencoders that learn physically meaningful latent representations. These embeddings support a variety of downstream tasks, including anomaly detection, similarity-based retrieval, semantic clustering, and unsupervised classification. We demonstrate our approach on a real-world dataset from X-ray astronomy, showing that these representations successfully capture temporal and spectral signatures and isolate diverse classes of X-ray transients. Our framework offers a flexible, scalable, and generalizable solution for analyzing complex, irregular event time series across scientific and industrial domains.
中文摘要:本文针对不规则事件时间序列提出了新型张量表示和稀疏自编码器方法,通过X射线天文数据验证了其在异常检测和分类任务中的有效性,为跨领域复杂数据分析提供了通用解决方案。
English Summary: The paper introduces novel tensor representations and sparse autoencoders to analyze irregular event time series, enabling effective anomaly detection and classification across various domains as demonstrated with X-ray astronomy data.
Authors:Sandeep Suresh Cranganore, Andrei Bodnar, Arturs Berzins, Johannes Brandstetter
Abstract:
We introduce Einstein Fields, a neural representation that is designed to compress computationally intensive four-dimensional numerical relativity simulations into compact implicit neural network weights. By modeling the \emph{metric}, which is the core tensor field of general relativity, Einstein Fields enable the derivation of physical quantities via automatic differentiation. However, unlike conventional neural fields (e.g., signed distance, occupancy, or radiance fields), Einstein Fields are \emph{Neural Tensor Fields} with the key difference that when encoding the spacetime geometry of general relativity into neural field representations, dynamics emerge naturally as a byproduct. Einstein Fields show remarkable potential, including continuum modeling of 4D spacetime, mesh-agnosticity, storage efficiency, derivative accuracy, and ease of use. We address these challenges across several canonical test beds of general relativity and release an open source JAX-based library, paving the way for more scalable and expressive approaches to numerical relativity. Code is made available at https://github.com/AndreiB137/EinFields
中文: 爱因斯坦场是一种神经张量场表示法,将复杂的四维数值相对论模拟压缩为紧凑的神经网络权重,通过自动微分精确推导物理量,同时自然地捕捉时空动力学。
English: Einstein Fields is a neural tensor field representation that compresses complex 4D numerical relativity simulations into compact neural network weights, enabling accurate derivation of physical quantities through automatic differentiation while naturally capturing spacetime dynamics.
Authors:Hanxue Gu, Yaqian Chen, Nicholas Konz, Qihang Li, Maciej A. Mazurowski
Abstract:
Foundation models, pre-trained on large image datasets and capable of capturing rich feature representations, have recently shown potential for zero-shot image registration. However, their performance has mostly been tested in the context of rigid or less complex structures, such as the brain or abdominal organs, and it remains unclear whether these models can handle more challenging, deformable anatomy. Breast MRI registration is particularly difficult due to significant anatomical variation between patients, deformation caused by patient positioning, and the presence of thin and complex internal structure of fibroglandular tissue, where accurate alignment is crucial. Whether foundation model-based registration algorithms can address this level of complexity remains an open question. In this study, we provide a comprehensive evaluation of foundation model-based registration algorithms for breast MRI. We assess five pre-trained encoders, including DINO-v2, SAM, MedSAM, SSLSAM, and MedCLIP, across four key breast registration tasks that capture variations in different years and dates, sequences, modalities, and patient disease status (lesion versus no lesion). Our results show that foundation model-based algorithms such as SAM outperform traditional registration baselines for overall breast alignment, especially under large domain shifts, but struggle with capturing fine details of fibroglandular tissue. Interestingly, additional pre-training or fine-tuning on medical or breast-specific images in MedSAM and SSLSAM, does not improve registration performance and may even decrease it in some cases. Further work is needed to understand how domain-specific training influences registration and to explore targeted strategies that improve both global alignment and fine structure accuracy. We also publicly release our code at \href{https://github.com/mazurowski-lab/Foundation-based-reg}{Github}.
中文: 基础模型在零样本乳腺MRI配准中展现出潜力,能在大域偏移下超越传统方法的整体对齐效果,但难以精确捕捉纤维腺体组织的细微结构,且特定领域的训练并未持续提升性能。
English: Foundation models show promise for zero-shot breast MRI registration by outperforming traditional methods in overall alignment under large domain shifts, but they struggle with accurately capturing fine fibroglandular tissue details, and domain-specific training does not consistently enhance performance.
Authors:Zejian Li, Yize Li, Chenye Meng, Zhongni Liu, Yang Ling, Shengyuan Zhang, Guang Yang, Changyuan Yang, Zhiyuan Yang, Lingyun Sun
Abstract:
Recent advancements in diffusion models (DMs) have been propelled by alignment methods that post-train models to better conform to human preferences. However, these approaches typically require computation-intensive training of a base model and a reward model, which not only incurs substantial computational overhead but may also compromise model accuracy and training efficiency. To address these limitations, we propose Inversion-DPO, a novel alignment framework that circumvents reward modeling by reformulating Direct Preference Optimization (DPO) with DDIM inversion for DMs. Our method conducts intractable posterior sampling in Diffusion-DPO with the deterministic inversion from winning and losing samples to noise and thus derive a new post-training paradigm. This paradigm eliminates the need for auxiliary reward models or inaccurate appromixation, significantly enhancing both precision and efficiency of training. We apply Inversion-DPO to a basic task of text-to-image generation and a challenging task of compositional image generation. Extensive experiments show substantial performance improvements achieved by Inversion-DPO compared to existing post-training methods and highlight the ability of the trained generative models to generate high-fidelity compositionally coherent images. For the post-training of compostitional image geneation, we curate a paired dataset consisting of 11,140 images with complex structural annotations and comprehensive scores, designed to enhance the compositional capabilities of generative models. Inversion-DPO explores a new avenue for efficient, high-precision alignment in diffusion models, advancing their applicability to complex realistic generation tasks. Our code is available at https://github.com/MIGHTYEZ/Inversion-DPO
中文: Inversion-DPO提出了一种新颖的对齐框架,通过将DDIM反演与直接偏好优化相结合,绕过了奖励建模,在文本到图像和组合图像生成等任务中显著提升了扩散模型的训练精度和效率。
English: Inversion-DPO introduces a novel alignment framework that bypasses reward modeling by integrating DDIM inversion with Direct Preference Optimization, enhancing training precision and efficiency for diffusion models in tasks like text-to-image and compositional image generation.
Authors:Ann-Kathrin Dombrowski, Dillon Bowen, Adam Gleave, Chris Cundy
Abstract:
Open-weight large language models (LLMs) unlock huge benefits in innovation, personalization, privacy, and democratization. However, their core advantage - modifiability - opens the door to systemic risks: bad actors can trivially subvert current safeguards, turning beneficial models into tools for harm. This leads to a 'safety gap': the difference in dangerous capabilities between a model with intact safeguards and one that has been stripped of those safeguards. We open-source a toolkit to estimate the safety gap for state-of-the-art open-weight models. As a case study, we evaluate biochemical and cyber capabilities, refusal rates, and generation quality of models from two families (Llama-3 and Qwen-2.5) across a range of parameter scales (0.5B to 405B) using different safeguard removal techniques. Our experiments reveal that the safety gap widens as model scale increases and effective dangerous capabilities grow substantially when safeguards are removed. We hope that the Safety Gap Toolkit (https://github.com/AlignmentResearch/safety-gap) will serve as an evaluation framework for common open-source models and as a motivation for developing and testing tamper-resistant safeguards. We welcome contributions to the toolkit from the community.
开源权重大语言模型虽带来诸多益处,但其可修改性也导致系统性风险——恶意行为者可轻易绕过安全措施形成安全缺口,我们的开源工具包通过评估发现模型规模越大该缺口越宽。
Open-weight LLMs offer significant benefits but pose systemic risks as their modifiability allows easy removal of safeguards, creating a safety gap that widens with model scale, which we measure using our open-source toolkit.
Authors:Dong Zhuo, Wenzhao Zheng, Jiahe Guo, Yuqi Wu, Jie Zhou, Jiwen Lu
Abstract:
Perceiving and reconstructing 4D spatial-temporal geometry from videos is a fundamental yet challenging computer vision task. To facilitate interactive and real-time applications, we propose a streaming 4D visual geometry transformer that shares a similar philosophy with autoregressive large language models. We explore a simple and efficient design and employ a causal transformer architecture to process the input sequence in an online manner. We use temporal causal attention and cache the historical keys and values as implicit memory to enable efficient streaming long-term 4D reconstruction. This design can handle real-time 4D reconstruction by incrementally integrating historical information while maintaining high-quality spatial consistency. For efficient training, we propose to distill knowledge from the dense bidirectional visual geometry grounded transformer (VGGT) to our causal model. For inference, our model supports the migration of optimized efficient attention operator (e.g., FlashAttention) from the field of large language models. Extensive experiments on various 4D geometry perception benchmarks demonstrate that our model increases the inference speed in online scenarios while maintaining competitive performance, paving the way for scalable and interactive 4D vision systems. Code is available at: https://github.com/wzzheng/StreamVGGT.
中文: 本文提出了一种流式4D视觉几何变换器,通过因果注意力和知识蒸馏技术,实现了从视频中进行实时高质量的4D重建,在保持竞争力的同时显著提升了交互应用的推理速度。
English: This paper introduces a streaming 4D visual geometry transformer that enables real-time, high-quality 4D reconstruction from videos by using causal attention and knowledge distillation, achieving competitive performance with faster inference for interactive applications.
Authors:Mengyu Wang, Henghui Ding, Jianing Peng, Yao Zhao, Yunpeng Chen, Yunchao Wei
Abstract:
In text-to-image generation, producing a series of consistent contents that preserve the same identity is highly valuable for real-world applications. Although a few works have explored training-free methods to enhance the consistency of generated subjects, we observe that they suffer from the following problems. First, they fail to maintain consistent background details, which limits their applicability. Furthermore, when the foreground character undergoes large motion variations, inconsistencies in identity and clothing details become evident. To address these problems, we propose CharaConsist, which employs point-tracking attention and adaptive token merge along with decoupled control of the foreground and background. CharaConsist enables fine-grained consistency for both foreground and background, supporting the generation of one character in continuous shots within a fixed scene or in discrete shots across different scenes. Moreover, CharaConsist is the first consistent generation method tailored for text-to-image DiT model. Its ability to maintain fine-grained consistency, combined with the larger capacity of latest base model, enables it to produce high-quality visual outputs, broadening its applicability to a wider range of real-world scenarios. The source code has been released at https://github.com/Murray-Wang/CharaConsist
中文摘要:CharaConsist方法通过采用点跟踪注意力和自适应令牌合并技术,解决了文本生成图像中主体一致性问题,实现了前景和背景的精细一致性控制,适用于固定场景连续镜头和不同场景离散镜头的角色生成。
English Summary: The proposed CharaConsist method addresses subject consistency issues in text-to-image generation by implementing point-tracking attention and adaptive token merging, enabling fine-grained foreground and background consistency across continuous or discrete scenes.
Authors:Yinsheng Li, Zhen Dong, Yi Shao
Abstract:
Large Language Model (LLM) agents have shown great potential for solving real-world problems and promise to be a solution for tasks automation in industry. However, more benchmarks are needed to systematically evaluate automation agents from an industrial perspective, for example, in Civil Engineering. Therefore, we propose DrafterBench for the comprehensive evaluation of LLM agents in the context of technical drawing revision, a representation task in civil engineering. DrafterBench contains twelve types of tasks summarized from real-world drawing files, with 46 customized functions/tools and 1920 tasks in total. DrafterBench is an open-source benchmark to rigorously test AI agents' proficiency in interpreting intricate and long-context instructions, leveraging prior knowledge, and adapting to dynamic instruction quality via implicit policy awareness. The toolkit comprehensively assesses distinct capabilities in structured data comprehension, function execution, instruction following, and critical reasoning. DrafterBench offers detailed analysis of task accuracy and error statistics, aiming to provide deeper insight into agent capabilities and identify improvement targets for integrating LLMs in engineering applications. Our benchmark is available at https://github.com/Eason-Li-AIS/DrafterBench, with the test set hosted at https://huggingface.co/datasets/Eason666/DrafterBench.
中文: DrafterBench是一个开源基准测试,通过1920个真实世界土木工程任务全面评估大语言模型代理在技术图纸修订中的指令解析、函数执行和推理能力,旨在为工程应用提供改进方向。
English: DrafterBench is an open-source benchmark designed to comprehensively evaluate LLM agents' capabilities in technical drawing revision tasks, assessing their proficiency in instruction interpretation, function execution, and reasoning through 1920 real-world civil engineering tasks.
Authors:Harsha Kokel, Aamod Khatiwada, Tejaswini Pedapati, Haritha Ananthakrishnan, Oktie Hassanzadeh, Horst Samulowitz, Kavitha Srinivas
Abstract:
One of the major challenges in enterprise data analysis is the task of finding joinable tables that are conceptually related and provide meaningful insights. Traditionally, joinable tables have been discovered through a search for similar columns, where two columns are considered similar syntactically if there is a set overlap or they are considered similar semantically if either the column embeddings or value embeddings are closer in the embedding space. However, for enterprise data lakes, column similarity is not sufficient to identify joinable columns and tables. The context of the query column is important. Hence, in this work, we first define context-aware column joinability. Then we propose a multi-criteria approach, called TOPJoin, for joinable column search. We evaluate TOPJoin against existing join search baselines over one academic and one real-world join search benchmark. Through experiments, we find that TOPJoin performs better on both benchmarks than the baselines.
Chinese Summary: 摘要提出了TOPJoin方法,通过引入上下文感知的列连接性定义,克服了传统基于列相似性的连接表发现方法的局限,并在学术和实际应用基准测试中展现出优于现有技术的性能。
English Summary: The abstract introduces TOPJoin, a multi-criteria approach that addresses the limitations of traditional joinable table discovery by incorporating context-aware column joinability, demonstrating superior performance over existing methods on academic and real-world benchmarks.
Authors:Christian Daniele, Silvia Villa, Samuel Vaiter, Luca Calatroni
Abstract:
Deep Equilibrium Models (DEQs) are implicit neural networks with fixed points, which have recently gained attention for learning image regularization functionals, particularly in settings involving Gaussian fidelities, where assumptions on the forward operator ensure contractiveness of standard (proximal) Gradient Descent operators. In this work, we extend the application of DEQs to Poisson inverse problems, where the data fidelity term is more appropriately modeled by the Kullback-Leibler divergence. To this end, we introduce a novel DEQ formulation based on Mirror Descent defined in terms of a tailored non-Euclidean geometry that naturally adapts with the structure of the data term. This enables the learning of neural regularizers within a principled training framework. We derive sufficient conditions to guarantee the convergence of the learned reconstruction scheme and propose computational strategies that enable both efficient training and fully parameter-free inference. Numerical experiments show that our method outperforms traditional model-based approaches and it is comparable to the performance of Bregman Plug-and-Play methods, while mitigating their typical drawbacks - namely, sensitivity to initialization and careful tuning of hyperparameters. The code is publicly available at https://github.com/christiandaniele/DEQ-MD.
中文: 本研究将深度平衡模型扩展应用于泊松逆问题,通过引入基于镜像下降的新颖公式并利用定制化的非欧几何结构,实现了高效学习神经正则化器,保证收敛性,在超越传统方法的同时减轻了对初始化和超参数调优的敏感性。
English: This study extends Deep Equilibrium Models (DEQs) to Poisson inverse problems by introducing a novel Mirror Descent-based formulation that leverages tailored non-Euclidean geometry, enabling efficient learning of neural regularizers with guaranteed convergence and outperforming traditional approaches while mitigating sensitivity to initialization and hyperparameters.
Authors:Shuo Yang, Zixin Zhang, John Z. Zhang, Ibrahima Sory Sow, Zachary Manchester
Abstract:
This paper presents a state-estimation solution for legged robots that uses a set of low-cost, compact, and lightweight sensors to achieve low-drift pose and velocity estimation under challenging locomotion conditions. The key idea is to leverage multiple inertial measurement units on different links of the robot to correct a major error source in standard proprioceptive odometry. We fuse the inertial sensor information and joint encoder measurements in an extended Kalman filter, then combine the velocity estimate from this filter with camera data in a factor-graph-based sliding-window estimator to form a visual-inertial-leg odometry method. We validate our state estimator through comprehensive theoretical analysis and hardware experiments performed using real-world robot data collected during a variety of challenging locomotion tasks. Our algorithm consistently achieves minimal position deviation, even in scenarios involving substantial ground impact, foot slippage, and sudden body rotations. A C++ implementation, along with a large-scale dataset, is available at https://github.com/ShuoYangRobotics/Cerberus2.0.
中文: 本文提出了一种低成本的多传感器状态估计方法,通过融合多个惯性测量单元和关节编码器数据,使腿式机器人即使在冲击和打滑等挑战性条件下也能实现精确的位姿与速度估计。
English: This paper introduces a low-cost, multi-sensor state-estimation method for legged robots that combines inertial data and joint measurements to achieve accurate pose and velocity tracking even under challenging conditions like impacts and slippage.
Authors:Kaif Shaikh, Franziska Boenisch, Adam Dziedzic
Abstract:
Vision AutoRegressive model (VAR) was recently introduced as an alternative to Diffusion Models (DMs) in image generation domain. In this work we focus on its adaptations, which aim to fine-tune pre-trained models to perform specific downstream tasks, like medical data generation. While for DMs there exist many techniques, adaptations for VAR remain underexplored. Similarly, differentially private (DP) adaptations-ones that aim to preserve privacy of the adaptation data-have been extensively studied for DMs, while VAR lacks such solutions. In our work, we implement and benchmark many strategies for VAR, and compare them to state-of-the-art DM adaptation strategies. We observe that VAR outperforms DMs for non-DP adaptations, however, the performance of DP suffers, which necessitates further research in private adaptations for VAR. Code is available at https://github.com/sprintml/finetuning_var_dp.
Chinese: VAR模型在非隐私适应方面优于扩散模型,但在差分隐私适应中表现不佳,亟需进一步研究以提升其隐私保护性能。
English: The VAR model surpasses DMs in non-private adaptations but requires further research for effective differentially private implementations, as current DP strategies underperform.
Authors:Hongbo Ye, Fenghe Tang, Peiang Zhao, Zhen Huang, Dexin Zhao, Minghao Bian, S. Kevin Zhou
Abstract:
Achieving equity in healthcare accessibility requires lightweight yet high-performance solutions for medical image segmentation, particularly in resource-limited settings. Existing methods like U-Net and its variants often suffer from limited global Effective Receptive Fields (ERFs), hindering their ability to capture long-range dependencies. To address this, we propose U-RWKV, a novel framework leveraging the Recurrent Weighted Key-Value(RWKV) architecture, which achieves efficient long-range modeling at O(N) computational cost. The framework introduces two key innovations: the Direction-Adaptive RWKV Module(DARM) and the Stage-Adaptive Squeeze-and-Excitation Module(SASE). DARM employs Dual-RWKV and QuadScan mechanisms to aggregate contextual cues across images, mitigating directional bias while preserving global context and maintaining high computational efficiency. SASE dynamically adapts its architecture to different feature extraction stages, balancing high-resolution detail preservation and semantic relationship capture. Experiments demonstrate that U-RWKV achieves state-of-the-art segmentation performance with high computational efficiency, offering a practical solution for democratizing advanced medical imaging technologies in resource-constrained environments. The code is available at https://github.com/hbyecoding/U-RWKV.
中文: U-RWKV提出了一种基于RWKV架构的新型医学图像分割框架,通过DARM和SASE模块以O(N)计算成本高效捕获长程依赖关系,在资源有限环境中实现了最先进的性能,为医疗公平性提供了实用解决方案。
English: U-RWKV introduces a novel medical image segmentation framework using the RWKV architecture with DARM and SASE modules to efficiently capture long-range dependencies at O(N) computational cost, achieving state-of-the-art performance for equitable healthcare accessibility in resource-limited settings.
Authors:Pierrick Leroy, Antonio Mastropietro, Marco Nurisso, Francesco Vaccarino
Abstract:
Face Recognition (FR) tasks have made significant progress with the advent of Deep Neural Networks, particularly through margin-based triplet losses that embed facial images into high-dimensional feature spaces. During training, these contrastive losses focus exclusively on identity information as labels. However, we observe a multiscale geometric structure emerging in the embedding space, influenced by interpretable facial (e.g., hair color) and image attributes (e.g., contrast). We propose a geometric approach to describe the dependence or invariance of FR models to these attributes and introduce a physics-inspired alignment metric. We evaluate the proposed metric on controlled, simplified models and widely used FR models fine-tuned with synthetic data for targeted attribute augmentation. Our findings reveal that the models exhibit varying degrees of invariance across different attributes, providing insight into their strengths and weaknesses and enabling deeper interpretability. Code available here: https://github.com/mantonios107/attrs-fr-embs}{https://github.com/mantonios107/attrs-fr-embs
中文: 本研究提出了一种几何方法和物理启发的度量标准,用于分析人脸识别模型对可解释的面部和图像属性的响应,揭示了模型在不同属性上的不变性差异,从而提升了模型的可解释性。
English: This study introduces a geometric approach and a physics-inspired metric to analyze how face recognition models respond to interpretable facial and image attributes, revealing varying levels of invariance across attributes and enhancing model interpretability.
Authors:Jianfei Jiang, Qiankun Liu, Haochen Yu, Hongyuan Liu, Liyong Wang, Jiansheng Chen, Huimin Ma
Abstract:
Learning-based Multi-View Stereo (MVS) methods aim to predict depth maps for a sequence of calibrated images to recover dense point clouds. However, existing MVS methods often struggle with challenging regions, such as textureless regions and reflective surfaces, where feature matching fails. In contrast, monocular depth estimation inherently does not require feature matching, allowing it to achieve robust relative depth estimation in these regions. To bridge this gap, we propose MonoMVSNet, a novel monocular feature and depth guided MVS network that integrates powerful priors from a monocular foundation model into multi-view geometry. Firstly, the monocular feature of the reference view is integrated into source view features by the attention mechanism with a newly designed cross-view position encoding. Then, the monocular depth of the reference view is aligned to dynamically update the depth candidates for edge regions during the sampling procedure. Finally, a relative consistency loss is further designed based on the monocular depth to supervise the depth prediction. Extensive experiments demonstrate that MonoMVSNet achieves state-of-the-art performance on the DTU and Tanks-and-Temples datasets, ranking first on the Tanks-and-Temples Intermediate and Advanced benchmarks. The source code is available at https://github.com/JianfeiJ/MonoMVSNet.
中文: MonoMVSNet通过注意力机制和动态深度候选更新,将单目深度与特征先验融入多视角立体视觉,在DTU和Tanks-and-Temples等基准测试中取得了领先性能。
English: MonoMVSNet enhances Multi-View Stereo by integrating monocular depth and feature priors through attention mechanisms and dynamic depth candidate updates, achieving state-of-the-art results on benchmarks like DTU and Tanks-and-Temples.
Authors:Haoran Jin, Meng Li, Xiting Wang, Zhihao Xu, Minlie Huang, Yantao Jia, Defu Lian
Abstract:
Aligning Large Language Models (LLMs) with human values has attracted increasing attention since it provides clarity, transparency, and the ability to adapt to evolving scenarios. In this paper, we introduce a Controlled Value Vector Activation (ConVA) method that directly aligns the internal values of LLMs by interpreting how a value is encoded in their latent representations and modifies relevant activations to ensure consistent values in LLMs. To ensure an accurate and unbiased interpretation, we propose a context-controlled value vector identification method. To consistently control values without sacrificing model performance, we introduce a gated value vector activation method for effective and minimum degree of value control. Experiments show that our method achieves the highest control success rate across 10 basic values without hurting LLM performance and fluency, and ensures target values even with opposite and potentially malicious input prompts. Source code and data are available at~ https://github.com/hr-jin/ConVA.
中文摘要:本文提出的ConVA方法通过解读和修正大语言模型潜在表征中的价值观编码,在不影响模型性能的前提下实现了对10种基本价值观的最优控制成功率。
English Summary: The paper introduces the ConVA method, which aligns LLMs with human values by interpreting and modifying latent value representations, achieving high control success without compromising performance.
Authors:Huilin Xu, Jian Ding, Jiakun Xu, Ruixiang Wang, Jun Chen, Jinjie Mai, Yanwei Fu, Bernard Ghanem, Feng Xu, Mohamed Elhoseiny
Abstract:
Bimanual manipulation is crucial in robotics, enabling complex tasks in industrial automation and household services. However, it poses significant challenges due to the high-dimensional action space and intricate coordination requirements. While video prediction has been recently studied for representation learning and control, leveraging its ability to capture rich dynamic and behavioral information, its potential for enhancing bimanual coordination remains underexplored. To bridge this gap, we propose a unified diffusion-based framework for the joint optimization of video and action prediction. Specifically, we propose a multi-frame latent prediction strategy that encodes future states in a compressed latent space, preserving task-relevant features. Furthermore, we introduce a unidirectional attention mechanism where video prediction is conditioned on the action, while action prediction remains independent of video prediction. This design allows us to omit video prediction during inference, significantly enhancing efficiency. Experiments on two simulated benchmarks and a real-world setting demonstrate a significant improvement in the success rate over the strong baseline ACT using our method, achieving a \textbf{24.9\%} increase on ALOHA, an \textbf{11.1\%} increase on RoboTwin, and a \textbf{32.5\%} increase in real-world experiments. Our models and code are publicly available at https://github.com/return-sleep/Diffusion_based_imaginative_Coordination.
中文: 本文提出了一种基于扩散的统一框架,通过联合优化视频与动作预测来增强机器人双手协调能力,采用创新的潜在预测策略和单向注意力机制,在模拟和真实环境中均实现了显著性能提升。
English: This paper introduces a unified diffusion-based framework that jointly optimizes video and action prediction for bimanual robot coordination, achieving significant performance improvements across simulated and real-world benchmarks through a novel latent prediction strategy and unidirectional attention mechanism.
Authors:Luohe Shi, Zuchao Li, Lefei Zhang, Guoming Liu, Baoyuan Qi, Hai Zhao
Abstract:
Large language models (LLMs) based on Transformer Decoders have become the preferred choice for conversational generative AI. Despite the overall superiority of the Decoder architecture, the gradually increasing Key-Value (KV) cache during inference has emerged as a primary efficiency bottleneck, both in aspects of memory consumption and data transfer bandwidth limitations. To address these challenges, we propose a paradigm called KV-Latent. By down-sampling the Key-Value vector dimensions into a latent space, we can significantly reduce the KV Cache footprint and improve inference speed, only with a small amount of extra training, less than 1\% of pre-training takes. Besides, we enhanced the stability of Rotary Positional Embedding applied on lower-dimensional vectors by modifying its frequency sampling mechanism, avoiding noise introduced by higher frequencies while retaining position attenuation. Our experiments, including both models with Grouped Query Attention and those without, have yielded satisfactory results. Finally, we conducted comparative experiments to study the impact of separately reducing Key and Value components on model's performance. Our approach allows for the construction of more efficient language model systems, and opens the new possibility on KV Cache saving and efficient LLMs. Our code is available at https://github.com/ShiLuohe/KV-Latent.
中文:KV-Latent范式通过将键值向量降维至潜在空间,显著减少KV缓存占用并提升推理效率,仅需少量额外训练即可实现,同时通过改进位置编码机制确保了模型性能的稳定性。
English: The KV-Latent paradigm reduces the Key-Value cache size by down-sampling vectors into a latent space, enhancing inference efficiency with minimal extra training while maintaining model performance through improved positional embedding stability.
Authors:Ronggang Huang, Haoxin Yang, Yan Cai, Xuemiao Xu, Huaidong Zhang, Shengfeng He
Abstract:
3D visual grounding aims to identify and localize objects in a 3D space based on textual descriptions. However, existing methods struggle with disentangling targets from anchors in complex multi-anchor queries and resolving inconsistencies in spatial descriptions caused by perspective variations. To tackle these challenges, we propose ViewSRD, a framework that formulates 3D visual grounding as a structured multi-view decomposition process. First, the Simple Relation Decoupling (SRD) module restructures complex multi-anchor queries into a set of targeted single-anchor statements, generating a structured set of perspective-aware descriptions that clarify positional relationships. These decomposed representations serve as the foundation for the Multi-view Textual-Scene Interaction (Multi-TSI) module, which integrates textual and scene features across multiple viewpoints using shared, Cross-modal Consistent View Tokens (CCVTs) to preserve spatial correlations. Finally, a Textual-Scene Reasoning module synthesizes multi-view predictions into a unified and robust 3D visual grounding. Experiments on 3D visual grounding datasets show that ViewSRD significantly outperforms state-of-the-art methods, particularly in complex queries requiring precise spatial differentiation. Code is available at https://github.com/visualjason/ViewSRD.
中文:ViewSRD提出了一种结构化多视角分解框架,通过视角感知描述生成和跨模态视角融合来简化复杂多锚点查询,在三维视觉定位任务中实现了最先进的性能。
English: ViewSRD introduces a structured multi-view decomposition framework that simplifies complex multi-anchor queries through perspective-aware description generation and cross-modal view integration, achieving state-of-the-art performance in 3D visual grounding.
Authors:Guanghao Wu, Chen Xu, Hai Song, Chong Wang, Qixing Zhang
Abstract:
Smoke is the first visible indicator of a wildfire.With the advancement of deep learning, image-based smoke detection has become a crucial method for detecting and preventing forest fires. However, the scarcity of smoke image data from forest fires is one of the significant factors hindering the detection of forest fire smoke. Image generation models offer a promising solution for synthesizing realistic smoke images. However, current inpainting models exhibit limitations in generating high-quality smoke representations, particularly manifesting as inconsistencies between synthesized smoke and background contexts. To solve these problems, we proposed a comprehensive framework for generating forest fire smoke images. Firstly, we employed the pre-trained segmentation model and the multimodal model to obtain smoke masks and image captions.Then, to address the insufficient utilization of masks and masked images by inpainting models, we introduced a network architecture guided by mask and masked image features. We also proposed a new loss function, the mask random difference loss, which enhances the consistency of the generated effects around the mask by randomly expanding and eroding the mask edges.Finally, to generate a smoke image dataset using random masks for subsequent detection tasks, we incorporated smoke characteristics and use a multimodal large language model as a filtering tool to select diverse and reasonable smoke images, thereby improving the quality of the synthetic dataset. Experiments showed that our generated smoke images are realistic and diverse, and effectively enhance the performance of forest fire smoke detection models. Code is available at https://github.com/wghr123/MFGDiffusion.
中文: 本研究提出了一种基于扩散模型的综合框架,通过掩码特征引导网络和新型损失函数生成逼真森林火灾烟雾图像,有效解决数据稀缺问题并显著提升烟雾检测模型性能。
English: This study introduces a comprehensive framework for generating realistic forest fire smoke images using advanced diffusion models, addressing data scarcity and enhancing detection model performance through innovative mask-guided networks and loss functions.
Authors:Lyzander Marciano Andrylie, Inaya Rahmanisa, Mahardika Krisna Ihsani, Alfan Farizki Wicaksono, Haryo Akbarianto Wibowo, Alham Fikri Aji
Abstract:
Understanding the multilingual mechanisms of large language models (LLMs) provides insight into how they process different languages, yet this remains challenging. Existing studies often focus on individual neurons, but their polysemantic nature makes it difficult to isolate language-specific units from cross-lingual representations. To address this, we explore sparse autoencoders (SAEs) for their ability to learn monosemantic features that represent concrete and abstract concepts across languages in LLMs. While some of these features are language-independent, the presence of language-specific features remains underexplored. In this work, we introduce SAE-LAPE, a method based on feature activation probability, to identify language-specific features within the feed-forward network. We find that many such features predominantly appear in the middle to final layers of the model and are interpretable. These features influence the model's multilingual performance and language output and can be used for language identification with performance comparable to fastText along with more interpretability. Our code is available at https://github.com/LyzanderAndrylie/language-specific-features
中文摘要:本研究提出SAE-LAPE方法,通过稀疏自编码器识别大语言模型中的语言特异性特征,发现这些特征主要集中在中后层,可用于可解释的语言识别和性能分析。
English Summary: This study introduces SAE-LAPE, a method that identifies language-specific features in large language models using sparse autoencoders, revealing their concentration in middle-to-final layers and demonstrating their utility for interpretable language identification and performance analysis.
Authors:Jin Li, Zezhong Ding, Xike Xie
Abstract:
Knowledge graphs (KGs) are vital for enabling knowledge reasoning across various domains. Recent KG reasoning methods that integrate both global and local information have achieved promising results. However, existing methods often suffer from score over-smoothing, which blurs the distinction between correct and incorrect answers and hinders reasoning effectiveness. To address this, we propose DuetGraph, a coarse-to-fine KG reasoning mechanism with dual-pathway global-local fusion. DuetGraph tackles over-smoothing by segregating -- rather than stacking -- the processing of local (via message passing) and global (via attention) information into two distinct pathways, preventing mutual interference and preserving representational discrimination. In addition, DuetGraph introduces a coarse-to-fine optimization, which partitions entities into high- and low-score subsets. This strategy narrows the candidate space and sharpens the score gap between the two subsets, which alleviates over-smoothing and enhances inference quality. Extensive experiments on various datasets demonstrate that DuetGraph achieves state-of-the-art (SOTA) performance, with up to an 8.7% improvement in reasoning quality and a 1.8$\times$ acceleration in training efficiency. Our code is available at https://github.com/USTC-DataDarknessLab/DuetGraph.git.
中文摘要:DuetGraph是一种新颖的知识图谱推理机制,通过将全局和局部信息处理分离为双路径并采用由粗到细的优化策略,有效解决了分数过度平滑问题,在推理质量和训练效率上均实现了显著提升,达到了最先进的性能水平。
English Summary: DuetGraph is a novel knowledge graph reasoning mechanism that addresses score over-smoothing by separating global and local information processing into dual pathways and implementing a coarse-to-fine optimization strategy, achieving state-of-the-art performance with significant improvements in reasoning quality and training efficiency.
Authors:Yuan Yao, Jin Song, Jian Jin
Abstract:
As valuable digital assets, deep neural networks necessitate robust ownership protection, positioning neural network watermarking (NNW) as a promising solution. Among various NNW approaches, weight-based methods are favored for their simplicity and practicality; however, they remain vulnerable to forging and overwriting attacks. To address those challenges, we propose NeuralMark, a robust method built around a hashed watermark filter. Specifically, we utilize a hash function to generate an irreversible binary watermark from a secret key, which is then used as a filter to select the model parameters for embedding. This design cleverly intertwines the embedding parameters with the hashed watermark, providing a robust defense against both forging and overwriting attacks. An average pooling is also incorporated to resist fine-tuning and pruning attacks. Furthermore, it can be seamlessly integrated into various neural network architectures, ensuring broad applicability. Theoretically, we analyze its security boundary. Empirically, we verify its effectiveness and robustness across 13 distinct Convolutional and Transformer architectures, covering five image classification tasks and one text generation task. The source codes are available at https://github.com/AIResearch-Group/NeuralMark.
中文: NeuralMark是一种鲁棒的神经网络水印方法,通过哈希水印滤波器将不可逆水印嵌入模型参数,有效防御伪造、覆盖、微调和剪枝攻击,并兼容多种网络架构。
English: NeuralMark is a robust neural network watermarking method that uses a hashed watermark filter to embed irreversible watermarks into model parameters, effectively defending against forging, overwriting, fine-tuning, and pruning attacks while being compatible with various architectures.
Authors:Afra Kilic, Kim Batselier
Abstract:
Tensor Network (TN) Kernel Machines speed up model learning by representing parameters as low-rank TNs, reducing computation and memory use. However, most TN-based Kernel methods are deterministic and ignore parameter uncertainty. Further, they require manual tuning of model complexity hyperparameters like tensor rank and feature dimensions, often through trial-and-error or computationally costly methods like cross-validation. We propose Bayesian Tensor Network Kernel Machines, a fully probabilistic framework that uses sparsity-inducing hierarchical priors on TN factors to automatically infer model complexity. This enables automatic inference of tensor rank and feature dimensions, while also identifying the most relevant features for prediction, thereby enhancing model interpretability. All the model parameters and hyperparameters are treated as latent variables with corresponding priors. Given the Bayesian approach and latent variable dependencies, we apply a mean-field variational inference to approximate their posteriors. We show that applying a mean-field approximation to TN factors yields a Bayesian ALS algorithm with the same computational complexity as its deterministic counterpart, enabling uncertainty quantification at no extra computational cost. Experiments on synthetic and real-world datasets demonstrate the superior performance of our model in prediction accuracy, uncertainty quantification, interpretability, and scalability.
中文: 提出的贝叶斯张量网络核机构建了一个概率框架,通过稀疏诱导先验自动推断模型复杂度,在保持与确定性方法相当计算效率的同时,实现了不确定性量化和更强的可解释性。
English: The proposed Bayesian Tensor Network Kernel Machines introduce a probabilistic framework that automatically infers model complexity through sparsity-inducing priors, enabling uncertainty quantification and enhanced interpretability while maintaining computational efficiency comparable to deterministic methods.
Authors:Zhifeng Gu, Bing Wang
Abstract:
Humans perceive the world through multimodal cues to understand and interact with the environment. Learning a scene representation for multiple modalities enhances comprehension of the physical world. However, modality conflicts, arising from inherent distinctions among different modalities, present two critical challenges: property disparity and granularity disparity. To address these challenges, we propose a general framework, MMOne, to represent multiple modalities in one scene, which can be readily extended to additional modalities. Specifically, a modality modeling module with a novel modality indicator is proposed to capture the unique properties of each modality. Additionally, we design a multimodal decomposition mechanism to separate multi-modal Gaussians into single-modal Gaussians based on modality differences. We address the essential distinctions among modalities by disentangling multimodal information into shared and modality-specific components, resulting in a more compact and efficient multimodal scene representation. Extensive experiments demonstrate that our method consistently enhances the representation capability for each modality and is scalable to additional modalities. The code is available at https://github.com/Neal2020GitHub/MMOne.
中文摘要:本研究提出MMOne框架,通过模态建模和分解机制解决模态冲突,将多模态信息解耦为共享与特定成分,从而提升各模态表示能力并支持扩展。
English Summary: The study introduces MMOne, a framework that addresses modality conflicts by modeling unique properties and decomposing multimodal information into shared and specific components, resulting in enhanced and scalable scene representation.
Authors:Hankun Liu, Yujian Zhao, Guanglin Niu
Abstract:
Hard samples pose a significant challenge in person re-identification (ReID) tasks, particularly in clothing-changing person Re-ID (CC-ReID). Their inherent ambiguity or similarity, coupled with the lack of explicit definitions, makes them a fundamental bottleneck. These issues not only limit the design of targeted learning strategies but also diminish the model's robustness under clothing or viewpoint changes. In this paper, we propose a novel multimodal-guided Hard Sample Generation and Learning (HSGL) framework, which is the first effort to unify textual and visual modalities to explicitly define, generate, and optimize hard samples within a unified paradigm. HSGL comprises two core components: (1) Dual-Granularity Hard Sample Generation (DGHSG), which leverages multimodal cues to synthesize semantically consistent samples, including both coarse- and fine-grained hard positives and negatives for effectively increasing the hardness and diversity of the training data. (2) Hard Sample Adaptive Learning (HSAL), which introduces a hardness-aware optimization strategy that adjusts feature distances based on textual semantic labels, encouraging the separation of hard positives and drawing hard negatives closer in the embedding space to enhance the model's discriminative capability and robustness to hard samples. Extensive experiments on multiple CC-ReID benchmarks demonstrate the effectiveness of our approach and highlight the potential of multimodal-guided hard sample generation and learning for robust CC-ReID. Notably, HSAL significantly accelerates the convergence of the targeted learning procedure and achieves state-of-the-art performance on both PRCC and LTCC datasets. The code is available at https://github.com/undooo/TryHarder-ACMMM25.
中文: HSGL框架创新性地融合文本与视觉模态,通过双粒度样本生成和自适应学习来定义、生成及优化困难样本,显著提升了模型在换装行人重识别中的鲁棒性,并取得了最优性能。
English: The proposed HSGL framework innovatively integrates text and visual modalities to define, generate, and optimize hard samples through dual-granularity generation and adaptive learning, significantly enhancing model robustness and achieving state-of-the-art performance in clothing-changing person re-identification.
Authors:Zichen Wen, Jiashu Qu, Dongrui Liu, Zhiyuan Liu, Ruixi Wu, Yicun Yang, Xiangqi Jin, Haoyun Xu, Xuyang Liu, Weijia Li, Chaochao Lu, Jing Shao, Conghui He, Linfeng Zhang
Abstract:
Diffusion-based large language models (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs, offering faster inference and greater interactivity via parallel decoding and bidirectional modeling. However, despite strong performance in code generation and text infilling, we identify a fundamental safety concern: existing alignment mechanisms fail to safeguard dLLMs against context-aware, masked-input adversarial prompts, exposing novel vulnerabilities. To this end, we present DIJA, the first systematic study and jailbreak attack framework that exploits unique safety weaknesses of dLLMs. Specifically, our proposed DIJA constructs adversarial interleaved mask-text prompts that exploit the text generation mechanisms of dLLMs, i.e., bidirectional modeling and parallel decoding. Bidirectional modeling drives the model to produce contextually consistent outputs for masked spans, even when harmful, while parallel decoding limits model dynamic filtering and rejection sampling of unsafe content. This causes standard alignment mechanisms to fail, enabling harmful completions in alignment-tuned dLLMs, even when harmful behaviors or unsafe instructions are directly exposed in the prompt. Through comprehensive experiments, we demonstrate that DIJA significantly outperforms existing jailbreak methods, exposing a previously overlooked threat surface in dLLM architectures. Notably, our method achieves up to 100% keyword-based ASR on Dream-Instruct, surpassing the strongest prior baseline, ReNeLLM, by up to 78.5% in evaluator-based ASR on JailbreakBench and by 37.7 points in StrongREJECT score, while requiring no rewriting or hiding of harmful content in the jailbreak prompt. Our findings underscore the urgent need for rethinking safety alignment in this emerging class of language models. Code is available at https://github.com/ZichenWen1/DIJA.
中文: 基于扩散的大语言模型(dLLMs)存在新颖的安全漏洞,其双向建模和平行解码机制使对抗性提示能够绕过现有安全对齐措施,即使明确暴露有害指令仍可能生成危险内容。
English: Diffusion-based large language models (dLLMs) present novel safety vulnerabilities as their bidirectional modeling and parallel decoding mechanisms allow adversarial prompts to bypass existing alignment safeguards, enabling harmful completions even when unsafe instructions are explicitly exposed.
Authors:Vassilis Sioros, Alexandros Potamianos, Giorgos Paraskevopoulos
Abstract:
In this study, we investigate leveraging cross-attention control for efficient audio editing within auto-regressive models. Inspired by image editing methodologies, we develop a Prompt-to-Prompt-like approach that guides edits through cross and self-attention mechanisms. Integrating a diffusion-based strategy, influenced by Auffusion, we extend the model's functionality to support refinement edits, establishing a baseline for prompt-guided audio editing. Additionally, we introduce an alternative approach by incorporating MUSICGEN, a pre-trained frozen auto-regressive model, and propose three editing mechanisms, based on Replacement, Reweighting, and Refinement of the attention scores. We employ commonly-used music-specific evaluation metrics and a human study, to gauge time-varying controllability, adherence to global text cues, and overall audio realism. The automatic and human evaluations indicate that the proposed combination of prompt-to-prompt guidance with autoregressive generation models significantly outperforms the diffusion-based baseline in terms of melody, dynamics, and tempo of the generated audio. Our code is available at https://github.com/billsioros/EditGen
本研究提出了一种基于提示到提示的自回归模型音频编辑方法,通过结合交叉注意力控制与MUSICGEN模型,在旋律、动态和节奏方面显著优于基于扩散的基准方法。
This study introduces a prompt-to-prompt approach for audio editing in auto-regressive models, combining cross-attention control with MUSICGEN to outperform diffusion-based methods in melody, dynamics, and tempo.
Authors:Weizhao Ma, Dong Zhou, Yuhui Hu, Zipeng He
Abstract:
Monocular pose estimation of non-cooperative spacecraft is significant for on-orbit service (OOS) tasks, such as satellite maintenance, space debris removal, and station assembly. Considering the high demands on pose estimation accuracy, mainstream monocular pose estimation methods typically consist of keypoint detectors and PnP solver. However, current keypoint detectors remain vulnerable to structural symmetry and partial occlusion of non-cooperative spacecraft. To this end, we propose a graph-based keypoints network for the monocular pose estimation of non-cooperative spacecraft, GKNet, which leverages the geometric constraint of keypoints graph. In order to better validate keypoint detectors, we present a moderate-scale dataset for the spacecraft keypoint detection, named SKD, which consists of 3 spacecraft targets, 90,000 simulated images, and corresponding high-precise keypoint annotations. Extensive experiments and an ablation study have demonstrated the high accuracy and effectiveness of our GKNet, compared to the state-of-the-art spacecraft keypoint detectors. The code for GKNet and the SKD dataset is available at https://github.com/Dongzhou-1996/GKNet.
Chinese: 本文提出GKNet,一种基于图结构的关键点检测网络,通过利用关键点间的几何约束有效解决了非合作航天器结构对称性和局部遮挡对单目姿态估计的影响,并基于新构建的数据集验证了其优越性能。
English: This paper introduces GKNet, a graph-based keypoint detection network that addresses challenges like structural symmetry and occlusion in monocular pose estimation of non-cooperative spacecraft, validated through a new dataset and outperforming existing methods.
Authors:Lirong Zheng, Yanshan Li, Rui Yu, Kaihao Zhang
Abstract:
Transformer-based models exhibit strong global modeling capabilities in single-image dehazing, but their high computational cost limits real-time applicability. Existing methods predominantly rely on spatial-domain features to capture long-range dependencies, which are computationally expensive and often inadequate under complex haze conditions. While some approaches introduce frequency-domain cues, the weak coupling between spatial and frequency branches limits the overall performance. To overcome these limitations, we propose the Dark Channel Guided Frequency-aware Dehazing Network (DGFDNet), a novel dual-domain framework that performs physically guided degradation alignment across spatial and frequency domains. At its core, the DGFDBlock comprises two key modules: 1) the Haze-Aware Frequency Modulator (HAFM), which generates a pixel-level haze confidence map from dark channel priors to adaptively enhance haze-relevant frequency components, thereby achieving global degradation-aware spectral modulation; 2) the Multi-level Gating Aggregation Module (MGAM), which fuses multi-scale features through diverse convolutional kernels and hybrid gating mechanisms to recover fine structural details. Additionally, a Prior Correction Guidance Branch (PCGB) incorporates a closed-loop feedback mechanism, enabling iterative refinement of the prior by intermediate dehazed features and significantly improving haze localization accuracy, especially in challenging outdoor scenes. Extensive experiments on four benchmark haze datasets demonstrate that DGFDNet achieves state-of-the-art performance with superior robustness and real-time efficiency. Code is available at: https://github.com/Dilizlr/DGFDNet.
中文摘要:提出的DGFDNet通过融合暗通道先验与频率感知调制,结合多尺度特征门控融合机制,在基准去雾数据集上实现了最优性能并保持实时处理能力。
English Summary: The proposed DGFDNet introduces a dual-domain dehazing framework that integrates dark channel priors with frequency-aware modulation and multi-scale feature fusion, achieving state-of-the-art performance with real-time efficiency on benchmark datasets.
Authors:Xingyu Zheng, Haotong Qin, Yuye Li, Jiakai Wang, Jinyang Guo, Michele Magno, Xianglong Liu
Abstract:
Post-training quantization (PTQ) offers an efficient approach to compressing large language models (LLMs), significantly reducing memory access and computational costs. Existing compensation-based weight calibration methods often rely on a second-order Taylor expansion to model quantization error, under the assumption that the first-order term is negligible in well-trained full-precision models. However, we reveal that the progressive compensation process introduces accumulated first-order deviations between latent weights and their full-precision counterparts, making this assumption fundamentally flawed. To address this, we propose FOEM, a novel PTQ method that explicitly incorporates first-order gradient terms to improve quantization error compensation. FOEM approximates gradients by directly computing the difference between latent and full-precision weights, avoiding the high cost and limited generalization of backpropagation-based gradient computation. This approach introduces minimal additional computational overhead. Moreover, FOEM leverages precomputed Cholesky factors to efficiently recover the inverse of Hessian submatrices in real time. Extensive experiments across a wide range of models and benchmarks demonstrate that FOEM consistently outperforms the classical GPTQ method. In 3-bit weight-only quantization, FOEM reduces the perplexity of Llama3-8B by 89.6%, and improves the 5-shot MMLU accuracy of Llama3-70B from 51.7% to 74.9%, approaching the full-precision performance of 78.6%. Furthermore, FOEM can be seamlessly integrated with advanced techniques such as GPTAQ and SpinQuant, yielding additional improvements under the challenging W4A4KV4 setting, and further narrowing the accuracy gap with full-precision baselines beyond what current state-of-the-art methods achieve. The code is available at https://github.com/Xingyu-Zheng/FOEM.
中文摘要:FOEM提出了一种新颖的训练后量化方法,通过引入一阶梯度项解决权重校准中的累积偏差问题,以极低计算开销显著提升模型性能。
English Summary: FOEM introduces a novel post-training quantization method that incorporates first-order gradient terms to address accumulated deviations in weight calibration, significantly improving model performance with minimal computational overhead.
Authors:Chongjie Si, Debing Zhang, Wei Shen
Abstract:
We propose AdaMuon, a novel optimizer that combines element-wise adaptivity with orthogonal updates for large-scale neural network training. AdaMuon incorporates two tightly coupled mechanisms: (1) an element-wise second momentum estimator applied to orthogonalized update directions, and (2) a sign-stabilized orthogonal update, where the momentum is first sign-transformed before orthogonalization. These two components jointly enable variance-adaptive scaling while maintaining stable update geometry. In addition, AdaMuon employs an RMS-aligned rescaling strategy to match the root-mean-square update magnitude to Adam, allowing direct reuse of existing learning rate schedules without extra tuning. Experiments demonstrate that AdaMuon not only maintains stability but can surpass Adam by more than 40% training efficiency in large-scale scenarios.
Chinese: AdaMuon是一种新型优化器,将逐元素自适应与正交更新相结合,在大规模神经网络训练中不仅保持稳定性,还能比Adam提高超过40%的训练效率。
English: AdaMuon is a novel optimizer that integrates element-wise adaptivity with orthogonal updates, enhancing training efficiency by over 40% compared to Adam in large-scale neural networks while maintaining stability.
Authors:Yejun Yoon, Jaeyoon Jung, Seunghyun Yoon, Kunwoo Park
Abstract:
This paper presents HerO 2, Team HUMANE's system for the AVeriTeC shared task at the FEVER-25 workshop. HerO 2 is an enhanced version of HerO, the best-performing open-source model from the previous year's challenge. It improves evidence quality through document summarization and answer reformulation, optimizes veracity prediction via post-training quantization under computational constraints, and enhances overall system performance by integrating updated language model (LM) backbones. HerO 2 ranked second on the leaderboard while achieving the shortest runtime among the top three systems, demonstrating both high efficiency and strong potential for real-world fact verification. The code is available at https://github.com/ssu-humane/HerO2.
中文: HerO 2是一个增强型事实核查系统,通过证据总结、模型优化和更新语言模型提升了处理效率和预测性能,在竞赛中位列第二并实现前三名中最短运行时间。
English: HerO 2 is an enhanced fact verification system that improves evidence processing and prediction efficiency through summarization, model optimization, and updated language models, achieving second place with the fastest runtime among top competitors.
Authors:Yanbo Wang, Zipeng Fang, Lei Zhao, Weidong Chen
Abstract:
Service robots are increasingly deployed in diverse and dynamic environments, where both physical layouts and social contexts change over time and across locations. In these unstructured settings, conventional navigation systems that rely on fixed parameters often fail to generalize across scenarios, resulting in degraded performance and reduced social acceptance. Although recent approaches have leveraged reinforcement learning to enhance traditional planners, these methods often fail in real-world deployments due to poor generalization and limited simulation diversity, which hampers effective sim-to-real transfer. To tackle these issues, we present LE-Nav, an interpretable and scene-aware navigation framework that leverages multi-modal large language model reasoning and conditional variational autoencoders to adaptively tune planner hyperparameters. To achieve zero-shot scene understanding, we utilize one-shot exemplars and chain-of-thought prompting strategies. Additionally, a conditional variational autoencoder captures the mapping between natural language instructions and navigation hyperparameters, enabling expert-level tuning. Experiments show that LE-Nav can generate hyperparameters achieving human-level tuning across diverse planners and scenarios. Real-world navigation trials and a user study on a smart wheelchair platform demonstrate that it outperforms state-of-the-art methods on quantitative metrics such as success rate, efficiency, safety, and comfort, while receiving higher subjective scores for perceived safety and social acceptance. Code is available at https://github.com/Cavendish518/LE-Nav.
中文: LE-Nav是一种可解释的导航框架,通过多模态推理和条件变分自编码器动态调整规划器参数,在真实场景试验中实现了人类水平的性能并优于现有方法。
English: LE-Nav is an interpretable navigation framework that uses multi-modal reasoning and conditional variational autoencoders to dynamically tune planner parameters, achieving human-level performance and outperforming existing methods in real-world trials.
Authors:Quan Bi Pay, Vishnu Monn Baskaran, Junn Yong Loo, KokSheik Wong, Simon See
Abstract:
The resurgence of convolutional neural networks (CNNs) in visual recognition tasks, exemplified by ConvNeXt, has demonstrated their capability to rival transformer-based architectures through advanced training methodologies and ViT-inspired design principles. However, both CNNs and transformers exhibit a simplicity bias, favoring straightforward features over complex structural representations. Furthermore, modern CNNs often integrate MLP-like blocks akin to those in transformers, but these blocks suffer from significant information redundancies, necessitating high expansion ratios to sustain competitive performance. To address these limitations, we propose SpaRTAN, a lightweight architectural design that enhances spatial and channel-wise information processing. SpaRTAN employs kernels with varying receptive fields, controlled by kernel size and dilation factor, to capture discriminative multi-order spatial features effectively. A wave-based channel aggregation module further modulates and reinforces pixel interactions, mitigating channel-wise redundancies. Combining the two modules, the proposed network can efficiently gather and dynamically contextualize discriminative features. Experimental results in ImageNet and COCO demonstrate that SpaRTAN achieves remarkable parameter efficiency while maintaining competitive performance. In particular, on the ImageNet-1k benchmark, SpaRTAN achieves 77. 7% accuracy with only 3.8M parameters and approximately 1.0 GFLOPs, demonstrating its ability to deliver strong performance through an efficient design. On the COCO benchmark, it achieves 50.0% AP, surpassing the previous benchmark by 1.2% with only 21.5M parameters. The code is publicly available at [https://github.com/henry-pay/SpaRTAN].
中文: SpaRTAN是一种轻量级CNN架构,通过多尺度卷积核和基于波的通道聚合模块优化空间与通道信息处理,在ImageNet和COCO基准测试中以少量参数实现了卓越的性能表现。
English: SpaRTAN is a lightweight CNN architecture that enhances spatial and channel-wise feature processing through multi-scale kernels and wave-based aggregation, achieving competitive performance with high parameter efficiency on ImageNet and COCO benchmarks.
Authors:Zhipeng He, Alexander Stevens, Chun Ouyang, Johannes De Smedt, Alistair Barros, Catarina Moreira
Abstract:
Adversarial attacks on tabular data present fundamental challenges distinct from image or text domains due to the heterogeneous nature of mixed categorical and numerical features. Unlike images where pixel perturbations maintain visual similarity, tabular data lacks intuitive similarity metrics, making it difficult to define imperceptible modifications. Additionally, traditional gradient-based methods prioritise $\ell_p$-norm constraints, often producing adversarial examples that deviate from the original data distributions, making them detectable. We propose a latent space perturbation framework using a mixed-input Variational Autoencoder (VAE) to generate imperceptible adversarial examples. The proposed VAE integrates categorical embeddings and numerical features into a unified latent manifold, enabling perturbations that preserve statistical consistency. We specify In-Distribution Success Rate (IDSR) to measure the proportion of adversarial examples that remain statistically indistinguishable from the input distribution. Evaluation across six publicly available datasets and three model architectures demonstrates that our method achieves substantially lower outlier rates and more consistent performance compared to traditional input-space attacks and other VAE-based methods adapted from image domain approaches. Our comprehensive analysis includes hyperparameter sensitivity, sparsity control mechanisms, and generative architectural comparisons, revealing that VAE-based attacks depend critically on reconstruction quality but offer superior practical utility when sufficient training data is available. This work highlights the importance of on-manifold perturbations for realistic adversarial attacks on tabular data, offering a robust approach for practical deployment. The source code can be accessed through https://github.com/ZhipengHe/VAE-TabAttack.
中文: 本文提出了一种基于混合输入变分自编码器的潜在空间扰动框架,可为表格数据生成难以察觉的对抗样本,相比传统方法具有更优的统计一致性和更低的异常率。
English: This paper introduces a latent space perturbation framework using a mixed-input Variational Autoencoder to generate imperceptible adversarial examples for tabular data, achieving superior statistical consistency and lower outlier rates compared to traditional methods.
Authors:Rodney Lafuente-Mercado
Abstract:
Scaling reinforcement learning (RL) workloads often requires distributing environment simulation across compute clusters. Existing frameworks entangle simulation, learning logic, and orchestration into monolithic systems, limiting modularity and reusability. We present ClusterEnv, a lightweight, learner-agnostic interface for distributed environment execution that mirrors the Gymnasium API. ClusterEnv introduces the DETACH pattern, which decouples simulation from training by offloading reset() and step() operations to remote workers while keeping learning centralized. To address policy staleness in distributed execution, we propose Adaptive Actor Policy Synchronization (AAPS), a divergence-triggered update mechanism that reduces synchronization overhead without sacrificing performance. ClusterEnv integrates cleanly into existing RL pipelines, supports both on-policy and off-policy methods, and requires minimal code changes. Experiments on discrete control tasks demonstrate that AAPS achieves high sample efficiency with significantly fewer weight updates. Source code is available at https://github.com/rodlaf/ClusterEnv.
中文:ClusterEnv是一个轻量级、与学习器无关的分布式强化学习接口,采用DETACH模式将模拟与训练解耦,并通过自适应策略同步机制(AAPS)减少同步开销,从而提升效率。
English: ClusterEnv is a lightweight, learner-agnostic interface for distributed reinforcement learning that decouples simulation from training using the DETACH pattern and enhances efficiency with Adaptive Actor Policy Synchronization (AAPS) to minimize synchronization overhead.
Authors:Ayush Gupta, Siyuan Huang, Rama Chellappa
Abstract:
Gait is becoming popular as a method of person re-identification because of its ability to identify people at a distance. However, most current works in gait recognition do not address the practical problem of occlusions. Among those which do, some require paired tuples of occluded and holistic sequences, which are impractical to collect in the real world. Further, these approaches work on occlusions but fail to retain performance on holistic inputs. To address these challenges, we propose RG-Gait, a method for residual correction for occluded gait recognition with holistic retention. We model the problem as a residual learning task, conceptualizing the occluded gait signature as a residual deviation from the holistic gait representation. Our proposed network adaptively integrates the learned residual, significantly improving performance on occluded gait sequences without compromising the holistic recognition accuracy. We evaluate our approach on the challenging Gait3D, GREW and BRIAR datasets and show that learning the residual can be an effective technique to tackle occluded gait recognition with holistic retention. We release our code publicly at https://github.com/Ayush-00/rg-gait.
中文摘要:RG-Gait提出一种残差修正方法,将遮挡步态建模为完整步态的偏差表示,在提升遮挡识别能力的同时保持了对无遮挡步态的识别精度。
English Summary: RG-Gait introduces a residual correction method that models occluded gait as deviations from holistic representations, enhancing occlusion recognition without sacrificing performance on unoccluded inputs.
Authors:Quan Bi Pay, Vishnu Monn Baskaran, Junn Yong Loo, KokSheik Wong, Simon See
Abstract:
Human-object interaction (HOI) detection is essential for accurately localizing and characterizing interactions between humans and objects, providing a comprehensive understanding of complex visual scenes across various domains. However, existing HOI detectors often struggle to deliver reliable predictions efficiently, relying on resource-intensive training methods and inefficient architectures. To address these challenges, we conceptualize a wavelet attention-like backbone and a novel ray-based encoder architecture tailored for HOI detection. Our wavelet backbone addresses the limitations of expressing middle-order interactions by aggregating discriminative features from the low- and high-order interactions extracted from diverse convolutional filters. Concurrently, the ray-based encoder facilitates multi-scale attention by optimizing the focus of the decoder on relevant regions of interest and mitigating computational overhead. As a result of harnessing the attenuated intensity of learnable ray origins, our decoder aligns query embeddings with emphasized regions of interest for accurate predictions. Experimental results on benchmark datasets, including ImageNet and HICO-DET, showcase the potential of our proposed architecture. The code is publicly available at [https://github.com/henry-pay/RayEncoder].
Chinese: 本研究提出了一种小波类注意力主干和基于射线的编码器,通过高效聚合多尺度特征并优化计算焦点,提升了人-物交互检测的性能,在基准数据集上取得了良好效果。
English: The study introduces a wavelet attention-like backbone and a ray-based encoder to enhance human-object interaction detection by efficiently aggregating multi-scale features and optimizing computational focus, achieving promising results on benchmark datasets.
Authors:Ashfaq Ali Shafin, Khandaker Mamun Ahmed
Abstract:
State-sponsored information operations (IOs) increasingly influence global discourse on social media platforms, yet their emotional and rhetorical strategies remain inadequately characterized in scientific literature. This study presents the first comprehensive analysis of toxic language deployment within such campaigns, examining 56 million posts from over 42 thousand accounts linked to 18 distinct geopolitical entities on X/Twitter. Using Google's Perspective API, we systematically detect and quantify six categories of toxic content and analyze their distribution across national origins, linguistic structures, and engagement metrics, providing essential information regarding the underlying patterns of such operations. Our findings reveal that while toxic content constitutes only 1.53% of all posts, they are associated with disproportionately high engagement and appear to be strategically deployed in specific geopolitical contexts. Notably, toxic content originating from Russian influence operations receives significantly higher user engagement compared to influence operations from any other country in our dataset. Our code is available at https://github.com/shafin191/Toxic_IO.
中文摘要:本研究分析了X/Twitter平台上5600万条国家支持的信息操作帖子,发现尽管有毒内容仅占帖子的1.53%,却能引发异常高的用户参与度,且在特定地缘政治背景下被策略性部署,其中俄罗斯信息操作获得的用户互动显著高于其他国家。
English Summary: This study analyzes 56 million posts from state-sponsored information operations on X/Twitter, revealing that although toxic content makes up only 1.53% of posts, it generates disproportionately high engagement and is strategically deployed in specific geopolitical contexts, with Russian operations receiving notably higher user interaction.
Authors:Shaowen Tong, Zimin Xia, Alexandre Alahi, Xuming He, Yujiao Shi
Abstract:
Cross-view localization, the task of estimating a camera's 3-degrees-of-freedom (3-DoF) pose by aligning ground-level images with satellite images, is crucial for large-scale outdoor applications like autonomous navigation and augmented reality. Existing methods often rely on fully supervised learning, which requires costly ground-truth pose annotations. In this work, we propose GeoDistill, a Geometry guided weakly supervised self distillation framework that uses teacher-student learning with Field-of-View (FoV)-based masking to enhance local feature learning for robust cross-view localization. In GeoDistill, the teacher model localizes a panoramic image, while the student model predicts locations from a limited FoV counterpart created by FoV-based masking. By aligning the student's predictions with those of the teacher, the student focuses on key features like lane lines and ignores textureless regions, such as roads. This results in more accurate predictions and reduced uncertainty, regardless of whether the query images are panoramas or limited FoV images. Our experiments show that GeoDistill significantly improves localization performance across different frameworks. Additionally, we introduce a novel orientation estimation network that predicts relative orientation without requiring precise planar position ground truth. GeoDistill provides a scalable and efficient solution for real-world cross-view localization challenges. Code and model can be found at https://github.com/tongshw/GeoDistill.
中文: GeoDistill提出了一种弱监督自蒸馏框架,通过结合教师-学生学习和基于视场角的掩码来增强跨视角定位,无需昂贵标注即可提升精度并降低不确定性。
English: GeoDistill introduces a weakly supervised self-distillation framework that enhances cross-view localization by aligning teacher-student predictions with FoV-based masking, improving accuracy and reducing uncertainty without costly annotations.
Authors:Roman Naeem, David Hagerman, Jennifer Alvén, Lennart Svensson, Fredrik Kahl
Abstract:
Tubular tree structures, such as blood vessels and airways, are essential in human anatomy and accurately tracking them while preserving their topology is crucial for various downstream tasks. Trexplorer is a recurrent model designed for centerline tracking in 3D medical images but it struggles with predicting duplicate branches and terminating tracking prematurely. To address these issues, we present Trexplorer Super, an enhanced version that notably improves performance through novel advancements. However, evaluating centerline tracking models is challenging due to the lack of public datasets. To enable thorough evaluation, we develop three centerline datasets, one synthetic and two real, each with increasing difficulty. Using these datasets, we conduct a comprehensive evaluation of existing state-of-the-art (SOTA) models and compare them with our approach. Trexplorer Super outperforms previous SOTA models on every dataset. Our results also highlight that strong performance on synthetic data does not necessarily translate to real datasets. The code and datasets are available at https://github.com/RomStriker/Trexplorer-Super.
Chinese: Trexplorer Super 是一种增强模型,在三维医学图像中对管状结构中心线追踪的表现优于现有最先进方法,并在新开发的合成和真实数据集上进行了全面评估。
English: Trexplorer Super is an enhanced model that outperforms existing state-of-the-art methods in centerline tracking for tubular structures in 3D medical images, with evaluations conducted on newly developed synthetic and real datasets.
Authors:Chetan Madan, Aarjav Satia, Soumen Basu, Pankaj Gupta, Usha Dutta, Chetan Arora
Abstract:
Masked Autoencoders (MAEs) have emerged as a dominant strategy for self-supervised representation learning in natural images, where models are pre-trained to reconstruct masked patches with a pixel-wise mean squared error (MSE) between original and reconstructed RGB values as the loss. We observe that MSE encourages blurred image re-construction, but still works for natural images as it preserves dominant edges. However, in medical imaging, when the texture cues are more important for classification of a visual abnormality, the strategy fails. Taking inspiration from Gray Level Co-occurrence Matrix (GLCM) feature in Radiomics studies, we propose a novel MAE based pre-training framework, GLCM-MAE, using reconstruction loss based on matching GLCM. GLCM captures intensity and spatial relationships in an image, hence proposed loss helps preserve morphological features. Further, we propose a novel formulation to convert matching GLCM matrices into a differentiable loss function. We demonstrate that unsupervised pre-training on medical images with the proposed GLCM loss improves representations for downstream tasks. GLCM-MAE outperforms the current state-of-the-art across four tasks - gallbladder cancer detection from ultrasound images by 2.1%, breast cancer detection from ultrasound by 3.1%, pneumonia detection from x-rays by 0.5%, and COVID detection from CT by 0.6%. Source code and pre-trained models are available at: https://github.com/ChetanMadan/GLCM-MAE.
中文: 掩码自编码器(MAE)通过引入基于灰度共生矩阵(GLCM)的重建损失,有效保留了医学图像中的关键纹理特征,从而在胆囊癌、乳腺癌、肺炎和COVID-19检测任务中显著提升了识别性能。
English: Masked Autoencoders (MAEs) are enhanced with a novel GLCM-based reconstruction loss that preserves critical texture features in medical images, leading to improved performance in detecting diseases like cancer and pneumonia across various imaging modalities.
Authors:Motoki Omura, Yusuke Mukuta, Kazuki Ota, Takayuki Osa, Tatsuya Harada
Abstract:
Offline reinforcement learning (RL) aims to learn an optimal policy from a static dataset, making it particularly valuable in scenarios where data collection is costly, such as robotics. A major challenge in offline RL is distributional shift, where the learned policy deviates from the dataset distribution, potentially leading to unreliable out-of-distribution actions. To mitigate this issue, regularization techniques have been employed. While many existing methods utilize density ratio-based measures, such as the $f$-divergence, for regularization, we propose an approach that utilizes the Wasserstein distance, which is robust to out-of-distribution data and captures the similarity between actions. Our method employs input-convex neural networks (ICNNs) to model optimal transport maps, enabling the computation of the Wasserstein distance in a discriminator-free manner, thereby avoiding adversarial training and ensuring stable learning. Our approach demonstrates comparable or superior performance to widely used existing methods on the D4RL benchmark dataset. The code is available at https://github.com/motokiomura/Q-DOT .
Chinese: 离线强化学习通过引入瓦瑟斯坦距离进行正则化,利用输入凸神经网络无需对抗训练即可计算该距离,在D4RL基准测试中表现优异。
English: Offline reinforcement learning tackles distributional shift by using the Wasserstein distance for regularization, employing input-convex neural networks to compute it without adversarial training, achieving strong results on the D4RL benchmark.
Authors:Ali Hojjat, Janek Haberer, Soren Pirk, Olaf Landsiedel
Abstract:
Vision Transformers deliver state-of-the-art performance, yet their fixed computational budget prevents scalable deployment across heterogeneous hardware. Recent nested Transformer architectures mitigate this by embedding nested subnetworks within a single model to enable scalable inference. However, these models allocate the same amount of compute to all inputs, regardless of their complexity, which leads to inefficiencies. To address this, we introduce ThinkingViT, a nested ViT architecture that employs progressive thinking stages to dynamically adjust inference computation based on input difficulty. ThinkingViT initiates inference by activating a small subset of the most important attention heads and terminates early if predictions reach sufficient certainty. Otherwise, it activates additional attention heads and re-evaluates the input. At the core of ThinkingViT is our Token Recycling mechanism, which conditions each subsequent inference stage on the embeddings from the previous stage, enabling progressive improvement. Due to its backbone-preserving design, ThinkingViT also serves as a plugin upgrade for vanilla ViT. Experiments show that ThinkingViT surpasses nested baselines by up to 2.0 percentage points (p.p.) in accuracy at the same throughput and by up to 2.9 p.p. at equal GMACs on ImageNet-1K. The source code is available at https://github.com/ds-kiel/ThinkingViT.
中文: ThinkingViT提出了一种嵌套视觉变换器架构,通过渐进式思考阶段和令牌回收机制,根据输入复杂度动态调整计算资源,实现了更高的准确性和效率。
English: ThinkingViT introduces a nested Vision Transformer architecture that dynamically adjusts computational resources based on input complexity, achieving superior accuracy and efficiency through progressive thinking stages and token recycling.
Authors:Yuchen Wang, Hongjue Zhao, Haohong Lin, Enze Xu, Lifang He, Huajie Shao
Abstract:
This work aims to address the problem of long-term dynamic forecasting in complex environments where data are noisy and irregularly sampled. While recent studies have introduced some methods to improve prediction performance, these approaches still face a significant challenge in handling long-term extrapolation tasks under such complex scenarios. To overcome this challenge, we propose Phy-SSM, a generalizable method that integrates partial physics knowledge into state space models (SSMs) for long-term dynamics forecasting in complex environments. Our motivation is that SSMs can effectively capture long-range dependencies in sequential data and model continuous dynamical systems, while the incorporation of physics knowledge improves generalization ability. The key challenge lies in how to seamlessly incorporate partially known physics into SSMs. To achieve this, we decompose partially known system dynamics into known and unknown state matrices, which are integrated into a Phy-SSM unit. To further enhance long-term prediction performance, we introduce a physics state regularization term to make the estimated latent states align with system dynamics. Besides, we theoretically analyze the uniqueness of the solutions for our method. Extensive experiments on three real-world applications, including vehicle motion prediction, drone state prediction, and COVID-19 epidemiology forecasting, demonstrate the superior performance of Phy-SSM over the baselines in both long-term interpolation and extrapolation tasks. The code is available at https://github.com/511205787/Phy_SSM-ICML2025.
Chinese: 本研究提出Phy-SSM方法,将部分物理知识融入状态空间模型,以提升复杂噪声环境下的长期动态预测能力,在车辆运动、无人机状态和疫情预测等实际应用中表现出卓越性能。
English: This study introduces Phy-SSM, a method that integrates partial physics knowledge into state space models to enhance long-term dynamic forecasting in complex, noisy environments, demonstrating superior performance in real-world applications like vehicle motion and epidemiology prediction.
Authors:Kristóf Müller, Janka Hatvani, Márton Ãron Goda, Miklós Koller
Abstract:
Motivation. Phonocardiography can give access to the fetal heart rate as well as direct heart sound data, and is entirely passive, using no radiation of any kind. Approach. We discuss the currently available methods for fetal heart sound detection and heart rate estimation and compare them using a common benchmarking platform and a pre-selected testing dataset. Compared to previous reviews, we evaluated the discussed methods in a standardized manner for a fair comparison. Our tests included tolerance-based detection accuracy, error rates for label insertions, deletions, and substitutions, and statistical measures for heart rate mean square error. Results. Based on our results, there is no definite best method that can achieve the highest scores in all of the tests, and simpler methods could perform comparably to more complex ones. The best model for first heart sound detection achieved 97.6% F1-score, 97.4% positive predictive value, and 12.2+-8.0 ms mean absolute error. In terms of second heart sound detection the best model had 91.4% F1-score, 91.3% positive predictive value, and 17.3+-12.2 ms mean absolute error. For fetal heart rate a 0.644 mean square error was achieved by the best method. Significance. Our main conclusion is that further standardization is required in fetal heart rate and heart sound detection method evaluation. The tests and algorithm implementations are openly available at: https://github.com/mulkr/standard-fpcg-evaluation.
中文摘要:本研究通过标准化基准测试评估多种胎儿心音检测和心率估算方法,发现没有单一最佳方法且简单方法可与复杂方法相媲美,最终得出结论认为评估过程需要进一步标准化。
English Summary: This study evaluates various fetal heart sound detection and heart rate estimation methods through standardized benchmarking, finding no single best approach while demonstrating that simpler methods can match complex ones, and concluding that further standardization in evaluation is needed.
Authors:Hsiang-Wei Huang, Jen-Hao Cheng, Kuang-Ming Chen, Cheng-Yen Yang, Bahaa Alattar, Yi-Ru Lin, Pyongkun Kim, Sangwon Kim, Kwangju Kim, Chung-I Huang, Jenq-Neng Hwang
Abstract:
Spatial understanding has been a challenging task for existing Multi-modal Large Language Models~(MLLMs). Previous methods leverage large-scale MLLM finetuning to enhance MLLM's spatial understanding ability. In this paper, we present a data-efficient approach. We propose a LLM agent system with strong and advanced spatial reasoning ability, which can be used to solve the challenging spatial question answering task in complex indoor warehouse scenarios. Our system integrates multiple tools that allow the LLM agent to conduct spatial reasoning and API tools interaction to answer the given complicated spatial question. Extensive evaluations on the 2025 AI City Challenge Physical AI Spatial Intelligence Warehouse dataset demonstrate that our system achieves high accuracy and efficiency in tasks such as object retrieval, counting, and distance estimation. The code is available at: https://github.com/hsiangwei0903/SpatialAgent
Chinese: 本文提出了一种数据高效的LLM智能体系统,具备先进的空间推理能力,通过整合多种工具在复杂仓库场景中精确解决空间问答任务,并在AI City Challenge数据集上展现出优异性能。
English: This paper introduces a data-efficient LLM agent system with advanced spatial reasoning capabilities, utilizing multiple tools to accurately solve complex spatial tasks in warehouse environments, as demonstrated by high performance on the AI City Challenge dataset.
Authors:Jeffrey Joan Sam, Janhavi Sathe, Nikhil Chigali, Naman Gupta, Radhey Ruparel, Yicheng Jiang, Janmajay Singh, James W. Berck, Arko Barman
Abstract:
Spacecraft deployed in outer space are routinely subjected to various forms of damage due to exposure to hazardous environments. In addition, there are significant risks to the subsequent process of in-space repairs through human extravehicular activity or robotic manipulation, incurring substantial operational costs. Recent developments in image segmentation could enable the development of reliable and cost-effective autonomous inspection systems. While these models often require large amounts of training data to achieve satisfactory results, publicly available annotated spacecraft segmentation data are very scarce. Here, we present a new dataset of nearly 64k annotated spacecraft images that was created using real spacecraft models, superimposed on a mixture of real and synthetic backgrounds generated using NASA's TTALOS pipeline. To mimic camera distortions and noise in real-world image acquisition, we also added different types of noise and distortion to the images. Finally, we finetuned YOLOv8 and YOLOv11 segmentation models to generate performance benchmarks for the dataset under well-defined hardware and inference time constraints to mimic real-world image segmentation challenges for real-time onboard applications in space on NASA's inspector spacecraft. The resulting models, when tested under these constraints, achieved a Dice score of 0.92, Hausdorff distance of 0.69, and an inference time of about 0.5 second. The dataset and models for performance benchmark are available at https://github.com/RiceD2KLab/SWiM.
Chinese: 为解决航天器自主检测系统训练数据匮乏的问题,开发了一个包含近6.4万张标注图像的新数据集,经过优化的YOLO模型在模拟太空环境下实现了高精度和实时性能。
English: A new dataset of nearly 64,000 annotated spacecraft images has been developed to address the scarcity of training data for autonomous inspection systems, with fine-tuned YOLO models achieving high accuracy and real-time performance under simulated space conditions.
Authors:Ryan Zarick, Isaac Zhang, Daniel Wong, Thomas Kim, Bryan Pellegrino, Mignon Li, Kelvin Wong
Abstract:
Current blockchain execution throughput is limited by data contention, reducing execution layer parallelism. Fast Ahead-of-Formation Optimization (FAFO) is the first blockchain transaction scheduler to address this problem by reordering transactions before block formation for maximum concurrency. FAFO uses CPU-optimized cache-friendly Bloom filters to efficiently detect conflicts and schedule parallel transaction execution at high throughput and low overhead.
We integrate the Rust EVM client (REVM) into FAFO and achieve over 1.1 million native ETH transfers per second and over half a million ERC20 transfers per second on a single node (Table 1), with 91% lower cost compared to state-of-the-art sharded execution. Unlike many other existing high throughput blockchain execution clients, FAFO uses QMDB to Merkleize world state after every block, enabling light clients and stateless validation for ZK-based vApps. FAFO scales with minimal synchronization overhead, scaling linearly with additional CPU resources until it fully exploits the maximum parallelism of the underlying transaction flow. FAFO proves that the high throughput necessary to support future decentralized applications can be achieved with a streamlined execution layer and innovations in blockchain transaction scheduler design. FAFO is open-sourced at https://github.com/LayerZero-Labs/fafo.
中文: FAFO是一种创新的区块链交易调度器,通过在区块形成前重新排序交易来提高执行吞吐量,实现了每秒超过110万次ETH转账,且开销极低。
English: FAFO is a groundbreaking blockchain transaction scheduler that enhances execution throughput by reordering transactions before block formation, achieving over 1.1 million ETH transfers per second with minimal overhead.
Authors:Bright Kwaku Manu, Trevor Reckell, Beckett Sterner, Petar Jevtic
Abstract:
Stochastic Petri Nets (SPNs) are an increasingly popular tool of choice for modeling discrete-event dynamics in areas such as epidemiology and systems biology, yet their parameter estimation remains challenging in general and in particular when transition rates depend on external covariates and explicit likelihoods are unavailable. We introduce a neural-surrogate (neural-network--based approximation of the posterior distribution) framework that predicts the coefficients of known covariate-dependent rate functions directly from noisy, partially observed token trajectories. Our model employs a lightweight 1D Convolutional Residual Network trained end-to-end on Gillespie-simulated SPN realizations, learning to invert system dynamics under realistic conditions of event dropout. During inference, Monte Carlo dropout provides calibrated uncertainty bounds together with point estimates. On synthetic SPNs with 20% missing events, our surrogate recovers rate-function coefficients with an RMSE = 0.108 and substantially runs faster than traditional Bayesian approaches. These results demonstrate that data-driven, likelihood-free surrogates can enable accurate, robust, and real-time parameter recovery in complex, partially observed discrete-event systems.
中文: 本研究提出了一种基于一维卷积残差网络的神经代理框架,能够高效准确地估计存在数据缺失的随机Petri网参数,在速度和精度上均优于传统贝叶斯方法。
English: The study introduces a neural-surrogate framework using a 1D Convolutional Residual Network to accurately and efficiently estimate parameters in Stochastic Petri Nets with missing data, outperforming traditional Bayesian methods in speed and precision.
Authors:Tongshun Zhang, Pingping Liu, Yubing Lu, Mengen Cai, Zijian Zhang, Zhe Zhang, Qiuzhan Zhou
Abstract:
Traditional Low-Light Image Enhancement (LLIE) methods primarily focus on uniform brightness adjustment, often neglecting instance-level semantic information and the inherent characteristics of different features. To address these limitations, we propose CWNet (Causal Wavelet Network), a novel architecture that leverages wavelet transforms for causal reasoning. Specifically, our approach comprises two key components: 1) Inspired by the concept of intervention in causality, we adopt a causal reasoning perspective to reveal the underlying causal relationships in low-light enhancement. From a global perspective, we employ a metric learning strategy to ensure causal embeddings adhere to causal principles, separating them from non-causal confounding factors while focusing on the invariance of causal factors. At the local level, we introduce an instance-level CLIP semantic loss to precisely maintain causal factor consistency. 2) Based on our causal analysis, we present a wavelet transform-based backbone network that effectively optimizes the recovery of frequency information, ensuring precise enhancement tailored to the specific attributes of wavelet transforms. Extensive experiments demonstrate that CWNet significantly outperforms current state-of-the-art methods across multiple datasets, showcasing its robust performance across diverse scenes. Code is available at https://github.com/bywlzts/CWNet-Causal-Wavelet-Network.
中文: CWNet提出了一种基于因果推理和小波变换的新架构,通过保持语义一致性和优化频率恢复来解决传统低光图像增强的局限性,在多个数据集上展现出卓越性能。
English: CWNet introduces a causal reasoning approach and wavelet transform-based architecture to address the limitations of traditional low-light image enhancement by preserving semantic consistency and optimizing frequency recovery, achieving superior performance across diverse datasets.
Authors:Ziru Liu, Cheng Gong, Xinyu Fu, Yaofang Liu, Ran Chen, Shoubo Hu, Suiyun Zhang, Rui Liu, Qingfu Zhang, Dandan Tu
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a powerful paradigm for facilitating the self-improvement of large language models (LLMs), particularly in the domain of complex reasoning tasks. However, prevailing on-policy RL methods often contend with significant training instability and inefficiency. This is primarily due to a capacity-difficulty mismatch, where the complexity of training data frequently outpaces the model's current capabilities, leading to critically sparse reward signals and stalled learning progress. This challenge is particularly acute for smaller, more resource-efficient LLMs. To overcome this, we introduce the Guided Hybrid Policy Optimization (GHPO), a novel difficulty-aware reinforcement learning framework. GHPO dynamically calibrates task difficulty by employing adaptive prompt refinement to provide targeted guidance. This unique approach adaptively balances direct imitation learning for problems currently beyond the model's reach with exploration-based reinforcement learning for more manageable tasks, effectively creating a smooth and optimized learning curriculum. Extensive experiments demonstrate that GHPO achieves an average performance gain of approximately 5% across six challenging mathematics benchmarks, consistently outperforming strong on-policy reinforcement learning and curriculum learning baselines. Further analysis confirms that our framework significantly enhances both training stability and final reasoning performance, thus offering a scalable and efficient solution for developing powerful and robust reasoning models.
中文摘要:引导式混合策略优化(GHPO)框架通过自适应提示调整动态匹配任务难度,有效解决了语言模型强化学习中的训练不稳定问题,在数学推理基准测试中实现了约5%的性能提升。
English Summary: The Guided Hybrid Policy Optimization (GHPO) framework addresses training instability in reinforcement learning for language models by dynamically adjusting task difficulty through adaptive prompt refinement, achieving significant performance gains across mathematical reasoning benchmarks.
Authors:Ruixi Zheng, Wei Zhang, Yijie Li, Xi Zhu, Zhou Lan, Jarrett Rushmore, Yogesh Rathi, Nikos Makris, Lauren J. O'Donnell, Fan Zhang
Abstract:
Diffusion MRI (dMRI) tractography is currently the only method for in vivo mapping of the brain's white matter (WM) connections. Tractometry is an advanced tractography analysis technique for along-tract profiling to investigate the morphology and microstructural properties along the fiber tracts. Tractometry has become an essential tool for studying local along-tract differences between different populations (e.g., health vs disease). In this study, we propose a novel atlas-guided fine-scale tractometry method, namely AGFS-Tractometry, that leverages tract spatial information and permutation testing to enhance the along-tract statistical analysis between populations. There are two major contributions in AGFS-Tractometry. First, we create a novel atlas-guided tract profiling template that enables consistent, fine-scale, along-tract parcellation of subject-specific fiber tracts. Second, we propose a novel nonparametric permutation testing group comparison method to enable simultaneous analysis across all along-tract parcels while correcting for multiple comparisons. We perform experimental evaluations on synthetic datasets with known group differences and in vivo real data. We compare AGFS-Tractometry with two state-of-the-art tractometry methods, including Automated Fiber-tract Quantification (AFQ) and BUndle ANalytics (BUAN). Our results show that the proposed AGFS-Tractometry obtains enhanced sensitivity and specificity in detecting local WM differences. In the real data analysis experiments, AGFS-Tractometry can identify more regions with significant differences, which are anatomically consistent with the existing literature. Overall, these demonstrate the ability of AGFS-Tractometry to detect subtle or spatially localized WM group-level differences. The created tract profiling template and related code are available at: https://github.com/ZhengRuixi/AGFS-Tractometry.git.
Chinese: 本研究提出了一种新型图谱引导的精细纤维束测量方法AGFS-Tractometry,通过精细分段和先进统计检验增强了对白质局部差异的检测能力,实验证明其相较于现有方法具有更高的敏感性和特异性。
English: This study introduces AGFS-Tractometry, a novel atlas-guided method that enhances the detection of local white matter differences in diffusion MRI tractography through fine-scale parcellation and advanced statistical testing, demonstrating superior sensitivity and specificity compared to existing techniques.
Authors:Peng Ding
Abstract:
Large Language Model (LLM) applications are increasingly relying on external tools to extend their capabilities beyond text generation. However, current tool integration approaches suffer from fragmentation, protocol limitations, and implementation complexity, leading to substantial development overhead. This paper presents Toolregistry, a protocol-agnostic tool management library that simplifies tool registration, representation, execution, and lifecycle management via a unified interface. Our evaluation demonstrates that \toolregistry achieves 60-80% reduction in tool integration code, up to 3.1x performance improvements through concurrent execution, and 100% compatibility with OpenAI function calling standards. Real-world case studies show significant improvements in development efficiency and code maintainability across diverse integration scenarios. \toolregistry is open-source and available at https://github.com/Oaklight/ToolRegistry, with comprehensive documentation at https://toolregistry.readthedocs.io/.
中文: Toolregistry作为协议无关的工具管理库,通过统一接口简化了LLM工具集成,减少60-80%代码量并提升性能,同时完全兼容OpenAI标准。
English: Toolregistry is a protocol-agnostic library that simplifies tool integration for LLMs, reducing code by 60-80% while improving performance and maintaining full OpenAI compatibility.
Authors:Kexin Gu Baugh, Vincent Perreault, Matthew Baugh, Luke Dickens, Katsumi Inoue, Alessandra Russo
Abstract:
Neural Disjunctive Normal Form (DNF) based models are powerful and interpretable approaches to neuro-symbolic learning and have shown promising results in classification and reinforcement learning settings without prior knowledge of the tasks. However, their performance is degraded by the thresholding of the post-training symbolic translation process. We show here that part of the performance degradation during translation is due to its failure to disentangle the learned knowledge represented in the form of the networks' weights. We address this issue by proposing a new disentanglement method; by splitting nodes that encode nested rules into smaller independent nodes, we are able to better preserve the models' performance. Through experiments on binary, multiclass, and multilabel classification tasks (including those requiring predicate invention), we demonstrate that our disentanglement method provides compact and interpretable logical representations for the neural DNF-based models, with performance closer to that of their pre-translation counterparts. Our code is available at https://github.com/kittykg/disentangling-ndnf-classification.
Chinese: 本研究提出了一种解缠方法,通过拆分编码嵌套规则的节点来减轻神经DNF模型在符号翻译过程中的性能损失,从而在分类任务中获得更紧凑、可解释的逻辑表示和更高的准确性。
English: The study introduces a disentanglement method that splits nodes encoding nested rules in neural DNF models to mitigate performance loss during symbolic translation, resulting in more compact and interpretable logical representations with improved accuracy across classification tasks.
Authors:Juyi Sheng, Ziyi Wang, Peiming Li, Mengyuan Liu
Abstract:
In robot manipulation, robot learning has become a prevailing approach. However, generative models within this field face a fundamental trade-off between the slow, iterative sampling of diffusion models and the architectural constraints of faster Flow-based methods, which often rely on explicit consistency losses. To address these limitations, we introduce MP1, which pairs 3D point-cloud inputs with the MeanFlow paradigm to generate action trajectories in one network function evaluation (1-NFE). By directly learning the interval-averaged velocity via the "MeanFlow Identity", our policy avoids any additional consistency constraints. This formulation eliminates numerical ODE-solver errors during inference, yielding more precise trajectories. MP1 further incorporates CFG for improved trajectory controllability while retaining 1-NFE inference without reintroducing structural constraints. Because subtle scene-context variations are critical for robot learning, especially in few-shot learning, we introduce a lightweight Dispersive Loss that repels state embeddings during training, boosting generalization without slowing inference. We validate our method on the Adroit and Meta-World benchmarks, as well as in real-world scenarios. Experimental results show MP1 achieves superior average task success rates, outperforming DP3 by 10.2% and FlowPolicy by 7.3%. Its average inference time is only 6.8 ms-19x faster than DP3 and nearly 2x faster than FlowPolicy. Our code is available at https://github.com/LogSSim/MP1.git.
Chinese: MP1提出创新的MeanFlow范式,通过直接学习区间平均速度,在单次网络评估中生成精确的机器人动作轨迹,在基准测试中实现了更高的任务成功率,推理速度比现有方法快19倍。
English: MP1 introduces a novel MeanFlow paradigm that generates precise robot action trajectories in a single network evaluation by learning interval-averaged velocity, achieving superior success rates and 19x faster inference than prior methods.
Authors:Tao Feng, Yexin Wu, Guanyu Lin, Jiaxuan You
Abstract:
World models (WMs) demonstrate strong capabilities in prediction, generation, and planning tasks. Existing WMs primarily focus on unstructured data and cannot leverage the ubiquitous structured data, often represented as graphs, in the digital world. While multiple graph foundation models have been proposed, they focus on graph learning tasks and cannot extend to diverse multi-modal data and interdisciplinary tasks. To address these challenges, we propose the Graph World Model (GWM), a world model that supports both unstructured and graph-structured states with multi-modal information and represents diverse tasks as actions. The core of a GWM is a generic message-passing algorithm to aggregate structured information, either over a unified multi-modal token space by converting multi-modal data into text (GWM-T) or a unified multi-modal embedding space by modality-specific encoders (GWM-E). Notably, GWM introduces action nodes to support diverse tasks, where action nodes are linked to other nodes via direct reference or similarity computation. Extensive experiments on six tasks from diverse domains, including multi-modal generation and matching, recommendation, graph prediction, multi-agent, retrieval-augmented generation, and planning and optimization, show that the same GWM outperforms or matches domain-specific baselines' performance, benefits from multi-hop structures, and demonstrates strong zero-shot/few-shot capabilities on unseen new tasks. Our code for GWM is released at https://github.com/ulab-uiuc/GWM.
中文: 提出的图世界模型(GWM)通过统一的消息传递框架整合非结构化和图结构的多模态数据,在六项跨领域任务中不仅超越专业基线模型,还展现出强大的零样本/小样本泛化能力。
English: The proposed Graph World Model (GWM) integrates both unstructured and graph-structured data with multi-modal information through a unified message-passing framework, demonstrating superior performance across six diverse tasks compared to specialized baselines while exhibiting strong generalization capabilities.
Authors:Qihui Yang, Taylor Berg-Kirkpatrick, Julian McAuley, Zachary Novack
Abstract:
Despite rapid progress in end-to-end AI music generation, AI-driven modeling of professional Digital Signal Processing (DSP) workflows remains challenging. In particular, while there is growing interest in neural black-box modeling of audio effect graphs (e.g. reverb, compression, equalization), AI-based approaches struggle to replicate the nuanced signal flow and parameter interactions used in professional workflows. Existing differentiable plugin approaches often diverge from real-world tools, exhibiting inferior performance relative to simplified neural controllers under equivalent computational constraints. We introduce WildFX, a pipeline containerized with Docker for generating multi-track audio mixing datasets with rich effect graphs, powered by a professional Digital Audio Workstation (DAW) backend. WildFX supports seamless integration of cross-platform commercial plugins or any plugins in the wild, in VST/VST3/LV2/CLAP formats, enabling structural complexity (e.g., sidechains, crossovers) and achieving efficient parallelized processing. A minimalist metadata interface simplifies project/plugin configuration. Experiments demonstrate the pipeline's validity through blind estimation of mixing graphs, plugin/gain parameters, and its ability to bridge AI research with practical DSP demands. The code is available on: https://github.com/IsaacYQH/WildFX.
Chinese Summary: WildFX推出基于Docker容器化的多轨音频混音数据集生成管道,通过无缝集成商业插件和高效并行处理,使AI研究能更好地模拟专业数字信号处理工作流程。
English Summary: WildFX introduces a Docker-containerized pipeline for generating multi-track audio mixing datasets with complex effect graphs, enabling AI research to better replicate professional DSP workflows through seamless plugin integration and efficient processing.
Authors:Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, Se-Young Yun
Abstract:
Scaling language models unlocks impressive capabilities, but the accompanying computational and memory demands make both training and deployment expensive. Existing efficiency efforts typically target either parameter sharing or adaptive computation, leaving open the question of how to attain both simultaneously. We introduce Mixture-of-Recursions (MoR), a unified framework that combines the two axes of efficiency inside a single Recursive Transformer. MoR reuses a shared stack of layers across recursion steps to achieve parameter efficiency, while lightweight routers enable adaptive token-level thinking by dynamically assigning different recursion depths to individual tokens. This allows MoR to focus quadratic attention computation only among tokens still active at a given recursion depth, further improving memory access efficiency by selectively caching only their key-value pairs. Beyond these core mechanisms, we also propose a KV sharing variant that reuses KV pairs from the first recursion, specifically designed to decrease prefill latency and memory footprint. Across model scales ranging from 135M to 1.7B parameters, MoR forms a new Pareto frontier: at equal training FLOPs and smaller model sizes, it significantly lowers validation perplexity and improves few-shot accuracy, while delivering higher throughput compared with vanilla and existing recursive baselines. These gains demonstrate that MoR is an effective path towards large-model quality without incurring large-model cost.
中文:Mixture-of-Recursions (MoR) 框架在递归Transformer中巧妙融合了参数共享与自适应计算,通过在不同规模模型上实现更优性能,显著降低了计算和内存开销。
English: The Mixture-of-Recursions (MoR) framework efficiently combines parameter sharing and adaptive computation within a Recursive Transformer, enabling large-model performance with reduced computational and memory costs across various model scales.
Authors:Jennifer D'Souza, Endres Keno Sander, Andrei Aioanei
Abstract:
We introduce DeepResearch$^{\text{Eco}}$, a novel agentic LLM-based system for automated scientific synthesis that supports recursive, depth- and breadth-controlled exploration of original research questions -- enhancing search diversity and nuance in the retrieval of relevant scientific literature. Unlike conventional retrieval-augmented generation pipelines, DeepResearch enables user-controllable synthesis with transparent reasoning and parameter-driven configurability, facilitating high-throughput integration of domain-specific evidence while maintaining analytical rigor. Applied to 49 ecological research questions, DeepResearch achieves up to a 21-fold increase in source integration and a 14.9-fold rise in sources integrated per 1,000 words. High-parameter settings yield expert-level analytical depth and contextual diversity.
Source code available at: https://github.com/sciknoworg/deep-research.
中文: DeepResearch$^{\text{Eco}}$ 是一种创新的基于大语言模型的系统,通过用户可控的科学合成实现了增强的搜索多样性和分析严谨性,在生态研究应用中显著提升了文献整合效率并达到专家级分析深度。
English: DeepResearch$^{\text{Eco}}$ is an innovative LLM-based system that enables automated, user-controllable scientific synthesis with enhanced search diversity and analytical rigor, achieving significant improvements in source integration and expert-level depth in ecological research applications.
Authors:Chenyu Lian, Hong-Yu Zhou, Zhanli Hu, Jing Qin
Abstract:
Retinal anomaly detection plays a pivotal role in screening ocular and systemic diseases. Despite its significance, progress in the field has been hindered by the absence of a comprehensive and publicly available benchmark, which is essential for the fair evaluation and advancement of methodologies. Due to this limitation, previous anomaly detection work related to retinal images has been constrained by (1) a limited and overly simplistic set of anomaly types, (2) test sets that are nearly saturated, and (3) a lack of generalization evaluation, resulting in less convincing experimental setups. Furthermore, existing benchmarks in medical anomaly detection predominantly focus on one-class supervised approaches (training only with negative samples), overlooking the vast amounts of labeled abnormal data and unlabeled data that are commonly available in clinical practice. To bridge these gaps, we introduce a benchmark for retinal anomaly detection, which is comprehensive and systematic in terms of data and algorithm. Through categorizing and benchmarking previous methods, we find that a fully supervised approach leveraging disentangled representations of abnormalities (DRA) achieves the best performance but suffers from significant drops in performance when encountering certain unseen anomalies. Inspired by the memory bank mechanisms in one-class supervised learning, we propose NFM-DRA, which integrates DRA with a Normal Feature Memory to mitigate the performance degradation, establishing a new SOTA. The benchmark is publicly available at https://github.com/DopamineLcy/BenchReAD.
Chinese: 该摘要针对视网膜异常检测领域缺乏全面基准的问题,提出了一个系统性基准,并开发了NFM-DRA新方法,通过结合异常解耦表示与正常特征记忆机制,有效提升了检测性能并建立了新的技术标杆。
English: This abstract introduces a comprehensive benchmark for retinal anomaly detection to address the limitations of previous methods, proposing a novel approach called NFM-DRA that integrates disentangled representations of abnormalities with a normal feature memory to achieve state-of-the-art performance.
Authors:İsmail Tarım, AytuÄ Onan
Abstract:
The rapid advancement of large language models (LLMs) has raised concerns about reliably detecting AI-generated text. Stylometric metrics work well on autoregressive (AR) outputs, but their effectiveness on diffusion-based models is unknown. We present the first systematic comparison of diffusion-generated text (LLaDA) and AR-generated text (LLaMA) using 2 000 samples. Perplexity, burstiness, lexical diversity, readability, and BLEU/ROUGE scores show that LLaDA closely mimics human text in perplexity and burstiness, yielding high false-negative rates for AR-oriented detectors. LLaMA shows much lower perplexity but reduced lexical fidelity. Relying on any single metric fails to separate diffusion outputs from human writing. We highlight the need for diffusion-aware detectors and outline directions such as hybrid models, diffusion-specific stylometric signatures, and robust watermarking.
中文: 大语言模型的扩散生成文本(如LLaDA)在困惑度和突发性上高度模仿人类写作,能规避基于自回归的检测方法,亟需开发扩散感知检测器和混合策略。
English: Large language models' diffusion-generated text like LLaDA closely mimics human writing in perplexity and burstiness, evading detection by autoregressive-focused methods, necessitating the development of diffusion-aware detectors and hybrid approaches.
Authors:Zhicun Yin, Junjie Chen, Ming Liu, Zhixin Wang, Fan Li, Renjing Pei, Xiaoming Li, Rynson W. H. Lau, Wangmeng Zuo
Abstract:
Blind facial image restoration is highly challenging due to unknown complex degradations and the sensitivity of humans to faces. Although existing methods introduce auxiliary information from generative priors or high-quality reference images, they still struggle with identity preservation problems, mainly due to improper feature introduction on detailed textures. In this paper, we focus on effectively incorporating appropriate features from high-quality reference images, presenting a novel blind facial image restoration method that considers reference selection, transfer, and reconstruction (RefSTAR). In terms of selection, we construct a reference selection (RefSel) module. For training the RefSel module, we construct a RefSel-HQ dataset through a mask generation pipeline, which contains annotating masks for 10,000 ground truth-reference pairs. As for the transfer, due to the trivial solution in vanilla cross-attention operations, a feature fusion paradigm is designed to force the features from the reference to be integrated. Finally, we propose a reference image reconstruction mechanism that further ensures the presence of reference image features in the output image. The cycle consistency loss is also redesigned in conjunction with the mask. Extensive experiments on various backbone models demonstrate superior performance, showing better identity preservation ability and reference feature transfer quality. Source code, dataset, and pre-trained models are available at https://github.com/yinzhicun/RefSTAR.
中文摘要:本文提出RefSTAR方法,通过构建参考图像选择模块、特征融合范式和重建机制,有效利用高质量参考图像特征进行盲人脸图像恢复,显著提升了身份保持能力和特征迁移质量。
English Summary: This paper introduces RefSTAR, a novel blind facial image restoration method that effectively incorporates features from high-quality reference images through specialized selection, transfer, and reconstruction mechanisms to better preserve identity and transfer quality.
Authors:Yingqian Wu, Qiushi Wang, Zefei Long, Rong Ye, Zhongtian Lu, Xianyin Zhang, Bingxuan Li, Wei Chen, Liwen Zhang, Zhongyu Wei
Abstract:
Financial report generation tasks range from macro- to micro-economics analysis, also requiring extensive data analysis. Existing LLM models are usually fine-tuned on simple QA tasks and cannot comprehensively analyze real financial scenarios. Given the complexity, financial companies often distribute tasks among departments. Inspired by this, we propose FinTeam, a financial multi-agent collaborative system, with a workflow with four LLM agents: document analyzer, analyst, accountant, and consultant. We train these agents with specific financial expertise using constructed datasets. We evaluate FinTeam on comprehensive financial tasks constructed from real online investment forums, including macroeconomic, industry, and company analysis. The human evaluation shows that by combining agents, the financial reports generate from FinTeam achieved a 62.00% acceptance rate, outperforming baseline models like GPT-4o and Xuanyuan. Additionally, FinTeam's agents demonstrate a 7.43% average improvement on FinCUGE and a 2.06% accuracy boost on FinEval. Project is available at https://github.com/FudanDISC/DISC-FinLLM/.
中文摘要:FinTeam是一个多智能体协作系统,通过四个专业LLM代理的协同工作,在金融报告生成任务中表现优于现有模型,人工评估接受率达62%,并在金融基准测试中取得显著精度提升。
English Summary: FinTeam is a multi-agent system using four specialized LLM agents that outperforms existing models in financial report generation, achieving a 62% human acceptance rate and improved accuracy on financial benchmarks.
Authors:Shanshan Zhong, Jiawei Peng, Zehan Zheng, Zhongzhan Huang, Wufei Ma, Guofeng Zhang, Qihao Liu, Alan Yuille, Jieneng Chen
Abstract:
Existing methods for reconstructing animatable 3D animals from videos typically rely on sparse semantic keypoints to fit parametric models. However, obtaining such keypoints is labor-intensive, and keypoint detectors trained on limited animal data are often unreliable. To address this, we propose 4D-Animal, a novel framework that reconstructs animatable 3D animals from videos without requiring sparse keypoint annotations. Our approach introduces a dense feature network that maps 2D representations to SMAL parameters, enhancing both the efficiency and stability of the fitting process. Furthermore, we develop a hierarchical alignment strategy that integrates silhouette, part-level, pixel-level, and temporal cues from pre-trained 2D visual models to produce accurate and temporally coherent reconstructions across frames. Extensive experiments demonstrate that 4D-Animal outperforms both model-based and model-free baselines. Moreover, the high-quality 3D assets generated by our method can benefit other 3D tasks, underscoring its potential for large-scale applications. The code is released at https://github.com/zhongshsh/4D-Animal.
Chinese: 4D-Animal框架通过密集特征网络和分层对齐策略,无需稀疏关键点标注即可从视频中重建可动画的3D动物模型,其性能优于现有方法,生成的高质量资源可应用于其他3D任务。
English: The 4D-Animal framework reconstructs animatable 3D animals from videos without sparse keypoint annotations by using a dense feature network and hierarchical alignment strategy, outperforming existing methods and generating high-quality assets for other 3D tasks.
Authors:Qiang Li, Qingsen Yan, Haojian Huang, Peng Wu, Haokui Zhang, Yanning Zhang
Abstract:
With the rapid advancements in Artificial Intelligence Generated Image (AGI) technology, the accurate assessment of their quality has become an increasingly vital requirement. Prevailing methods typically rely on cross-modal models like CLIP or BLIP to evaluate text-image alignment and visual quality. However, when applied to AGIs, these methods encounter two primary challenges: semantic misalignment and details perception missing. To address these limitations, we propose Text-Visual Semantic Constrained AI-Generated Image Quality Assessment (SC-AGIQA), a unified framework that leverages text-visual semantic constraints to significantly enhance the comprehensive evaluation of both text-image consistency and perceptual distortion in AI-generated images. Our approach integrates key capabilities from multiple models and tackles the aforementioned challenges by introducing two core modules: the Text-assisted Semantic Alignment Module (TSAM), which leverages Multimodal Large Language Models (MLLMs) to bridge the semantic gap by generating an image description and comparing it against the original prompt for a refined consistency check, and the Frequency-domain Fine-Grained Degradation Perception Module (FFDPM), which draws inspiration from Human Visual System (HVS) properties by employing frequency domain analysis combined with perceptual sensitivity weighting to better quantify subtle visual distortions and enhance the capture of fine-grained visual quality details in images. Extensive experiments conducted on multiple benchmark datasets demonstrate that SC-AGIQA outperforms existing state-of-the-art methods. The code is publicly available at https://github.com/mozhu1/SC-AGIQA.
中文: 提出的SC-AGIQA框架通过文本辅助语义对齐模块提升图文一致性,结合频域细粒度退化感知模块增强视觉失真量化,有效解决了AI生成图像评估中的语义错位和细节感知缺失问题,在多个基准数据集上表现出超越现有方法的优越性能。
English: The proposed SC-AGIQA framework addresses semantic misalignment and detail perception issues in AI-generated image assessment by integrating a Text-assisted Semantic Alignment Module for improved text-image consistency and a Frequency-domain Fine-Grained Degradation Perception Module for enhanced visual distortion quantification, demonstrating superior performance over existing methods.
Authors:Utkarsh Singhal, Ryan Feng, Stella X. Yu, Atul Prakash
Abstract:
Perception in the real world requires robustness to diverse viewing conditions. Existing approaches often rely on specialized architectures or training with predefined data augmentations, limiting adaptability. Taking inspiration from mental rotation in human vision, we propose FOCAL, a test-time robustness framework that transforms the input into the most typical view. At inference time, FOCAL explores a set of transformed images and chooses the one with the highest likelihood under foundation model priors. This test-time optimization boosts robustness while requiring no retraining or architectural changes. Applied to models like CLIP and SAM, it significantly boosts robustness across a wide range of transformations, including 2D and 3D rotations, contrast and lighting shifts, and day-night changes. We also explore potential applications in active vision. By reframing invariance as a test-time optimization problem, FOCAL offers a general and scalable approach to robustness. Our code is available at: https://github.com/sutkarsh/focal.
中文: FOCAL是一种测试时鲁棒性框架,通过将输入转换为最优视角并利用基础模型先验,无需重新训练即可显著提升模型对多种变换的适应能力。
English: FOCAL is a test-time robustness framework that transforms inputs into optimal views using foundation model priors, enhancing adaptability to various transformations without retraining.
Authors:Mohammed Bouri, Adnane Saoud
Abstract:
Despite advancements in Natural Language Processing (NLP), models remain vulnerable to adversarial attacks, such as synonym substitutions. While prior work has focused on improving robustness for feed-forward and convolutional architectures, the robustness of recurrent networks and modern state space models (SSMs), such as S4, remains understudied. These architectures pose unique challenges due to their sequential processing and complex parameter dynamics. In this paper, we introduce a novel regularization technique based on Growth Bound Matrices (GBM) to improve NLP model robustness by reducing the impact of input perturbations on model outputs. We focus on computing the GBM for three architectures: Long Short-Term Memory (LSTM), State Space models (S4), and Convolutional Neural Networks (CNN). Our method aims to (1) enhance resilience against word substitution attacks, (2) improve generalization on clean text, and (3) providing the first systematic analysis of SSM (S4) robustness. Extensive experiments across multiple architectures and benchmark datasets demonstrate that our method improves adversarial robustness by up to 8.8% over existing baselines. These results highlight the effectiveness of our approach, outperforming several state-of-the-art methods in adversarial defense. Codes are available at https://github.com/BouriMohammed/GBM
中文: 本文提出了一种基于增长边界矩阵的新型正则化方法,旨在增强自然语言处理模型对抗攻击的鲁棒性,在LSTM、S4和CNN等多种架构上实现了高达8.8%的防御性能提升。
English: This paper introduces a novel regularization method using Growth Bound Matrices to enhance NLP model robustness against adversarial attacks, achieving up to 8.8% improvement in resilience across multiple architectures including LSTM, S4, and CNN.
Authors:Jiahe Zhao, Rongkun Zheng, Yi Wang, Helin Wang, Hengshuang Zhao
Abstract:
In video Multimodal Large Language Models (video MLLMs), the visual encapsulation process plays a pivotal role in converting video contents into representative tokens for LLM input. While linear projectors are widely employed for encapsulation, they introduce semantic indistinctness and temporal incoherence when applied to videos. Conversely, the structure of resamplers shows promise in tackling these challenges, but an effective solution remains unexplored. Drawing inspiration from resampler structures, we introduce DisCo, a novel visual encapsulation method designed to yield semantically distinct and temporally coherent visual tokens for video MLLMs. DisCo integrates two key components: (1) A Visual Concept Discriminator (VCD) module, assigning unique semantics for visual tokens by associating them in pair with discriminative concepts in the video. (2) A Temporal Focus Calibrator (TFC) module, ensuring consistent temporal focus of visual tokens to video elements across every video frame. Through extensive experiments on multiple video MLLM frameworks, we demonstrate that DisCo remarkably outperforms previous state-of-the-art methods across a variety of video understanding benchmarks, while also achieving higher token efficiency thanks to the reduction of semantic indistinctness. The code: https://github.com/ZJHTerry18/DisCo.
中文摘要:DisCo是一种新颖的视频多模态大语言模型视觉封装方法,通过视觉概念判别器和时序焦点校准器组件,生成语义清晰且时序连贯的视觉标记,在多项视频理解基准测试中显著优于现有方法并实现更高的标记效率。
English Summary: DisCo is a novel visual encapsulation method for video MLLMs that enhances semantic distinctness and temporal coherence through its Visual Concept Discriminator and Temporal Focus Calibrator components, achieving superior performance on video understanding benchmarks with higher token efficiency.
Authors:Muyi Bao, Changyu Zeng, Yifan Wang, Zhengni Yang, Zimu Wang, Guangliang Cheng, Jun Qi, Wei Wang
Abstract:
Transformer-based deep neural networks have achieved remarkable success across various computer vision tasks, largely attributed to their long-range self-attention mechanism and scalability. However, most transformer architectures embed images into uniform, grid-based vision tokens, neglecting the underlying semantic meanings of image regions, resulting in suboptimal feature representations. To address this issue, we propose Fuzzy Token Clustering Transformer (FTCFormer), which incorporates a novel clustering-based downsampling module to dynamically generate vision tokens based on the semantic meanings instead of spatial positions. It allocates fewer tokens to less informative regions and more to represent semantically important regions, regardless of their spatial adjacency or shape irregularity. To further enhance feature extraction and representation, we propose a Density Peak Clustering-Fuzzy K-Nearest Neighbor (DPC-FKNN) mechanism for clustering center determination, a Spatial Connectivity Score (SCS) for token assignment, and a channel-wise merging (Cmerge) strategy for token merging. Extensive experiments on 32 datasets across diverse domains validate the effectiveness of FTCFormer on image classification, showing consistent improvements over the TCFormer baseline, achieving gains of improving 1.43% on five fine-grained datasets, 1.09% on six natural image datasets, 0.97% on three medical datasets and 0.55% on four remote sensing datasets. The code is available at: https://github.com/BaoBao0926/FTCFormer/tree/main.
中文: FTCFormer通过基于聚类的下采样模块,依据语义重要性而非空间位置动态生成视觉标记,从而优化特征表示,并在32个不同领域的图像数据集上实现了稳定的性能提升。
English: The FTCFormer introduces a clustering-based downsampling module that dynamically generates vision tokens based on semantic importance rather than spatial grids, enhancing feature representation and achieving consistent performance gains across 32 diverse image datasets.
Authors:Jinglun Li, Kaixun Jiang, Zhaoyu Chen, Bo Lin, Yao Tang, Weifeng Ge, Wenqiang Zhang
Abstract:
Pre-trained vision-language models have exhibited remarkable abilities in detecting out-of-distribution (OOD) samples. However, some challenging OOD samples, which lie close to in-distribution (InD) data in image feature space, can still lead to misclassification. The emergence of foundation models like diffusion models and multimodal large language models (MLLMs) offers a potential solution to this issue. In this work, we propose SynOOD, a novel approach that harnesses foundation models to generate synthetic, challenging OOD data for fine-tuning CLIP models, thereby enhancing boundary-level discrimination between InD and OOD samples. Our method uses an iterative in-painting process guided by contextual prompts from MLLMs to produce nuanced, boundary-aligned OOD samples. These samples are refined through noise adjustments based on gradients from OOD scores like the energy score, effectively sampling from the InD/OOD boundary. With these carefully synthesized images, we fine-tune the CLIP image encoder and negative label features derived from the text encoder to strengthen connections between near-boundary OOD samples and a set of negative labels. Finally, SynOOD achieves state-of-the-art performance on the large-scale ImageNet benchmark, with minimal increases in parameters and runtime. Our approach significantly surpasses existing methods, and the code is available at https://github.com/Jarvisgivemeasuit/SynOOD.
中文: SynOOD利用基础模型通过迭代修复和噪声优化生成合成的边界分布外样本,有效提升了CLIP模型对分布内外数据的区分能力,在ImageNet基准测试中以最小计算成本实现了最优性能。
English: SynOOD leverages foundation models to generate synthetic boundary OOD samples through iterative in-painting and noise refinement, enhancing CLIP's discrimination between in-distribution and out-of-distribution data while achieving state-of-the-art performance on ImageNet with minimal computational overhead.
Authors:Xiangyu Yin, Boyuan Yang, Weichen Liu, Qiyao Xue, Abrar Alamri, Goeran Fiedler, Wei Gao
Abstract:
Prosthetic legs play a pivotal role in clinical rehabilitation, allowing individuals with lower-limb amputations the ability to regain mobility and improve their quality of life. Gait analysis is fundamental for optimizing prosthesis design and alignment, directly impacting the mobility and life quality of individuals with lower-limb amputations. Vision-based machine learning (ML) methods offer a scalable and non-invasive solution to gait analysis, but face challenges in correctly detecting and analyzing prosthesis, due to their unique appearances and new movement patterns. In this paper, we aim to bridge this gap by introducing a multi-purpose dataset, namely ProGait, to support multiple vision tasks including Video Object Segmentation, 2D Human Pose Estimation, and Gait Analysis (GA). ProGait provides 412 video clips from four above-knee amputees when testing multiple newly-fitted prosthetic legs through walking trials, and depicts the presence, contours, poses, and gait patterns of human subjects with transfemoral prosthetic legs. Alongside the dataset itself, we also present benchmark tasks and fine-tuned baseline models to illustrate the practical application and performance of the ProGait dataset. We compared our baseline models against pre-trained vision models, demonstrating improved generalizability when applying the ProGait dataset for prosthesis-specific tasks. Our code is available at https://github.com/pittisl/ProGait and dataset at https://huggingface.co/datasets/ericyxy98/ProGait.
中文: 本文提出ProGait数据集,旨在解决基于视觉的假肢步态分析难题,通过提供视频数据和基准测试来提升假肢特定运动的检测与分析能力。
English: This paper introduces the ProGait dataset to address challenges in vision-based gait analysis for prosthetic legs, providing video data and benchmarks to improve detection and analysis of prosthesis-specific movements.
Authors:Shicai Wei, Chunbo Luo, Yang Luo
Abstract:
Multimodal learning often encounters the under-optimized problem and may have worse performance than unimodal learning. Existing methods attribute this problem to the imbalanced learning between modalities and rebalance them through gradient modulation. However, they fail to explain why the dominant modality in multimodal models also underperforms that in unimodal learning. In this work, we reveal the optimization conflict between the modality encoder and modality fusion module in multimodal models. Specifically, we prove that the cross-modal fusion in multimodal models decreases the gradient passed back to each modality encoder compared with unimodal models. Consequently, the performance of each modality in the multimodal model is inferior to that in the unimodal model. To this end, we propose a disentangled gradient learning (DGL) framework to decouple the optimization of the modality encoder and modality fusion module in the multimodal model. DGL truncates the gradient back-propagated from the multimodal loss to the modality encoder and replaces it with the gradient from unimodal loss. Besides, DGL removes the gradient back-propagated from the unimodal loss to the modality fusion module. This helps eliminate the gradient interference between the modality encoder and modality fusion module while ensuring their respective optimization processes. Finally, extensive experiments on multiple types of modalities, tasks, and frameworks with dense cross-modal interaction demonstrate the effectiveness and versatility of the proposed DGL. Code is available at \href{https://github.com/shicaiwei123/ICCV2025-GDL}{https://github.com/shicaiwei123/ICCV2025-GDL}
Chinese: 本研究揭示了多模态学习中模态编码器与融合模块间的优化冲突,提出解耦梯度学习(DGL)框架,通过分离两者的训练过程,在多种任务和模态中有效提升了性能。
English: This study identifies optimization conflicts between modality encoders and fusion modules in multimodal learning, proposing a Disentangled Gradient Learning (DGL) framework to decouple their training and enhance performance across diverse tasks and modalities.
Authors:Huai-Qian Khor, Yante Li, Xingxun Jiang, Guoying Zhao
Abstract:
How much does ethnicity play its part in emotional expression? Emotional expression and micro-expression research probe into understanding human psychological responses to emotional stimuli, thereby revealing substantial hidden yet authentic emotions that can be useful in the event of diagnosis and interviews. While increased attention had been provided to micro-expression analysis, the studies were done under Ekman's assumption of emotion universality, where emotional expressions are identical across cultures and social contexts. Our computational study uncovers some of the influences of ethnic background in expression analysis, leading to an argument that the emotional universality hypothesis is an overgeneralization from the perspective of manual psychological analysis. In this research, we propose to investigate the level of influence of ethnicity in a simulated micro-expression scenario. We construct a cross-cultural micro-expression database and algorithmically annotate the ethnic labels to facilitate the investigation. With the ethnically annotated dataset, we perform a prima facie study to compare mono-ethnicity and stereo-ethnicity in a controlled environment, which uncovers a certain influence of ethnic bias via an experimental way. Building on this finding, we propose a framework that integrates ethnic context into the emotional feature learning process, yielding an ethnically aware framework that recognises ethnicity differences in micro-expression recognition. For improved understanding, qualitative analyses have been done to solidify the preliminary investigation into this new realm of research. Code is publicly available at https://github.com/IcedDoggie/ICMEW2025_EthnicMER
Chinese: 本研究通过揭示微表情分析中的种族影响因素,质疑了情感普遍性假说,并提出一个融合种族背景的识别框架以提升分析准确性。
English: This study challenges the universality of emotional expressions by demonstrating ethnic influences in micro-expression analysis and proposes an ethnically-aware framework to improve recognition accuracy.
Authors:Shicai Wei, Chunbo Luo, Yang Luo
Abstract:
Multimodal learning often encounters the under-optimized problem and may perform worse than unimodal learning. Existing approaches attribute this issue to imbalanced learning across modalities and tend to address it through gradient balancing. However, this paper argues that balanced learning is not the optimal setting for multimodal learning. With bias-variance analysis, we prove that imbalanced dependency on each modality obeying the inverse ratio of their variances contributes to optimal performance. To this end, we propose the Asymmetric Representation Learning(ARL) strategy to assist multimodal learning via imbalanced optimization. ARL introduces auxiliary regularizers for each modality encoder to calculate their prediction variance. ARL then calculates coefficients via the unimodal variance to re-weight the optimization of each modality, forcing the modality dependence ratio to be inversely proportional to the modality variance ratio. Moreover, to minimize the generalization error, ARL further introduces the prediction bias of each modality and jointly optimizes them with multimodal loss. Notably, all auxiliary regularizers share parameters with the multimodal model and rely only on the modality representation. Thus the proposed ARL strategy introduces no extra parameters and is independent of the structures and fusion methods of the multimodal model. Finally, extensive experiments on various datasets validate the effectiveness and versatility of ARL. Code is available at \href{https://github.com/shicaiwei123/ICCV2025-ARL}{https://github.com/shicaiwei123/ICCV2025-ARL}
中文摘要:本文提出非对称表征学习(ARL)策略,通过基于模态方差反比的非平衡优化突破传统梯度平衡的局限,在无需增加参数的情况下实现多模态学习的最优性能,并在多个数据集上验证了其有效性。
English Summary: This paper challenges the conventional gradient balancing approach in multimodal learning by proposing Asymmetric Representation Learning (ARL), which achieves optimal performance through imbalanced modality optimization inversely proportional to their variances, validated across multiple datasets without adding parameters.
Authors:Jaeseong Lee, Yeeun Choi, Heechan Choi, Hanjung Kim, Seonjoo Kim
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in vision-language understanding, reasoning, and generation. However, they struggle with tasks requiring fine-grained localization and reasoning in high-resolution images. This constraint stems from the fact that MLLMs are fine-tuned with fixed image resolution to align with the pre-trained image encoder used in MLLM. Consequently, feeding high-resolution images directly into MLLMs leads to poor generalization due to a train-test resolution discrepancy, while downsampling these images-although ensuring consistency-compromises fine-grained visual details and ultimately degrades performance. To address this challenge, we propose Extract Candidate then Predict (ECP), a novel training-free, task-agnostic two-stage framework designed to enhance MLLM performance on high-resolution images. The key intuition behind ECP is that while MLLMs struggle with high-resolution images, their predictions on downsampled images still contain implicit localization cues. By first identifying candidate region using the coarse prediction and then predicting the final output based on candidate region, ECP effectively preserves fine-grained details while mitigating the challenges posed by high-resolution data. We validate our framework on 4K GUI grounding and 4K, 8K MLLM perception, achieving +21.3%, +5.8%, +5.2% absolute improvement compared to baseline respectively, demonstrating its effectiveness. Code is available at https://github.com/yenncye/ECP.
中文: 提出的ECP框架通过利用粗略预测识别候选区域,无需额外训练即可增强多模态大语言模型在高分辨率图像上的性能,有效保留细节信息。
English: The proposed Extract Candidate then Predict (ECP) framework enhances Multimodal Large Language Models' performance on high-resolution images by leveraging coarse predictions to identify candidate regions, preserving fine-grained details without requiring additional training.
Authors:Shuyu Yang, Yaxiong Wang, Yongrui Li, Li Zhu, Zhedong Zheng
Abstract:
In this work, we focus on text-based person retrieval, which aims to identify individuals based on textual descriptions. Given the significant privacy issues and the high cost associated with manual annotation, synthetic data has become a popular choice for pretraining models, leading to notable advancements. However, the considerable domain gap between synthetic pretraining datasets and real-world target datasets, characterized by differences in lighting, color, and viewpoint, remains a critical obstacle that hinders the effectiveness of the pretrain-finetune paradigm. To bridge this gap, we introduce a unified text-based person retrieval pipeline considering domain adaptation at both image and region levels. In particular, it contains two primary components, i.e., Domain-aware Diffusion (DaD) for image-level adaptation and Multi-granularity Relation Alignment (MRA) for region-level adaptation. As the name implies, Domain-aware Diffusion is to migrate the distribution of images from the pretraining dataset domain to the target real-world dataset domain, e.g., CUHK-PEDES. Subsequently, MRA performs a meticulous region-level alignment by establishing correspondences between visual regions and their descriptive sentences, thereby addressing disparities at a finer granularity. Extensive experiments show that our dual-level adaptation method has achieved state-of-the-art results on the CUHK-PEDES, ICFG-PEDES, and RSTPReid datasets, outperforming existing methodologies. The dataset, model, and code are available at https://github.com/Shuyu-XJTU/MRA.
中文: 本研究提出了一种统一的基于文本的行人检索流程,通过图像级和区域级的双重领域自适应方法,有效缩小了合成预训练数据与真实数据集之间的领域差异,在多个基准数据集上取得了最优性能。
English: This study introduces a unified text-based person retrieval pipeline that addresses the domain gap between synthetic pretraining data and real-world datasets through dual-level domain adaptation, achieving state-of-the-art performance on benchmark datasets.
Authors:Alireza Dizaji, Benedict Aaron Tjandra, Mehrab Hamidi, Shenyang Huang, Guillaume Rabusseau
Abstract:
Dynamic graph learning methods have recently emerged as powerful tools for modelling relational data evolving through time. However, despite extensive benchmarking efforts, it remains unclear whether current Temporal Graph Neural Networks (TGNNs) effectively capture core temporal patterns such as periodicity, cause-and-effect, and long-range dependencies. In this work, we introduce the Temporal Graph Reasoning Benchmark (T-GRAB), a comprehensive set of synthetic tasks designed to systematically probe the capabilities of TGNNs to reason across time. T-GRAB provides controlled, interpretable tasks that isolate key temporal skills: counting/memorizing periodic repetitions, inferring delayed causal effects, and capturing long-range dependencies over both spatial and temporal dimensions. We evaluate 11 temporal graph learning methods on these tasks, revealing fundamental shortcomings in their ability to generalize temporal patterns. Our findings offer actionable insights into the limitations of current models, highlight challenges hidden by traditional real-world benchmarks, and motivate the development of architectures with stronger temporal reasoning abilities. The code for T-GRAB can be found at: https://github.com/alirezadizaji/T-GRAB.
Chinese: 本文提出了T-GRAB合成基准测试,揭示了当前时序图神经网络在捕捉周期性和因果性等核心时序模式方面的能力缺陷,为开发更强时序推理能力的模型提供了新方向。
English: This paper introduces T-GRAB, a synthetic benchmark that reveals current Temporal Graph Neural Networks' limitations in capturing core temporal patterns like periodicity and causality, despite their widespread use for dynamic relational data.
Authors:Ruihao Gong, Shihao Bai, Siyu Wu, Yunqian Fan, Zaijun Wang, Xiuhong Li, Hailong Yang, Xianglong Liu
Abstract:
The exploration and application of Large Language Models (LLMs) is thriving. To reduce deployment costs, continuous batching has become an essential feature in current service frameworks. The effectiveness of continuous batching relies on an accurate estimate of the memory requirements of requests. However, due to the diversity in request output lengths, existing frameworks tend to adopt aggressive or conservative schedulers, which often result in significant overestimation or underestimation of memory consumption. Consequently, they suffer from harmful request evictions or prolonged queuing times, failing to achieve satisfactory throughput under strict Service Level Agreement (SLA) guarantees (a.k.a. goodput), across various LLM application scenarios with differing input-output length distributions. To address this issue, we propose a novel Past-Future scheduler that precisely estimates the peak memory resources required by the running batch via considering the historical distribution of request output lengths and calculating memory occupancy at each future time point. It adapts to applications with all types of input-output length distributions, balancing the trade-off between request queuing and harmful evictions, thereby consistently achieving better goodput. Furthermore, to validate the effectiveness of the proposed scheduler, we developed a high-performance LLM serving framework, LightLLM, that implements the Past-Future scheduler. Compared to existing aggressive or conservative schedulers, LightLLM demonstrates superior goodput, achieving up to 2-3$\times$ higher goodput than other schedulers under heavy loads. LightLLM is open source to boost the research in such direction (https://github.com/ModelTC/lightllm).
中文摘要:Past-Future调度器通过分析历史输出长度分布和计算未来内存占用,精确预估内存需求,使LightLLM框架在各种大语言模型应用场景中实现了比其他调度器显著更高的优质吞吐量。
English Summary: The Past-Future scheduler accurately estimates memory needs by analyzing historical output length distributions and future memory usage, enabling the LightLLM framework to achieve significantly higher goodput than existing schedulers across diverse LLM applications.
Authors:Zhonglin Liu
Abstract:
Innate resistance to anti-PD-1 immunotherapy remains a major clinical challenge in metastatic melanoma, with the underlying molecular networks being poorly understood. To address this, we constructed a dynamic Probabilistic Boolean Network model using transcriptomic data from patient tumor biopsies to elucidate the regulatory logic governing therapy response. We then employed a reinforcement learning agent to systematically discover optimal, multi-step therapeutic interventions and used explainable artificial intelligence to mechanistically interpret the agent's control policy. The analysis revealed that a precisely timed, 4-step temporary inhibition of the lysyl oxidase like 2 protein (LOXL2) was the most effective strategy. Our explainable analysis showed that this ''hit-and-run" intervention is sufficient to erase the molecular signature driving resistance, allowing the network to self-correct without requiring sustained intervention. This study presents a novel, time-dependent therapeutic hypothesis for overcoming immunotherapy resistance and provides a powerful computational framework for identifying non-obvious intervention protocols in complex biological systems.
Chinese: 本研究利用患者数据构建计算模型,通过强化学习发现精确时机的四步LOXL2蛋白抑制是克服转移性黑色素瘤抗PD-1免疫治疗耐药的最有效策略,该临时干预能消除耐药分子特征并使系统自我修复。
English: This study developed a computational model using patient data and reinforcement learning to identify a precisely timed, four-step inhibition of LOXL2 as the most effective strategy to overcome anti-PD-1 resistance in metastatic melanoma, demonstrating that this temporary intervention erases resistance mechanisms and allows the system to self-correct.
Authors:Ivan MartinoviÄ, Josip Å ariÄ, Marin OrÅ¡iÄ, Matej Kristan, SiniÅ¡a Å egviÄ
Abstract:
Pixel-level annotation is expensive and time-consuming. Semi-supervised segmentation methods address this challenge by learning models on few labeled images alongside a large corpus of unlabeled images. Although foundation models could further account for label scarcity, effective mechanisms for their exploitation remain underexplored. We address this by devising a novel semi-supervised panoptic approach fueled by two dedicated foundation models. We enhance recognition by complementing unsupervised mask-transformer consistency with zero-shot classification of CLIP features. We enhance localization by class-agnostic decoder warm-up with respect to SAM pseudo-labels. The resulting decoupled enhancement of recognition and localization (DEARLi) particularly excels in the most challenging semi-supervised scenarios with large taxonomies and limited labeled data. Moreover, DEARLi outperforms the state of the art in semi-supervised semantic segmentation by a large margin while requiring 8x less GPU memory, in spite of being trained only for the panoptic objective. We observe 29.9 PQ and 38.9 mIoU on ADE20K with only 158 labeled images. The source code is available at https://github.com/helen1c/DEARLi.
中文: DEARLi提出了一种新颖的半监督全景分割方法,通过分别利用基础模型增强识别和定位能力,在标注数据稀缺的情况下表现出色,并以更少的GPU内存大幅超越现有技术水平。
English: DEARLi introduces a novel semi-supervised panoptic segmentation method that enhances recognition and localization separately using foundation models, achieving state-of-the-art results with significantly less GPU memory and excelling in scenarios with limited labeled data.
Authors:Bingchao Wang, Zhiwei Ning, Jianyu Ding, Xuanang Gao, Yin Li, Dongsheng Jiang, Jie Yang, Wei Liu
Abstract:
CLIP has shown promising performance across many short-text tasks in a zero-shot manner. However, limited by the input length of the text encoder, CLIP struggles on under-stream tasks with long-text inputs ($>77$ tokens). To remedy this issue, we propose FIX-CLIP, which includes three novel modules: (1) A dual-branch training pipeline that aligns short and long texts with masked and raw images, respectively, which boosts the long-text representation while preserving the short-text ability. (2) Multiple learnable regional prompts with unidirectional masks in Transformer layers for regional information extraction. (3) A hierarchical feature alignment module in the intermediate encoder layers to promote the consistency of multi-scale features. Furthermore, we collect 30M images and utilize existing MLLMs to synthesize long-text captions for training. Extensive experiments show that FIX-CLIP achieves state-of-the-art performance on both long-text and short-text retrieval benchmarks. For downstream applications, we reveal that FIX-CLIP's text encoder delivers promising performance in a plug-and-play manner for diffusion models with long-text input. The code is available at https://github.com/bcwang-sjtu/Fix-CLIP.
中文: FIX-CLIP通过双分支训练、区域提示和分层特征对齐增强CLIP的长文本处理能力,在长短文本检索任务中均达到领先水平。
English: FIX-CLIP enhances CLIP's capability for long-text tasks through dual-branch training, regional prompts, and hierarchical feature alignment, achieving state-of-the-art performance in both long and short-text retrieval.
Authors:Meng Yu, Kun Zhan
Abstract:
Diffusion models exhibit impressive generative capabilities but are significantly impacted by exposure bias. In this paper, we make a key observation: the energy of predicted noisy samples in the reverse process continuously declines compared to perturbed samples in the forward process. Building on this, we identify two important findings: 1) The reduction in energy follows distinct patterns in the low-frequency and high-frequency subbands; 2) The subband energy of reverse-process reconstructed samples is consistently lower than that of forward-process ones, and both are lower than the original data samples. Based on the first finding, we introduce a dynamic frequency regulation mechanism utilizing wavelet transforms, which separately adjusts the low- and high-frequency subbands. Leveraging the second insight, we derive the rigorous mathematical form of exposure bias. It is worth noting that, our method is training-free and plug-and-play, significantly improving the generative quality of various diffusion models and frameworks with negligible computational cost. The source code is available at https://github.com/kunzhan/wpp.
中文摘要:本文提出一种无需训练即插即用的方法,通过小波变换动态调节频域子带,以极小计算成本显著提升扩散模型生成质量并有效解决曝光偏差问题。
English Summary: This paper introduces a training-free, plug-and-play method that uses wavelet transforms to dynamically regulate frequency subbands, significantly improving diffusion model performance by addressing exposure bias with minimal computational cost.
Authors:Marc Kaufeld, Mattia Piccinini, Johannes Betz
Abstract:
This research introduces MP-RBFN, a novel formulation leveraging Radial Basis Function Networks for efficiently learning Motion Primitives derived from optimal control problems for autonomous driving. While traditional motion planning approaches based on optimization are highly accurate, they are often computationally prohibitive. In contrast, sampling-based methods demonstrate high performance but impose constraints on the geometric shape of trajectories. MP-RBFN combines the strengths of both by coupling the high-fidelity trajectory generation of sampling-based methods with an accurate description of vehicle dynamics. Empirical results show compelling performance compared to previous methods, achieving a precise description of motion primitives at low inference times. MP-RBFN yields a seven times higher accuracy in generating optimized motion primitives compared to existing semi-analytic approaches. We demonstrate the practical applicability of MP-RBFN for motion planning by integrating the method into a sampling-based trajectory planner. MP-RBFN is available as open-source software at https://github.com/TUM-AVS/RBFN-Motion-Primitives.
中文: 本研究提出MP-RBFN新方法,通过径向基函数网络将基于采样的轨迹生成高保真度与精确车辆动力学建模相结合,在自动驾驶运动基元学习中实现了七倍精度提升和更快的推理速度。
English: This research introduces MP-RBFN, a novel method that combines the high-fidelity trajectory generation of sampling-based approaches with accurate vehicle dynamics modeling using Radial Basis Function Networks, achieving seven times higher accuracy and faster inference times for autonomous driving motion primitives.
Authors:Xianghong Zou, Jianping Li, Zhe Chen, Zhen Cao, Zhen Dong, Qiegen Liu, Bisheng Yang
Abstract:
Point cloud place recognition (PCPR) determines the geo-location within a prebuilt map and plays a crucial role in geoscience and robotics applications such as autonomous driving, intelligent transportation, and augmented reality. In real-world large-scale deployments of a geographic positioning system, PCPR models must continuously acquire, update, and accumulate knowledge to adapt to diverse and dynamic environments, i.e., the ability known as continual learning (CL). However, existing PCPR models often suffer from catastrophic forgetting, leading to significant performance degradation in previously learned scenes when adapting to new environments or sensor types. This results in poor model scalability, increased maintenance costs, and system deployment difficulties, undermining the practicality of PCPR. To address these issues, we propose LifelongPR, a novel continual learning framework for PCPR, which effectively extracts and fuses knowledge from sequential point cloud data. First, to alleviate the knowledge loss, we propose a replay sample selection method that dynamically allocates sample sizes according to each dataset's information quantity and selects spatially diverse samples for maximal representativeness. Second, to handle domain shifts, we design a prompt learning-based CL framework with a lightweight prompt module and a two-stage training strategy, enabling domain-specific feature adaptation while minimizing forgetting. Comprehensive experiments on large-scale public and self-collected datasets are conducted to validate the effectiveness of the proposed method. Compared with state-of-the-art (SOTA) methods, our method achieves 6.50% improvement in mIR@1, 7.96% improvement in mR@1, and an 8.95% reduction in F. The code and pre-trained models are publicly available at https://github.com/zouxianghong/LifelongPR.
中文:LifelongPR框架通过动态回放采样和基于提示学习的领域自适应方法,有效解决了点云地点识别中的灾难性遗忘问题,在多项指标上实现了显著性能提升。
English: The proposed LifelongPR framework addresses catastrophic forgetting in point cloud place recognition through dynamic replay sampling and prompt-based domain adaptation, achieving state-of-the-art performance improvements across multiple metrics.
Authors:Zhifei Xu, Zhiqing Tang, Jiong Lou, Zhi Yao, Xuan Xie, Tian Wang, Yinglong Wang, Weijia Jia
Abstract:
The growth of Artificial Intelligence (AI) and large language models has enabled the use of Generative AI (GenAI) in cloud data centers for diverse AI-Generated Content (AIGC) tasks. Models like Stable Diffusion introduce unavoidable delays and substantial resource overhead, which are unsuitable for users at the network edge with high QoS demands. Deploying AIGC services on edge servers reduces transmission times but often leads to underutilized resources and fails to optimally balance inference latency and quality. To address these issues, this paper introduces a QoS-aware \underline{E}dge-collaborative \underline{A}IGC \underline{T}ask scheduling (EAT) algorithm. Specifically: 1) We segment AIGC tasks and schedule patches to various edge servers, formulating it as a gang scheduling problem that balances inference latency and quality while considering server heterogeneity, such as differing model distributions and cold start issues. 2) We propose a reinforcement learning-based EAT algorithm that uses an attention layer to extract load and task queue information from edge servers and employs a diffusion-based policy network for scheduling, efficiently enabling model reuse. 3) We develop an AIGC task scheduling system that uses our EAT algorithm to divide tasks and distribute them across multiple edge servers for processing. Experimental results based on our system and large-scale simulations show that our EAT algorithm can reduce inference latency by up to 56\% compared to baselines. We release our open-source code at https://github.com/zzf1955/EAT.
中文: 本文提出了一种服务质量感知的边缘协同AIGC任务调度(EAT)算法,通过将AIGC任务分割并利用强化学习在边缘服务器间调度,在平衡延迟和质量的同时,将推理延迟降低了高达56%。
English: This paper introduces the QoS-aware Edge-collaborative AIGC Task scheduling (EAT) algorithm, which segments AIGC tasks and schedules them across edge servers using reinforcement learning to reduce inference latency by up to 56% while balancing latency and quality.
Authors:Samson Yu, Kelvin Lin, Harold Soh
Abstract:
Touch is recognized as a vital sense for humans and an equally important modality for robots, especially for dexterous manipulation, material identification, and scenarios involving visual occlusion. Building upon very recent work in touch foundation models, this demonstration will feature Octopi-1.5, our latest visual-tactile-language model. Compared to its predecessor, Octopi-1.5 introduces the ability to process tactile signals from multiple object parts and employs a simple retrieval-augmented generation (RAG) module to improve performance on tasks and potentially learn new objects on-the-fly. The system can be experienced live through a new handheld tactile-enabled interface, the TMI, equipped with GelSight and TAC-02 tactile sensors. This convenient and accessible setup allows users to interact with Octopi-1.5 without requiring a robot. During the demonstration, we will showcase Octopi-1.5 solving tactile inference tasks by leveraging tactile inputs and commonsense knowledge. For example, in a Guessing Game, Octopi-1.5 will identify objects being grasped and respond to follow-up queries about how to handle it (e.g., recommending careful handling for soft fruits). We also plan to demonstrate Octopi-1.5's RAG capabilities by teaching it new items. With live interactions, this demonstration aims to highlight both the progress and limitations of VTLMs such as Octopi-1.5 and to foster further interest in this exciting field. Code for Octopi-1.5 and design files for the TMI gripper are available at https://github.com/clear-nus/octopi-1.5.
中文: Octopi-1.5 是一款先进的视觉-触觉-语言模型,通过整合多部位物体触觉信号和检索增强生成模块,能够实时识别物体并提供交互式操作建议,用户可通过便捷的手持界面进行体验。
English: Octopi-1.5 is an advanced visual-tactile-language model that enhances tactile processing by integrating multi-part object signals and a retrieval-augmented generation module, enabling real-time object identification and interactive handling recommendations through a user-friendly handheld interface.
Authors:Hang Yuan, Chen Li, Wenjun Ma, Yuncheng Jiang
Abstract:
Hit-like molecular generation with therapeutic potential is essential for target-specific drug discovery. However, the field lacks heterogeneous data and unified frameworks for integrating diverse molecular representations. To bridge this gap, we introduce TextOmics, a pioneering benchmark that establishes one-to-one correspondences between omics expressions and molecular textual descriptions. TextOmics provides a heterogeneous dataset that facilitates molecular generation through representations alignment. Built upon this foundation, we propose ToDi, a generative framework that jointly conditions on omics expressions and molecular textual descriptions to produce biologically relevant, chemically valid, hit-like molecules. ToDi leverages two encoders (OmicsEn and TextEn) to capture multi-level biological and semantic associations, and develops conditional diffusion (DiffGen) for controllable generation. Extensive experiments confirm the effectiveness of TextOmics and demonstrate ToDi outperforms existing state-of-the-art approaches, while also showcasing remarkable potential in zero-shot therapeutic molecular generation. Sources are available at: https://github.com/hala-ToDi.
中文: TextOmics是一个开创性基准,建立了组学表达与分子文本描述之间的一一对应关系,而基于此的ToDi生成框架通过双编码器和条件扩散技术,能高效生成具有生物相关性和化学有效性的候选药物分子。
English: TextOmics is a novel benchmark that links omics data with molecular textual descriptions, enabling the generation of hit-like molecules through the ToDi framework, which utilizes dual encoders and conditional diffusion for superior, controllable molecular creation.
Authors:Zhongyu Ouyang, Mingxuan Ju, Soroush Vosoughi, Yanfang Ye
Abstract:
Graph knowledge has been proven effective in enhancing item rankings in recommender systems (RecSys), particularly during the retrieval stage. However, its application in the ranking stage, especially when richer contextual information in user-item interactions is available, remains underexplored. A major challenge lies in the substantial computational cost associated with repeatedly retrieving neighborhood information from billions of items stored in distributed systems. This resource-intensive requirement makes it difficult to scale graph-based methods in practical RecSys. To bridge this gap, we first demonstrate that incorporating graphs in the ranking stage improves ranking qualities. Notably, while the improvement is evident, we show that the substantial computational overheads entailed by graphs are prohibitively expensive for real-world recommendations. In light of this, we propose a non-parametric strategy that utilizes graph convolution for re-ranking only during test time. Our strategy circumvents the notorious computational overheads from graph convolution during training, and utilizes structural knowledge hidden in graphs on-the-fly during testing. It can be used as a plug-and-play module and easily employed to enhance the ranking ability of various ranking layers of a real-world RecSys with significantly reduced computational overhead. Through comprehensive experiments across four benchmark datasets with varying levels of sparsity, we demonstrate that our strategy yields noticeable improvements (i.e., 8.1% on average) during testing time with little to no additional computational overheads (i.e., 0.5 on average). Code: https://github.com/zyouyang/RecSys2025_NonParamGC.git
图知识能提升推荐系统性能,但在排序阶段计算成本高昂,因此我们提出一种非参数图卷积方法,在测试时进行重排序,以极低开销显著提升效果。
Graph knowledge enhances recommender systems but faces high computational costs in ranking stages, so we propose a non-parametric graph convolution method for re-ranking during testing to improve performance with minimal overhead.
Authors:Shubham Shukla, Kunal Sonalkar
Abstract:
The fashion retail business is centered around the capacity to comprehend products. Product attribution helps in comprehending products depending on the business process. Quality attribution improves the customer experience as they navigate through millions of products offered by a retail website. It leads to well-organized product catalogs. In the end, product attribution directly impacts the 'discovery experience' of the customer. Although large language models (LLMs) have shown remarkable capabilities in understanding multimodal data, their performance on fine-grained fashion attribute recognition remains under-explored. This paper presents a zero-shot evaluation of state-of-the-art LLMs that balance performance with speed and cost efficiency, mainly GPT-4o-mini and Gemini 2.0 Flash. We have used the dataset DeepFashion-MultiModal (https://github.com/yumingj/DeepFashion-MultiModal) to evaluate these models in the attribution tasks of fashion products. Our study evaluates these models across 18 categories of fashion attributes, offering insight into where these models excel. We only use images as the sole input for product information to create a constrained environment. Our analysis shows that Gemini 2.0 Flash demonstrates the strongest overall performance with a macro F1 score of 56.79% across all attributes, while GPT-4o-mini scored a macro F1 score of 43.28%. Through detailed error analysis, our findings provide practical insights for deploying these LLMs in production e-commerce product attribution-related tasks and highlight the need for domain-specific fine-tuning approaches. This work also lays the groundwork for future research in fashion AI and multimodal attribute extraction.
中文: 本研究通过图像输入评估GPT-4o-mini和Gemini 2.0 Flash在细粒度时尚属性识别中的零样本性能,发现Gemini 2.0 Flash以56.79%的宏观F1分数表现更优,同时揭示了电子商务领域应用中对特定领域优化的需求。
English: This study evaluates the zero-shot performance of GPT-4o-mini and Gemini 2.0 Flash on fine-grained fashion attribute recognition using image inputs, finding Gemini 2.0 Flash superior with a 56.79% macro F1 score while highlighting the need for domain-specific optimization in e-commerce applications.
Authors:Zijian Ding, Tung Nguyen, Weikai Li, Aditya Grover, Yizhou Sun, Jason Cong
Abstract:
Deep learning-based prediction models for High-Level Synthesis (HLS) of hardware designs often struggle to generalize. In this paper, we study how to close the generalizability gap of these models through pretraining on synthetic data and introduce Iceberg, a synthetic data augmentation approach that expands both large language model (LLM)-generated programs and weak labels of unseen design configurations. Our weak label generation method is integrated with an in-context model architecture, enabling meta-learning from actual and proximate labels. Iceberg improves the geometric mean modeling accuracy by $86.4\%$ when adapt to six real-world applications with few-shot examples and achieves a $2.47\times$ and a $1.12\times$ better offline DSE performance when adapting to two different test datasets. Our open-sourced code is here: https://github.com/UCLA-VAST/iceberg
中文摘要:Iceberg通过合成数据增强方法,结合大语言模型生成的程序和弱标签,有效缩小了高层次综合中深度学习模型的泛化差距,在少量样本下显著提升建模精度和设计空间探索效率。
English Summary: Iceberg, a synthetic data augmentation method using LLM-generated programs and weak labels, significantly enhances the generalizability of deep learning models in High-Level Synthesis by improving modeling accuracy and design space exploration performance through meta-learning.
Authors:Huilai Li, Yonghao Dang, Ying Xing, Yiming Wang, Jianqin Yin
Abstract:
Dense audio-visual event localization (DAVE) aims to identify event categories and locate the temporal boundaries in untrimmed videos. Most studies only employ event-related semantic constraints on the final outputs, lacking cross-modal semantic bridging in intermediate layers. This causes modality semantic gap for further fusion, making it difficult to distinguish between event-related content and irrelevant background content. Moreover, they rarely consider the correlations between events, which limits the model to infer concurrent events among complex scenarios. In this paper, we incorporate multi-stage semantic guidance and multi-event relationship modeling, which respectively enable hierarchical semantic understanding of audio-visual events and adaptive extraction of event dependencies, thereby better focusing on event-related information. Specifically, our eventaware semantic guided network (ESG-Net) includes a early semantics interaction (ESI) module and a mixture of dependency experts (MoDE) module. ESI applys multi-stage semantic guidance to explicitly constrain the model in learning semantic information through multi-modal early fusion and several classification loss functions, ensuring hierarchical understanding of event-related content. MoDE promotes the extraction of multi-event dependencies through multiple serial mixture of experts with adaptive weight allocation. Extensive experiments demonstrate that our method significantly surpasses the state-of-the-art methods, while greatly reducing parameters and computational load. Our code will be released on https://github.com/uchiha99999/ESG-Net.
Chinese: 本文提出的ESG-Net通过多阶段语义引导和多事件关系建模,改进了密集音视频事件定位的层次化语义理解与事件依赖提取,在显著降低计算负载的同时实现了最先进的性能。
English: This paper introduces ESG-Net, a novel approach for dense audio-visual event localization that integrates multi-stage semantic guidance and multi-event relationship modeling to enhance hierarchical understanding and dependency extraction, achieving superior performance with reduced computational demands.
Authors:Gaurav R. Ghosal, Pratyush Maini, Aditi Raghunathan
Abstract:
Large language models are susceptible to memorizing repeated sequences, posing privacy and copyright concerns. A popular mitigation strategy is to remove memorized information from specific neurons post-hoc. However, such approaches have shown limited success so far. In a controlled setting, we show that the memorization of natural sequences (those that resemble linguistically plausible text) become mechanistically entangled with general language abilities, thereby becoming challenging to remove post-hoc. In this work, we put forward a new paradigm of MemSinks that promotes isolation of memorization by design. We leverage a sequence identifier that activates a unique set of memorization neurons for each sequence across repetitions. By analyzing the dynamics of learning and forgetting, we argue that MemSinks facilitates isolation of memorized content, making it easier to remove without compromising general language capabilities. We implement MemSinks at the billion-parameter and billion-token scale, and observe both effective isolation and strong generalization. To our knowledge, this is the first proof-of-concept on real data demonstrating that simultaneous generalization and isolation is achievable. We open-source our code at http://github.com/grghosal/MemSinks.
中文: MemSinks框架通过为每个重复序列激活独特的记忆神经元,提出了一种新颖的方法来隔离大型语言模型中的记忆内容,使其易于移除而不损害通用语言能力,同时保持强大的泛化性能。
English: The MemSinks framework introduces a novel approach to isolate memorized sequences in large language models by activating unique neurons for each repeated sequence, enabling effective removal without harming general language abilities while maintaining strong generalization.
Authors:Zhanjiang Yang, Lijun Sun, Jiawei Dong, Xiaoxin An, Yang Liu, Meng Li
Abstract:
Reconstructing hyperspectral images (HSIs) from RGB inputs provides a cost-effective alternative to hyperspectral cameras, but reconstructing high-dimensional spectra from three channels is inherently ill-posed. Existing methods typically directly regress RGB-to-HSI mappings using large attention networks, which are computationally expensive and handle ill-posedness only implicitly. We propose MCGA, a Mixture-of-Codebooks with Grayscale-aware Attention framework that explicitly addresses these challenges using spectral priors and photometric consistency. MCGA first learns transferable spectral priors via a mixture-of-codebooks (MoC) from heterogeneous HSI datasets, then aligns RGB features with these priors through grayscale-aware photometric attention (GANet). Efficiency and robustness are further improved via top-K attention design and test-time adaptation (TTA). Experiments on benchmarks and real-world data demonstrate the state-of-the-art accuracy, strong cross-dataset generalization, and 4-5x faster inference. Codes will be available once acceptance at https://github.com/Fibonaccirabbit/MCGA.
中文摘要:提出的MCGA框架通过利用光谱先验和光度一致性,从RGB输入高效重建高光谱图像,实现了最先进的精度、更快的推理速度和更强的泛化能力。
English Summary: The proposed MCGA framework efficiently reconstructs hyperspectral images from RGB inputs by leveraging spectral priors and photometric consistency, achieving state-of-the-art accuracy with faster inference and better generalization.
Authors:Qinyuan Ye, Robin Jia, Xiang Ren
Abstract:
Large language models demonstrate the intriguing ability to perform unseen tasks via in-context learning. However, it remains unclear what mechanisms inside the model drive such task-level generalization. In this work, we approach this question through the lens of off-by-one addition (i.e., 1+1=3, 2+2=5, 3+3=?), a two-step, counterfactual task with an unexpected +1 function as a second step. Leveraging circuit-style interpretability techniques such as path patching, we analyze the models' internal computations behind their notable performance and present three key findings. First, we uncover a function induction mechanism that explains the model's generalization from standard addition to off-by-one addition. This mechanism resembles the structure of the induction head mechanism found in prior work and elevates it to a higher level of abstraction. Second, we show that the induction of the +1 function is governed by multiple attention heads in parallel, each of which emits a distinct piece of the +1 function. Finally, we find that this function induction mechanism is reused in a broader range of tasks, including synthetic tasks such as shifted multiple-choice QA and algorithmic tasks such as base-8 addition. Overall, our findings offer deeper insights into how reusable and composable structures within language models enable task-level generalization.
中文: 本研究通过“错位加法”案例揭示了大型语言模型通过可复用的函数归纳机制实现任务级泛化,多个注意力头并行诱导+1函数,该机制可迁移至合成问答及算法任务中。
English: This study reveals how large language models generalize to unseen tasks through a reusable function induction mechanism, using off-by-one addition as a case to demonstrate parallel attention heads enabling task-level adaptation across various contexts.
Authors:Qinyuan Ye, Robin Jia, Xiang Ren
Abstract:
Large language models demonstrate the intriguing ability to perform unseen tasks via in-context learning. However, it remains unclear what mechanisms inside the model drive such task-level generalization. In this work, we approach this question through the lens of off-by-one addition (i.e., 1+1=3, 2+2=5, 3+3=?), a two-step, counterfactual task with an unexpected +1 function as a second step. Leveraging circuit-style interpretability techniques such as path patching, we analyze the models' internal computations behind their performance and present three key findings. First, we uncover a function induction mechanism that explains the model's generalization from standard addition to off-by-one addition. This mechanism resembles the structure of the induction head mechanism found in prior work and elevates it to a higher level of abstraction. Second, we show that the induction of the +1 function is governed by multiple attention heads in parallel, each of which emits a distinct piece of the +1 function. Finally, we find that this function induction mechanism is reused in a broader range of tasks, including synthetic tasks such as shifted multiple-choice QA and algorithmic tasks such as base-8 addition. Overall, our findings offer deeper insights into how reusable and composable structures within language models enable task-level generalization.
中文: 本研究通过“错位加法”案例揭示了大型语言模型通过可复用的函数归纳机制实现任务级泛化,多个注意力头并行诱导+1函数,该机制可迁移至合成问答及算法任务中。
English: This study reveals how large language models generalize to unseen tasks through a reusable function induction mechanism, using off-by-one addition as a case to demonstrate parallel attention heads enabling task-level adaptation across various contexts.
Authors:Jiatong Li, Qi Liu, Mengxiao Zhu
Abstract:
Cognitive diagnosis (CD) models latent cognitive states of human learners by analyzing their response patterns on diagnostic tests, serving as a crucial machine learning technique for educational assessment and evaluation. Traditional cognitive diagnosis models typically follow a transductive prediction paradigm that optimizes parameters to fit response scores and extract learner abilities. These approaches face significant limitations as they cannot perform instant diagnosis for new learners without computationally expensive retraining and produce diagnostic outputs with limited reliability. In this study, we introduces a novel generative diagnosis paradigm that fundamentally shifts CD from predictive to generative modeling, enabling inductive inference of cognitive states without parameter re-optimization. We propose two simple yet effective instantiations of this paradigm: Generative Item Response Theory (G-IRT) and Generative Neural Cognitive Diagnosis Model (G-NCDM), which achieve excellent performance improvements over traditional methods. The generative approach disentangles cognitive state inference from response prediction through a well-designed generation process that incorporates identifiability and monotonicity conditions. Extensive experiments on real-world datasets demonstrate the effectiveness of our methodology in addressing scalability and reliability challenges, especially $\times 100$ speedup for the diagnosis of new learners. Our framework opens new avenues for cognitive diagnosis applications in artificial intelligence, particularly for intelligent model evaluation and intelligent education systems. The code is available at https://github.com/CSLiJT/Generative-CD.git.
中文摘要:本研究提出了一种生成式认知诊断新范式,无需重新训练即可对新学习者进行即时可靠的认知状态评估,相比传统方法实现了百倍加速与性能显著提升。
English Summary: This study introduces a generative cognitive diagnosis paradigm that enables instant, reliable assessment of new learners without retraining, achieving significant performance improvements and a 100x speedup over traditional methods.
Authors:Amirhossein Ansari, Ke Wang, Pulei Xiong
Abstract:
Recent advancements in Vision-Language Models like CLIP have enabled zero-shot OOD detection by leveraging both image and textual label information. Among these, negative label-based methods such as NegLabel and CSP have shown promising results by utilizing a lexicon of words to define negative labels for distinguishing OOD samples. However, these methods suffer from detecting in-distribution samples as OOD due to negative labels that are subcategories of in-distribution labels or proper nouns. They also face limitations in handling images that match multiple in-distribution and negative labels. We propose NegRefine, a novel negative label refinement framework for zero-shot OOD detection. By introducing a filtering mechanism to exclude subcategory labels and proper nouns from the negative label set and incorporating a multi-matching-aware scoring function that dynamically adjusts the contributions of multiple labels matching an image, NegRefine ensures a more robust separation between in-distribution and OOD samples. We evaluate NegRefine on large-scale benchmarks, including ImageNet-1K. The code is available at https://github.com/ah-ansari/NegRefine.
Chinese: NegRefine是一种新颖的负标签优化框架,通过过滤子类别和专有名词并采用动态评分函数,有效提升零样本分布外检测性能,在ImageNet-1K等基准测试中表现优异。
English: NegRefine is a novel framework that refines negative labels by filtering out subcategories and proper nouns and employs a dynamic scoring function to enhance zero-shot out-of-distribution detection, achieving robust performance on benchmarks like ImageNet-1K.
Authors:Paulo Salem, Robert Sim, Christopher Olsen, Prerit Saxena, Rafael Barcelos, Yi Ding
Abstract:
Recent advances in Large Language Models (LLM) have led to a new class of autonomous agents, renewing and expanding interest in the area. LLM-powered Multiagent Systems (MAS) have thus emerged, both for assistive and simulation purposes, yet tools for realistic human behavior simulation -- with its distinctive challenges and opportunities -- remain underdeveloped. Existing MAS libraries and tools lack fine-grained persona specifications, population sampling facilities, experimentation support, and integrated validation, among other key capabilities, limiting their utility for behavioral studies, social simulation, and related applications. To address these deficiencies, in this work we introduce TinyTroupe, a simulation toolkit enabling detailed persona definitions (e.g., nationality, age, occupation, personality, beliefs, behaviors) and programmatic control via numerous LLM-driven mechanisms. This allows for the concise formulation of behavioral problems of practical interest, either at the individual or group level, and provides effective means for their solution. TinyTroupe's components are presented using representative working examples, such as brainstorming and market research sessions, thereby simultaneously clarifying their purpose and demonstrating their usefulness. Quantitative and qualitative evaluations of selected aspects are also provided, highlighting possibilities, limitations, and trade-offs. The approach, though realized as a specific Python implementation, is meant as a novel conceptual contribution, which can be partially or fully incorporated in other contexts. The library is available as open source at https://github.com/microsoft/tinytroupe.
Chinese Summary: TinyTroupe作为一种新型模拟工具包,通过基于大语言模型的驱动机制实现精细人物角色定义和程序化控制,解决了现有多智能体系统在行为模拟方面的不足,为社会科学研究和市场分析等应用提供了有效解决方案。
English Summary: TinyTroupe is a new simulation toolkit that addresses the limitations of existing multiagent systems by enabling detailed persona specifications and programmatic control through LLM-driven mechanisms, facilitating realistic behavioral simulations for applications like social studies and market research.
Authors:Junaid Iqbal Khan
Abstract:
Approximate machine unlearning (AMU) enables models to `forget' specific training data through specialized fine-tuning on a retained (and forget) subset of training set. However, processing this large retained subset still dominates computational runtime, while reductions of unlearning epochs also remain a challenge. In this paper, we propose two complementary methods to accelerate arbitrary classification-oriented AMU method. First, \textbf{Blend}, a novel distribution-matching dataset condensation (DC), merges visually similar images with shared blend-weights to significantly reduce the retained set size. It operates with minimal pre-processing overhead and is orders of magnitude faster than state-of-the-art DC methods. Second, our loss-centric method, \textbf{Accelerated-AMU (A-AMU)}, augments the AMU objective to quicken convergence. A-AMU achieves this by combining a steepened primary loss to expedite forgetting with a differentiable regularizer that matches the loss distributions of forgotten and in-distribution unseen data. Our extensive experiments demonstrate that this dual approach of data and loss-centric optimization dramatically reduces end-to-end unlearning latency across both single and multi-round scenarios, all while preserving model utility and privacy. To our knowledge, this is the first work to systematically tackle unlearning efficiency by jointly designing a specialized dataset condensation technique with a dedicated accelerated loss function. Code is available at https://github.com/algebraicdianuj/DC_Unlearning.
中文: 本文提出Blend方法通过合并相似图像压缩保留数据集规模,并结合A-AMU损失增强技术加速收敛,在保持模型效用与隐私的同时,显著降低了机器遗忘的整体耗时。
English: This paper introduces Blend, a dataset condensation method that merges similar images to reduce retained data size, and A-AMU, a loss-augmentation technique that accelerates convergence, together significantly cutting machine unlearning time while maintaining model performance and privacy.
Authors:Mihir Kavishwar, Naresh Shanbhag
Abstract:
Analog in-memory computing (AIMC) is an energy-efficient alternative to digital architectures for accelerating machine learning and signal processing workloads. However, its energy efficiency is limited by the high energy cost of the column analog-to-digital converters (ADCs). Reducing the ADC precision is an effective approach to lowering its energy cost. However, doing so also reduces the AIMC's computational accuracy thereby making it critical to identify the minimum precision required to meet a target accuracy. Prior works overestimate the ADC precision requirements by modeling quantization error as input-independent noise, maximizing the signal-to-quantization-noise ratio (SQNR), and ignoring the discrete nature of ideal pre-ADC signal. We address these limitations by developing analytical expressions for estimating the compute signal-to-noise ratio (CSNR), a true metric of accuracy for AIMCs, and propose CACTUS, an algorithm to obtain CSNR-optimal ADC parameters. Using a circuit-aware behavioral model of an SRAM-based AIMC in a 28nm CMOS process, we show that for a 256-dimensional binary dot product, CACTUS reduces the ADC precision requirements by 3b while achieving 6dB higher CSNR over prior methods. We also delineate operating conditions under which our proposed CSNR-optimal ADCs outperform conventional SQNR-optimal ADCs.
中文摘要:模拟内存计算因高精度模数转换器能耗受限,而CACTUS算法在保持计算精度的前提下,将模数转换器精度要求降低3比特,相比现有方法提升计算信噪比6分贝。
English Summary: Analog in-memory computing faces energy limitations from high-precision ADCs, but the proposed CACTUS algorithm reduces ADC precision by 3 bits while improving computational accuracy by 6dB over prior methods.
Authors:Abdul Manaf, Nimra Mughal
Abstract:
Pneumonia is a leading cause of mortality in children under five, requiring accurate chest X-ray diagnosis. This study presents a machine learning-based Pediatric Chest Pneumonia Classification System to assist healthcare professionals in diagnosing pneumonia from chest X-ray images. The CNN-based model was trained on 5,863 labeled chest X-ray images from children aged 0-5 years from the Guangzhou Women and Children's Medical Center. To address limited data, we applied augmentation techniques (rotation, zooming, shear, horizontal flipping) and employed GANs to generate synthetic images, addressing class imbalance. The system achieved optimal performance using combined original, augmented, and GAN-generated data, evaluated through accuracy and F1 score metrics. The final model was deployed via a Flask web application, enabling real-time classification with probability estimates. Results demonstrate the potential of deep learning and GANs in improving diagnostic accuracy and efficiency for pediatric pneumonia classification, particularly valuable in resource-limited clinical settings https://github.com/AbdulManaf12/Pediatric-Chest-Pneumonia-Classification
中文: 本研究开发了一种基于CNN和GAN的机器学习系统,通过胸部X光片对儿童肺炎进行分类,实现了高诊断准确性,并通过网络应用程序部署供临床使用。
English: This study developed a machine learning system using CNN and GANs to classify pediatric pneumonia from chest X-rays, achieving high diagnostic accuracy and deployment via a web application for clinical use.
Authors:Dongyang Li, Haoyang Qin, Mingyang Wu, Chen Wei, Quanying Liu
Abstract:
Understanding how the brain represents visual information is a fundamental challenge in neuroscience and artificial intelligence. While AI-driven decoding of neural data has provided insights into the human visual system, integrating multimodal neuroimaging signals, such as EEG, MEG, and fMRI, remains a critical hurdle due to their inherent spatiotemporal misalignment. Current approaches often analyze these modalities in isolation, limiting a holistic view of neural representation. In this study, we introduce BrainFLORA, a unified framework for integrating cross-modal neuroimaging data to construct a shared neural representation. Our approach leverages multimodal large language models (MLLMs) augmented with modality-specific adapters and task decoders, achieving state-of-the-art performance in joint-subject visual retrieval task and has the potential to extend multitasking. Combining neuroimaging analysis methods, we further reveal how visual concept representations align across neural modalities and with real world object perception. We demonstrate that the brain's structured visual concept representations exhibit an implicit mapping to physical-world stimuli, bridging neuroscience and machine learning from different modalities of neural imaging. Beyond methodological advancements, BrainFLORA offers novel implications for cognitive neuroscience and brain-computer interfaces (BCIs). Our code is available at https://github.com/ncclab-sustech/BrainFLORA.
中文: 本研究提出BrainFLORA统一框架,通过增强多模态大语言模型整合跨模态神经影像数据,构建共享神经表征,在视觉检索任务中实现最优性能,并揭示了大脑表征与现实世界感知的对齐机制。
English: This study introduces BrainFLORA, a unified framework that integrates multimodal neuroimaging data using augmented large language models to create shared neural representations, achieving top performance in visual retrieval and revealing how brain representations align with real-world perception.
Authors:Dongyang Li, Haoyang Qin, Mingyang Wu, Chen Wei, Quanying Liu
Abstract:
Understanding how the brain represents visual information is a fundamental challenge in neuroscience and artificial intelligence. While AI-driven decoding of neural data has provided insights into the human visual system, integrating multimodal neuroimaging signals, such as EEG, MEG, and fMRI, remains a critical hurdle due to their inherent spatiotemporal misalignment. Current approaches often analyze these modalities in isolation, limiting a holistic view of neural representation. In this study, we introduce BrainFLORA, a unified framework for integrating cross-modal neuroimaging data to construct a shared neural representation. Our approach leverages multimodal large language models (MLLMs) augmented with modality-specific adapters and task decoders, achieving state-of-the-art performance in joint-subject visual retrieval task and has the potential to extend multitasking. Combining neuroimaging analysis methods, we further reveal how visual concept representations align across neural modalities and with real world object perception. We demonstrate that the brain's structured visual concept representations exhibit an implicit mapping to physical-world stimuli, bridging neuroscience and machine learning from different modalities of neural imaging. Beyond methodological advancements, BrainFLORA offers novel implications for cognitive neuroscience and brain-computer interfaces (BCIs). Our code is available at https://github.com/ncclab-sustech/BrainFLORA.
中文: 本研究提出BrainFLORA统一框架,通过增强多模态大语言模型整合跨模态神经影像数据,构建共享神经表征,在视觉检索任务中实现最优性能,并揭示了大脑表征与现实世界感知的对齐机制。
English: This study introduces BrainFLORA, a unified framework that integrates multimodal neuroimaging data using augmented large language models to create shared neural representations, achieving top performance in visual retrieval and revealing how brain representations align with real-world perception.
Authors:Xinyu Zhang, Zhonghao Ye, Jingwei Zhang, Xiang Tian, Zhisheng Liang, Shipeng Yu
Abstract:
WiFi-based human pose estimation has emerged as a promising non-visual alternative approaches due to its pene-trability and privacy advantages. This paper presents VST-Pose, a novel deep learning framework for accurate and continuous pose estimation using WiFi channel state information. The proposed method introduces ViSTA-Former, a spatiotemporal attention backbone with dual-stream architecture that adopts a dual-stream architecture to separately capture temporal dependencies and structural relationships among body joints. To enhance sensitivity to subtle human motions, a velocity modeling branch is integrated into the framework, which learns short-term keypoint dis-placement patterns and improves fine-grained motion representation. We construct a 2D pose dataset specifically designed for smart home care scenarios and demonstrate that our method achieves 92.2% accuracy on the PCK@50 metric, outperforming existing methods by 8.3% in PCK@50 on the self-collected dataset. Further evaluation on the public MMFi dataset confirms the model's robustness and effectiveness in 3D pose estimation tasks. The proposed system provides a reliable and privacy-aware solution for continuous human motion analysis in indoor environments. Our codes are available in https://github.com/CarmenQing/VST-Pose.
中文:VST-Pose是一种基于WiFi信号的新型深度学习框架,通过速度建模增强了对细微动作的捕捉能力,在姿态估计任务中达到92.2%的准确率,为室内人体运动分析提供了可靠的隐私保护方案。
English: VST-Pose is a novel deep learning framework using WiFi signals that achieves 92.2% pose estimation accuracy with enhanced motion sensitivity through velocity modeling, offering a privacy-preserving solution for indoor human motion analysis.
Authors:Zhengyuan Peng, Jianqing Xu, Shen Li, Jiazhen Ji, Yuge Huang, Jingyun Zhang, Jinmin Li, Shouhong Ding, Rizen Guo, Xin Tan, Lizhuang Ma
Abstract:
Human-machine interaction through augmented reality (AR) and virtual reality (VR) is increasingly prevalent, requiring accurate and efficient gaze estimation which hinges on the accuracy of eye segmentation to enable smooth user experiences. We introduce EyeSeg, a novel eye segmentation framework designed to overcome key challenges that existing approaches struggle with: motion blur, eyelid occlusion, and train-test domain gaps. In these situations, existing models struggle to extract robust features, leading to suboptimal performance. Noting that these challenges can be generally quantified by uncertainty, we design EyeSeg as an uncertainty-aware eye segmentation framework for AR/VR wherein we explicitly model the uncertainties by performing Bayesian uncertainty learning of a posterior under the closed set prior. Theoretically, we prove that a statistic of the learned posterior indicates segmentation uncertainty levels and empirically outperforms existing methods in downstream tasks, such as gaze estimation. EyeSeg outputs an uncertainty score and the segmentation result, weighting and fusing multiple gaze estimates for robustness, which proves to be effective especially under motion blur, eyelid occlusion and cross-domain challenges. Moreover, empirical results suggest that EyeSeg achieves segmentation improvements of MIoU, E1, F1, and ACC surpassing previous approaches. The code is publicly available at https://github.com/JethroPeng/EyeSeg.
中文摘要:EyeSeg是一种面向AR/VR的不确定性感知眼部分割框架,通过贝叶斯学习建模不确定性,有效解决了运动模糊、眼睑遮挡和领域差异等关键挑战,在分割精度和视线估计鲁棒性方面均优于现有方法。
English Summary: EyeSeg is an uncertainty-aware eye segmentation framework for AR/VR that addresses motion blur, eyelid occlusion, and domain gaps by modeling uncertainties through Bayesian learning, achieving superior segmentation performance and gaze estimation robustness.
Authors:Taniv Ashraf
Abstract:
The advent of powerful, accessible Large Language Models (LLMs) like Google's Gemini presents new opportunities for democratizing financial data analysis. This paper documents the design, implementation, and iterative debugging of a novel, serverless system for real-time stock analysis. The system leverages the Gemini API for qualitative assessment, automates data ingestion and processing via GitHub Actions, and presents the findings through a decoupled, static frontend. We detail the architectural evolution of the system, from initial concepts to a robust, event-driven pipeline, highlighting the practical challenges encountered during deployment. A significant portion of this paper is dedicated to a case study on the debugging process, covering common software errors, platform-specific permission issues, and rare, environment-level platform bugs. The final architecture operates at a near-zero cost, demonstrating a viable model for individuals to build sophisticated AI-powered financial tools. The operational application is publicly accessible, and the complete source code is available for review. We conclude by discussing the role of LLMs in financial analysis, the importance of robust debugging methodologies, and the emerging paradigm of human-AI collaboration in software development.
中文: 本文介绍了一种利用谷歌Gemini进行实时股票分析的无服务器系统,详细阐述了从概念到经济高效的事件驱动架构的开发过程,使个人能够构建AI驱动的金融工具。
English: This paper presents a serverless system using Google's Gemini for real-time stock analysis, detailing its development from concept to a cost-effective, event-driven architecture that enables individuals to build AI-powered financial tools.
Authors:Haozhe Zhao, Zefan Cai, Shuzheng Si, Liang Chen, Jiuxiang Gu, Wen Xiao, Junjie Hu
Abstract:
Recent text-to-image models produce high-quality results but still struggle with precise visual control, balancing multimodal inputs, and requiring extensive training for complex multimodal image generation. To address these limitations, we propose MENTOR, a novel autoregressive (AR) framework for efficient Multimodal-conditioned Tuning for Autoregressive multimodal image generation. MENTOR combines an AR image generator with a two-stage training paradigm, enabling fine-grained, token-level alignment between multimodal inputs and image outputs without relying on auxiliary adapters or cross-attention modules. The two-stage training consists of: (1) a multimodal alignment stage that establishes robust pixel- and semantic-level alignment, followed by (2) a multimodal instruction tuning stage that balances the integration of multimodal inputs and enhances generation controllability. Despite modest model size, suboptimal base components, and limited training resources, MENTOR achieves strong performance on the DreamBench++ benchmark, outperforming competitive baselines in concept preservation and prompt following. Additionally, our method delivers superior image reconstruction fidelity, broad task adaptability, and improved training efficiency compared to diffusion-based methods. Dataset, code, and models are available at: https://github.com/HaozheZhao/MENTOR
中文摘要:MENTOR框架通过自回归方法和两阶段训练实现了多模态输入与图像输出的细粒度对齐,在有限资源下仍展现出卓越的生成性能与训练效率,超越了现有基准方法。
English Summary: The MENTOR framework introduces an autoregressive approach with two-stage training to enhance multimodal image generation by achieving fine-grained alignment between inputs and outputs, demonstrating superior performance and efficiency over existing methods despite limited resources.
Authors:Bolun Zheng, Xinjie Liu, Qianyu Zhang, Canjin Wang, Fangni Chen, Mingen Xu
Abstract:
3D hand pose estimation has garnered great attention in recent years due to its critical applications in human-computer interaction, virtual reality, and related fields. The accurate estimation of hand joints is essential for high-quality hand pose estimation. However, existing methods neglect the importance of Distal Phalanx Tip (TIP) and Wrist in predicting hand joints overall and often fail to account for the phenomenon of error accumulation for distal joints in gesture estimation, which can cause certain joints to incur larger errors, resulting in misalignments and artifacts in the pose estimation and degrading the overall reconstruction quality. To address this challenge, we propose a novel segmented architecture for enhanced hand pose estimation (EHPE). We perform local extraction of TIP and wrist, thus alleviating the effect of error accumulation on TIP prediction and further reduce the predictive errors for all joints on this basis. EHPE consists of two key stages: In the TIP and Wrist Joints Extraction stage (TW-stage), the positions of the TIP and wrist joints are estimated to provide an initial accurate joint configuration; In the Prior Guided Joints Estimation stage (PG-stage), a dual-branch interaction network is employed to refine the positions of the remaining joints. Extensive experiments on two widely used benchmarks demonstrate that EHPE achieves state-of-the-arts performance. Code is available at https://github.com/SereinNout/EHPE.
中文摘要:本文提出的EHPE方法通过分段式架构分别提取指尖和手腕关节再优化其余关节,有效解决了三维手势估计中的误差累积问题,在多个基准测试中达到了最优性能。
English Summary: The proposed EHPE method addresses error accumulation in 3D hand pose estimation by introducing a segmented architecture that separately extracts distal phalanx tip and wrist joints before refining remaining joints, achieving state-of-the-art performance on benchmarks.
Authors:Ximeng Zhai, Bohan Xu, Yaohong Chen, Hao Wang, Kehua Guo, Yimian Dai
Abstract:
Due to the limitation of the optical lens focal length and the resolution of the infrared detector, distant Closely-Spaced Infrared Small Target (CSIST) groups typically appear as mixing spots in the infrared image. In this paper, we propose a novel task, Sequential CSIST Unmixing, namely detecting all targets in the form of sub-pixel localization from a highly dense CSIST group. However, achieving such precise detection is an extremely difficult challenge. In addition, the lack of high-quality public datasets has also restricted the research progress. To this end, firstly, we contribute an open-source ecosystem, including SeqCSIST, a sequential benchmark dataset, and a toolkit that provides objective evaluation metrics for this special task, along with the implementation of 23 relevant methods. Furthermore, we propose the Deformable Refinement Network (DeRefNet), a model-driven deep learning framework that introduces a Temporal Deformable Feature Alignment (TDFA) module enabling adaptive inter-frame information aggregation. To the best of our knowledge, this work is the first endeavor to address the CSIST Unmixing task within a multi-frame paradigm. Experiments on the SeqCSIST dataset demonstrate that our method outperforms the state-of-the-art approaches with mean Average Precision (mAP) metric improved by 5.3\%. Our dataset and toolkit are available from https://github.com/GrokCV/SeqCSIST.
Chinese: 本文提出了序列化密集红外小目标分离任务,通过亚像素定位检测目标,并开发了DeRefNet框架和新数据集,在mAP指标上比现有最佳方法提升了5.3%。
English: This paper introduces the Sequential CSIST Unmixing task for detecting closely-spaced infrared small targets through sub-pixel localization and presents the DeRefNet framework with a novel dataset, achieving a 5.3% mAP improvement over existing methods.
Authors:Zihao Xiong, Fei Zhou, Fengyi Wu, Shuai Yuan, Maixia Fu, Zhenming Peng, Jian Yang, Yimian Dai
Abstract:
Infrared small target detection plays a vital role in remote sensing, industrial monitoring, and various civilian applications. Despite recent progress powered by deep learning, many end-to-end convolutional models tend to pursue performance by stacking increasingly complex architectures, often at the expense of interpretability, parameter efficiency, and generalization. These models typically overlook the intrinsic sparsity prior of infrared small targets--an essential cue that can be explicitly modeled for both performance and efficiency gains. To address this, we revisit the model-based paradigm of Robust Principal Component Analysis (RPCA) and propose Dynamic RPCA Network (DRPCA-Net), a novel deep unfolding network that integrates the sparsity-aware prior into a learnable architecture. Unlike conventional deep unfolding methods that rely on static, globally learned parameters, DRPCA-Net introduces a dynamic unfolding mechanism via a lightweight hypernetwork. This design enables the model to adaptively generate iteration-wise parameters conditioned on the input scene, thereby enhancing its robustness and generalization across diverse backgrounds. Furthermore, we design a Dynamic Residual Group (DRG) module to better capture contextual variations within the background, leading to more accurate low-rank estimation and improved separation of small targets. Extensive experiments on multiple public infrared datasets demonstrate that DRPCA-Net significantly outperforms existing state-of-the-art methods in detection accuracy. Code is available at https://github.com/GrokCV/DRPCA-Net.
中文: 本文提出的动态RPCA网络(DRPCA-Net)通过深度展开架构融合稀疏性先验知识,并采用自适应参数生成机制,在多种复杂背景下实现了显著优于现有方法的红外小目标检测精度。
English: The proposed Dynamic RPCA Network (DRPCA-Net) integrates sparsity-aware priors through a deep unfolding architecture with adaptive parameter generation, achieving superior infrared small target detection accuracy across diverse backgrounds.
Authors:Yunwei Lan, Zhigao Cui, Xin Luo, Chang Liu, Nian Wang, Menglin Zhang, Yanzhao Su, Dong Liu
Abstract:
Recent advancements in unpaired dehazing, particularly those using GANs, show promising performance in processing real-world hazy images. However, these methods tend to face limitations due to the generator's limited transport mapping capability, which hinders the full exploitation of their effectiveness in unpaired training paradigms. To address these challenges, we propose DehazeSB, a novel unpaired dehazing framework based on the Schrödinger Bridge. By leveraging optimal transport (OT) theory, DehazeSB directly bridges the distributions between hazy and clear images. This enables optimal transport mappings from hazy to clear images in fewer steps, thereby generating high-quality results. To ensure the consistency of structural information and details in the restored images, we introduce detail-preserving regularization, which enforces pixel-level alignment between hazy inputs and dehazed outputs. Furthermore, we propose a novel prompt learning to leverage pre-trained CLIP models in distinguishing hazy images and clear ones, by learning a haze-aware vision-language alignment. Extensive experiments on multiple real-world datasets demonstrate our method's superiority. Code: https://github.com/ywxjm/DehazeSB.
Chinese: DehazeSB提出了一种基于薛定谔桥的新型非配对去雾框架,利用最优传输理论直接连接有雾与清晰图像的分布,结合细节保留正则化和提示学习,在多个真实数据集上展现出卓越性能。
English: DehazeSB introduces a novel unpaired dehazing framework using the Schrödinger Bridge and optimal transport theory to efficiently map hazy to clear images with enhanced detail preservation and haze-aware prompt learning, achieving superior results on real-world datasets.
Authors:Yiwen Liang, Hui Chen, Yizhe Xiong, Zihan Zhou, Mengyao Lyu, Zijia Lin, Shuaicheng Niu, Sicheng Zhao, Jungong Han, Guiguang Ding
Abstract:
Vision-language models (VLMs) exhibit remarkable zero-shot capabilities but struggle with distribution shifts in downstream tasks when labeled data is unavailable, which has motivated the development of Test-Time Adaptation (TTA) to improve VLMs' performance during inference without annotations. Among various TTA approaches, cache-based methods show promise by preserving historical knowledge from low-entropy samples in a dynamic cache and fostering efficient adaptation. However, these methods face two critical reliability challenges: (1) entropy often becomes unreliable under distribution shifts, causing error accumulation in the cache and degradation in adaptation performance; (2) the final predictions may be unreliable due to inflexible decision boundaries that fail to accommodate large downstream shifts. To address these challenges, we propose a Reliable Test-time Adaptation (ReTA) method that integrates two complementary strategies to enhance reliability from two perspectives. First, to mitigate the unreliability of entropy as a sample selection criterion for cache construction, we introduce Consistency-aware Entropy Reweighting (CER), which incorporates consistency constraints to weight entropy during cache updating. While conventional approaches rely solely on low entropy for cache prioritization and risk introducing noise, our method leverages predictive consistency to maintain a high-quality cache and facilitate more robust adaptation. Second, we present Diversity-driven Distribution Calibration (DDC), which models class-wise text embeddings as multivariate Gaussian distributions, enabling adaptive decision boundaries for more accurate predictions across visually diverse content. Extensive experiments demonstrate that ReTA consistently outperforms state-of-the-art methods, particularly under real-world distribution shifts. Code: https://github.com/Evelyn1ywliang/ReTA.
中文: 针对分布偏移下视觉语言模型因熵值样本选择不可靠和决策边界僵化导致的性能下降问题,提出的ReTA方法通过一致性感知熵加权提升缓存质量,并利用多样性驱动的分布校准实现自适应预测,显著提升模型鲁棒性。
English: Vision-language models face reliability issues under distribution shifts due to unreliable entropy-based sample selection and rigid decision boundaries, which the proposed ReTA method addresses through consistency-aware entropy reweighting for cache quality and diversity-driven distribution calibration for adaptive predictions.
Authors:Changli Wang, Rui Wu, Fang Yin
Abstract:
Human emotions are complex, with sarcasm being a subtle and distinctive form. Despite progress in sarcasm research, sarcasm generation remains underexplored, primarily due to the overreliance on textual modalities and the neglect of visual cues, as well as the mismatch between image content and sarcastic intent in existing datasets. In this paper, we introduce M2SaG, a multimodal sarcasm generation dataset with 4,970 samples, each containing an image, a sarcastic text, and a sarcasm target. To benchmark M2SaG, we propose ViSP, a generation framework that integrates Proximal Policy Optimization (PPO) and contrastive learning. PPO utilizes reward scores from DIP to steer the generation of sarcastic texts, while contrastive learning encourages the model to favor outputs with higher reward scores. These strategies improve overall generation quality and produce texts with more pronounced sarcastic intent. We evaluate ViSP across five metric sets and find it surpasses all baselines, including large language models, underscoring their limitations in sarcasm generation. Furthermore, we analyze the distributions of Sarcasm Scores and Factual Incongruity for both M2SaG and the texts generated by ViSP. The generated texts exhibit higher mean Sarcasm Scores (0.898 vs. 0.770) and Factual Incongruity (0.768 vs. 0.739), demonstrating that ViSP produces higher-quality sarcastic content than the original dataset. % The dataset and code will be publicly available. Our dataset and code will be released at \textit{https://github.com/wclapply/ViSP}.
Chinese: 本文提出了多模态讽刺生成数据集M2SaG和ViSP框架,该框架通过PPO和对比学习提升讽刺文本生成质量,在包括大语言模型在内的基准测试中表现优异。
English: This paper introduces M2SaG, a multimodal sarcasm generation dataset, and ViSP, a framework that enhances sarcastic text generation through PPO and contrastive learning, outperforming baselines including large language models.
Authors:Yangning Li, Weizhi Zhang, Yuyao Yang, Wei-Chieh Huang, Yaozu Wu, Junyu Luo, Yuanchen Bei, Henry Peng Zou, Xiao Luo, Yusheng Zhao, Chunkit Chan, Yankai Chen, Zhongfen Deng, Yinghui Li, Hai-Tao Zheng, Dongyuan Li, Renhe Jiang, Ming Zhang, Yangqiu Song, Philip S. Yu
Abstract:
Retrieval-Augmented Generation (RAG) lifts the factuality of Large Language Models (LLMs) by injecting external knowledge, yet it falls short on problems that demand multi-step inference; conversely, purely reasoning-oriented approaches often hallucinate or mis-ground facts. This survey synthesizes both strands under a unified reasoning-retrieval perspective. We first map how advanced reasoning optimizes each stage of RAG (Reasoning-Enhanced RAG). Then, we show how retrieved knowledge of different type supply missing premises and expand context for complex inference (RAG-Enhanced Reasoning). Finally, we spotlight emerging Synergized RAG-Reasoning frameworks, where (agentic) LLMs iteratively interleave search and reasoning to achieve state-of-the-art performance across knowledge-intensive benchmarks. We categorize methods, datasets, and open challenges, and outline research avenues toward deeper RAG-Reasoning systems that are more effective, multimodally-adaptive, trustworthy, and human-centric. The collection is available at https://github.com/DavidZWZ/Awesome-RAG-Reasoning.
中文: 本综述将检索增强生成与推理方法整合于统一框架下,阐释了高级推理如何优化RAG各阶段及检索知识如何支撑复杂推理,同时聚焦新兴的协同框架并展望未来研究方向。
English: This survey integrates retrieval-augmented generation and reasoning methods under a unified framework, demonstrating how advanced reasoning enhances RAG and how retrieved knowledge supports complex inference, while highlighting emerging synergistic approaches and future research directions.
Authors:Yuanhong Zheng, Ruixuan Yu, Jian Sun
Abstract:
3D multi-person motion prediction is a highly complex task, primarily due to the dependencies on both individual past movements and the interactions between agents. Moreover, effectively modeling these interactions often incurs substantial computational costs. In this work, we propose a computationally efficient model for multi-person motion prediction by simplifying spatial and temporal interactions. Our approach begins with the design of lightweight dual branches that learn local and global representations for individual and multiple persons separately. Additionally, we introduce a novel cross-level interaction block to integrate the spatial and temporal representations from both branches. To further enhance interaction modeling, we explicitly incorporate the spatial inter-person distance embedding. With above efficient temporal and spatial design, we achieve state-of-the-art performance for multiple metrics on standard datasets of CMU-Mocap, MuPoTS-3D, and 3DPW, while significantly reducing the computational cost. Code is available at https://github.com/Yuanhong-Zheng/EMPMP.
中文: 本文提出了一种计算高效的3D多人运动预测模型,通过轻量级双分支结构和新型跨层级交互模块简化时空交互,在标准数据集上实现最优性能的同时显著降低了计算成本。
English: This paper introduces a computationally efficient model for 3D multi-person motion prediction by simplifying spatial and temporal interactions through lightweight dual branches and a novel cross-level interaction block, achieving state-of-the-art performance on standard datasets while significantly reducing computational costs.
Authors:Ankit Sanjyal
Abstract:
High-resolution image synthesis with diffusion models often suffers from energy instabilities and guidance artifacts that degrade visual quality. We analyze the latent energy landscape during sampling and propose adaptive classifier-free guidance (CFG) schedules that maintain stable energy trajectories. Our approach introduces energy-aware scheduling strategies that modulate guidance strength over time, achieving superior stability scores (0.9998) and consistency metrics (0.9873) compared to fixed-guidance approaches. We demonstrate that DPM++ 2M with linear-decreasing CFG scheduling yields optimal performance, providing sharper, more faithful images while reducing artifacts. Our energy profiling framework serves as a powerful diagnostic tool for understanding and improving diffusion model behavior.
中文摘要:本研究通过提出自适应无分类器引导调度策略,动态调整引导强度以解决扩散模型图像合成中的能量不稳定和伪影问题,相比固定引导方法获得了更优的稳定性和图像质量。
English Summary: This study addresses energy instabilities and artifacts in diffusion-based image synthesis by introducing adaptive classifier-free guidance schedules that dynamically adjust guidance strength, achieving superior stability and image quality compared to fixed approaches.
Authors:Peter Pao-Huang, Mitchell Black, Xiaojie Qiu
Abstract:
Generative modeling of graphs with spatial structure is essential across many applications from computer graphics to spatial genomics. Recent flow-based generative models have achieved impressive results by gradually adding and then learning to remove noise from these graphs. Existing models, however, use graph neural network architectures that are independent of the noise level, limiting their expressiveness. To address this issue, we introduce \textit{Noise-Conditioned Graph Networks} (NCGNs), a class of graph neural networks that dynamically modify their architecture according to the noise level during generation. Our theoretical and empirical analysis reveals that as noise increases, (1) graphs require information from increasingly distant neighbors and (2) graphs can be effectively represented at lower resolutions. Based on these insights, we develop Dynamic Message Passing (DMP), a specific instantiation of NCGNs that adapts both the range and resolution of message passing to the noise level. DMP consistently outperforms noise-independent architectures on a variety of domains including $3$D point clouds, spatiotemporal transcriptomics, and images. Code is available at https://github.com/peterpaohuang/ncgn.
Chinese: 本文提出噪声条件图网络(NCGNs),通过根据噪声水平动态调整网络架构来增强图生成建模,在三维点云和空间基因组学等多个领域均优于现有方法。
English: This paper introduces Noise-Conditioned Graph Networks (NCGNs), which dynamically adjust their architecture based on noise levels to enhance graph generative modeling, outperforming existing methods across multiple domains including 3D point clouds and spatial genomics.
Authors:Bojian Hou, Zhanliang Wang, Zhuoping Zhou, Boning Tong, Zexuan Wang, Jingxuan Bao, Duy Duong-Tran, Qi Long, Li Shen
Abstract:
Canonical correlation analysis (CCA) is a technique for finding correlations between different data modalities and learning low-dimensional representations. As fairness becomes crucial in machine learning, fair CCA has gained attention. However, previous approaches often overlook the impact on downstream classification tasks, limiting applicability. We propose a novel fair CCA method for fair representation learning, ensuring the projected features are independent of sensitive attributes, thus enhancing fairness without compromising accuracy. We validate our method on synthetic data and real-world data from the Alzheimer's Disease Neuroimaging Initiative (ADNI), demonstrating its ability to maintain high correlation analysis performance while improving fairness in classification tasks. Our work enables fair machine learning in neuroimaging studies where unbiased analysis is essential. Code is available in https://github.com/ZhanliangAaronWang/FR-CCA-ADNI.
中文: 本文提出了一种新颖的公平典型相关分析方法,确保投影特征与敏感属性无关,在阿尔茨海默病神经影像学计划数据验证中既提升了分类公平性又不损害准确性。
English: This paper introduces a novel fair canonical correlation analysis method that ensures feature independence from sensitive attributes, enhancing fairness while maintaining accuracy in classification tasks, as validated on synthetic and Alzheimer's Disease Neuroimaging Initiative data.
Authors:Sourish Suri, Yifei Shao
Abstract:
Crop diseases present a significant barrier to agricultural productivity and global food security, especially in large-scale farming where early identification is often delayed or inaccurate. This research introduces a Convolutional Neural Network (CNN)-based image classification system designed to automate the detection and classification of eight common crop diseases using leaf imagery. The methodology involves a complete deep learning pipeline: image acquisition from a large, labeled dataset, preprocessing via resizing, normalization, and augmentation, and model training using TensorFlow with Keras' Sequential API. The CNN architecture comprises three convolutional layers with increasing filter sizes and ReLU activations, followed by max pooling, flattening, and fully connected layers, concluding with a softmax output for multi-class classification. The system achieves high training accuracy (~90%) and demonstrates reliable performance on unseen data, although a validation accuracy of ~60% suggests minor overfitting. Notably, the model integrates a treatment recommendation module, providing actionable guidance by mapping each detected disease to suitable pesticide or fungicide interventions. Furthermore, the solution is deployed on an open-source, mobile-compatible platform, enabling real-time image-based diagnostics for farmers in remote areas. This research contributes a scalable and accessible tool to the field of precision agriculture, reducing reliance on manual inspection and promoting sustainable disease management practices. By merging deep learning with practical agronomic support, this work underscores the potential of CNNs to transform crop health monitoring and enhance food production resilience on a global scale.
中文摘要:本研究开发了一种基于卷积神经网络的图像分类系统,能够自动识别叶片图像中的八种常见作物病害并提供治理建议,通过高训练精度和移动端部署为精准农业提供支持。
English Summary: This research develops a CNN-based image classification system that automatically detects eight common crop diseases from leaf images and provides treatment recommendations, achieving high training accuracy and mobile deployment to support farmers in precision agriculture.
Authors:Svetlana Orlova, Tommie Kerssies, Brunó B. Englert, Gijs Dubbelman
Abstract:
Recent methods for ego-centric Traffic Anomaly Detection (TAD) often rely on complex multi-stage or multi-representation fusion architectures, yet it remains unclear whether such complexity is necessary. Recent findings in visual perception suggest that foundation models, enabled by advanced pre-training, allow simple yet flexible architectures to outperform specialized designs. Therefore, in this work, we investigate an architecturally simple encoder-only approach using plain Video Vision Transformers (Video ViTs) and study how pre-training enables strong TAD performance. We find that: (i) advanced pre-training enables simple encoder-only models to match or even surpass the performance of specialized state-of-the-art TAD methods, while also being significantly more efficient; (ii) although weakly- and fully-supervised pre-training are advantageous on standard benchmarks, we find them less effective for TAD. Instead, self-supervised Masked Video Modeling (MVM) provides the strongest signal; and (iii) Domain-Adaptive Pre-Training (DAPT) on unlabeled driving videos further improves downstream performance, without requiring anomalous examples. Our findings highlight the importance of pre-training and show that effective, efficient, and scalable TAD models can be built with minimal architectural complexity. We release our code, domain-adapted encoders, and fine-tuned models to support future work: https://github.com/tue-mps/simple-tad.
Chinese: 最新研究表明,采用视频视觉变换器的简单编码器模型,在结合自监督掩码视频建模和领域自适应预训练等先进技术后,能在以自我为中心的交通异常检测中超越复杂专用方法,同时显著提升效率。
English: Recent research demonstrates that simple encoder-only models using Video Vision Transformers, when enhanced with advanced pre-training like self-supervised Masked Video Modeling and domain-adaptive techniques, can outperform complex specialized methods in ego-centric Traffic Anomaly Detection while being more efficient.
Authors:Wencan Huang, Daizong Liu, Wei Hu
Abstract:
While 3D Multi-modal Large Language Models (MLLMs) demonstrate remarkable scene understanding capabilities, their practical deployment faces critical challenges due to computational inefficiency. The key bottleneck stems from processing excessive object-centric visual tokens required for comprehensive 3D scene representation. Although visual token pruning has shown promise in accelerating 2D MLLMs, its applicability to 3D domains remains largely unexplored due to fundamental disparities in token structures. In this paper, we reveal two critical insights: (1) Significant redundancy exists in object-level 3D token representations, analogous to patch-level redundancy in 2D systems; (2) Global attention patterns exhibit strong predictive power for identifying non-essential tokens in 3D contexts. Building on these observations, we propose Fast3D, a plug-and-play visual token pruning framework for 3D MLLMs featuring two technical innovations: (1) Global Attention Prediction (GAP), where a lightweight neural network learns to predict the global attention distributions of the target model, enabling efficient token importance estimation for precise pruning guidance; (2) Sample-Adaptive visual token Pruning (SAP), which introduces dynamic token budgets through attention-based complexity assessment, automatically adjusting layer-wise pruning ratios based on input characteristics. Both of these two techniques operate without modifying the parameters of the target model. Extensive evaluations across five benchmarks validate the effectiveness of Fast3D, particularly under high visual token pruning ratios. Code is available at https://github.com/wencan25/Fast3D
中文:Fast3D框架通过全局注意力预测和自适应剪枝技术,在不改变模型参数的情况下有效减少3D多模态大语言模型中的冗余视觉标记,显著提升了计算效率并在高剪枝率下保持优异性能。
English: The Fast3D framework addresses computational inefficiency in 3D MLLMs by introducing global attention prediction and sample-adaptive pruning to reduce redundant visual tokens without altering model parameters, achieving strong performance under high pruning ratios.
Authors:Linus Walter, Qingkai Kong, Sara Hanson-Hedgecock, VÃctor Vilarrasa
Abstract:
Accurate representation of wells is essential for reliable reservoir characterization and simulation of operational scenarios in subsurface flow models. Physics-informed neural networks (PINNs) have recently emerged as a promising method for reservoir modeling, offering seamless integration of monitoring data and governing physical equations. However, existing PINN-based studies face major challenges in capturing fluid pressure near wells, particularly during the early stage after injection begins. To address this, we propose WellPINN, a modeling workflow that combines the outputs of multiple sequentially trained PINN models to accurately represent wells. This workflow iteratively approximates the radius of the equivalent well to match the actual well dimensions by decomposing the domain into stepwise shrinking subdomains with a simultaneously reducing equivalent well radius. Our results demonstrate that sequential training of superimposing networks around the pumping well is the first workflow that focuses on accurate inference of fluid pressure from pumping rates throughout the entire injection period, significantly advancing the potential of PINNs for inverse modeling and operational scenario simulations. All data and code for this paper will be made openly available at https://github.com/linuswalter/WellPINN.
中文摘要:提出的WellPINN工作流程通过顺序训练的物理信息神经网络和逐步缩小的子域划分,实现了整个注入期间井周流体压力的精确模拟,显著提升了物理信息神经网络在储层模拟中的逆向建模能力。
English Summary: The proposed WellPINN workflow uses sequentially trained physics-informed neural networks with stepwise domain decomposition to accurately model fluid pressure near wells throughout injection periods, advancing PINN capabilities for reservoir simulation.
Authors:Han Zhu, Wei Kang, Liyong Guo, Zengwei Yao, Fangjun Kuang, Weiji Zhuang, Zhaoqing Li, Zhifeng Han, Dong Zhang, Xin Zhang, Xingchen Song, Long Lin, Daniel Povey
Abstract:
Generating spoken dialogue is more challenging than monologue text-to-speech (TTS) due to the need for realistic turn-taking and distinct speaker timbres. Existing spoken dialogue generation models, being auto-regressive, suffer from slow and unstable inference. To overcome these limitations, we introduce ZipVoice-Dialog, a non-autoregressive zero-shot spoken dialogue generation model built upon flow matching. Key designs include: 1) speaker-turn embeddings for precise speaker turn-taking; 2) a curriculum learning strategy for stable speech-text alignment; 3) specialized strategies to enable stereo dialogue generation. Additionally, recognizing the lack of open-source large-scale spoken dialogue datasets, we curated OpenDialog, a 6.8k-hour spoken dialogue dataset from in-the-wild speech data. Furthermore, we established a benchmark to comprehensively evaluate various models. Experimental results demonstrate that ZipVoice-Dialog achieves superior performance in intelligibility, speaker turn-taking accuracy, speaker similarity, and inference speed. Our codes, model checkpoints, demo samples, and the OpenDialog dataset are all publicly available at https://github.com/k2-fsa/ZipVoice.
中文: ZipVoice-Dialog是一种基于流匹配的非自回归模型,通过说话人轮转嵌入和课程学习策略实现高效的零样本语音对话生成,并基于新构建的OpenDialog数据集在各项指标上展现出优越性能。
English: ZipVoice-Dialog is a non-autoregressive flow matching model that enables efficient zero-shot spoken dialogue generation with precise speaker turn-taking and stereo output, supported by the newly curated OpenDialog dataset and achieving superior performance across multiple metrics.
Authors:Yueqian Wang, Xiaojun Meng, Yifan Wang, Huishuai Zhang, Dongyan Zhao
Abstract:
With the growing research focus on multimodal dialogue systems, the capability for proactive interaction is gradually gaining recognition. As an alternative to conventional turn-by-turn dialogue, users increasingly expect multimodal systems to be more initiative, for example, by autonomously determining the timing of multi-turn responses in real time during video playback. To facilitate progress in this emerging area, we introduce ProactiveVideoQA, the first comprehensive benchmark to evaluate a system's ability to engage in proactive interaction. Since model responses are generated at varying timestamps, we further propose PAUC, the first metric that accounts for the temporal dynamics of model responses. This enables a more accurate evaluation of systems operating in proactive settings. Through extensive benchmarking of various baseline systems on ProactiveVideoQA and a user study of human preferences, we show that PAUC is in better agreement with human preferences than traditional evaluation metrics, which typically only consider the textual content of responses. These findings demonstrate that PAUC provides a more faithful assessment of user experience in proactive interaction scenarios. Project homepage: https://github.com/yellow-binary-tree/ProactiveVideoQA
中文摘要:提出了ProactiveVideoQA基准和PAUC评估指标,用以衡量多模态对话系统的主动交互能力,其中PAUC通过考量响应时间动态,被证实更符合人类偏好。
English Summary: The ProactiveVideoQA benchmark and PAUC metric are introduced to evaluate multimodal dialogue systems' proactive interaction capabilities, with PAUC proving more aligned with human preferences by accounting for response timing dynamics.
Authors:Zile Wang, Hao Yu, Jiabo Zhan, Chun Yuan
Abstract:
Recent advances in latent diffusion models have achieved remarkable results in high-fidelity RGB image synthesis by leveraging pretrained VAEs to compress and reconstruct pixel data at low computational cost. However, the generation of transparent or layered content (RGBA image) remains largely unexplored, due to the lack of large-scale benchmarks. In this work, we propose ALPHA, the first comprehensive RGBA benchmark that adapts standard RGB metrics to four-channel images via alpha blending over canonical backgrounds. We further introduce ALPHAVAE, a unified end-to-end RGBA VAE that extends a pretrained RGB VAE by incorporating a dedicated alpha channel. The model is trained with a composite objective that combines alpha-blended pixel reconstruction, patch-level fidelity, perceptual consistency, and dual KL divergence constraints to ensure latent fidelity across both RGB and alpha representations. Our RGBA VAE, trained on only 8K images in contrast to 1M used by prior methods, achieves a +4.9 dB improvement in PSNR and a +3.2% increase in SSIM over LayerDiffuse in reconstruction. It also enables superior transparent image generation when fine-tuned within a latent diffusion framework. Our code, data, and models are released on https://github.com/o0o0o00o0/AlphaVAE for reproducibility.
中文: 本文提出了首个全面的RGBA基准ALPHA和统一的RGBA变分自编码器ALPHAVAE,该模型仅用极少量训练数据就在透明图像重建和生成方面显著超越了现有方法。
English: This paper introduces ALPHA, the first comprehensive RGBA benchmark, and ALPHAVAE, a unified RGBA variational autoencoder that significantly outperforms existing methods in transparent image reconstruction and generation despite using minimal training data.
Authors:Zhiwei Xu
Abstract:
Path smoothness is often overlooked in path imitation learning from expert demonstrations. In this paper, we introduce a novel learning method, termed deep angular A* (DAA*), by incorporating the proposed path angular freedom (PAF) into A* to improve path similarity through adaptive path smoothness. The PAF aims to explore the effect of move angles on path node expansion by finding the trade-off between their minimum and maximum values, allowing for high adaptiveness for imitation learning. DAA* improves path optimality by closely aligning with the reference path through joint optimization of path shortening and smoothing, which correspond to heuristic distance and PAF, respectively. Throughout comprehensive evaluations on 7 datasets, including 4 maze datasets, 2 video-game datasets, and a real-world drone-view dataset containing 2 scenarios, we demonstrate remarkable improvements of our DAA* over neural A* in path similarity between the predicted and reference paths with a shorter path length when the shortest path is plausible, improving by 9.0% SPR, 6.9% ASIM, and 3.9% PSIM. Furthermore, when jointly learning pathfinding with both path loss and path probability map loss, DAA* significantly outperforms the state-of-the-art TransPath by 6.3% SPR, 6.0% PSIM, and 3.7% ASIM. We also discuss the minor trade-off between path optimality and search efficiency where applicable. Our code and model weights are available at https://github.com/zwxu064/DAAStar.git.
中文: 本文提出深度角度A*(DAA*)方法,通过引入路径角度自由度来优化路径平滑度与专家路径的相似性,在多个数据集上显著提升了路径相似性指标。
English: This paper introduces Deep Angular A* (DAA*), a novel method that enhances path imitation learning by incorporating path angular freedom to optimize smoothness and similarity to expert paths, achieving significant improvements in path metrics across multiple datasets.
Authors:Abdulvahap Mutlu, Åengül DoÄan, Türker Tuncer
Abstract:
The remarkable representational power of Vision Transformers (ViTs) remains underutilized in few-shot image classification. In this work, we introduce ViT-ProtoNet, which integrates a ViT-Small backbone into the Prototypical Network framework. By averaging class conditional token embeddings from a handful of support examples, ViT-ProtoNet constructs robust prototypes that generalize to novel categories under 5-shot settings. We conduct an extensive empirical evaluation on four standard benchmarks: Mini-ImageNet, FC100, CUB-200, and CIFAR-FS, including overlapped support variants to assess robustness. Across all splits, ViT-ProtoNet consistently outperforms CNN-based prototypical counterparts, achieving up to a 3.2\% improvement in 5-shot accuracy and demonstrating superior feature separability in latent space. Furthermore, it outperforms or is competitive with transformer-based competitors using a more lightweight backbone. Comprehensive ablations examine the impact of transformer depth, patch size, and fine-tuning strategy. To foster reproducibility, we release code and pretrained weights. Our results establish ViT-ProtoNet as a powerful, flexible approach for few-shot classification and set a new baseline for transformer-based meta-learners.
中文: ViT-ProtoNet通过将视觉Transformer骨干网络融入原型网络,在小样本图像分类中显著提升了准确率和特征可分性,在多个基准测试中以轻量级架构确立了新基线。
English: ViT-ProtoNet enhances few-shot image classification by integrating a Vision Transformer backbone into Prototypical Networks, achieving superior accuracy and feature separability across multiple benchmarks with a lightweight architecture.
Authors:Chenhao Ding, Jiangtao Zhang, Zongsheng Yue, Hui Wang, Qian Zhao, Deyu Meng
Abstract:
Deep prior-based approaches have demonstrated remarkable success in blind motion deblurring (BMD) recently. These methods, however, are often limited by the high non-convexity of the underlying optimization process in BMD, which leads to extreme sensitivity to the initial blur kernel. To address this issue, we propose a novel framework for BMD that leverages a deep generative model to encode the kernel prior and induce a better initialization for the blur kernel. Specifically, we pre-train a kernel generator based on a generative adversarial network (GAN) to aptly characterize the kernel's prior distribution, as well as a kernel initializer to provide a well-informed and high-quality starting point for kernel estimation. By combining these two components, we constrain the BMD solution within a compact latent kernel manifold, thus alleviating the aforementioned sensitivity for kernel initialization. Notably, the kernel generator and initializer are designed to be easily integrated with existing BMD methods in a plug-and-play manner, enhancing their overall performance. Furthermore, we extend our approach to tackle blind non-uniform motion deblurring without the need for additional priors, achieving state-of-the-art performance on challenging benchmark datasets. The source code is available at https://github.com/dch0319/GLKM-Deblur.
中文摘要:本文提出了一种新颖的盲运动去模糊框架,通过基于GAN的核生成器和初始化器提供更优的核初始化,有效降低对初始条件的敏感性,并在多个基准数据集上实现了最先进的性能。
English Summary: This paper introduces a novel blind motion deblurring framework that uses a GAN-based kernel generator and initializer to provide better kernel initialization, reducing sensitivity to initial conditions while achieving state-of-the-art performance.
Authors:Anita Kriz, Elizabeth Laura Janes, Xing Shen, Tal Arbel
Abstract:
Multimodal large language models (MLLMs) hold considerable promise for applications in healthcare. However, their deployment in safety-critical settings is hindered by two key limitations: (i) sensitivity to prompt design, and (ii) a tendency to generate incorrect responses with high confidence. As clinicians may rely on a model's stated confidence to gauge the reliability of its predictions, it is especially important that when a model expresses high confidence, it is also highly accurate. We introduce Prompt4Trust, the first reinforcement learning (RL) framework for prompt augmentation targeting confidence calibration in MLLMs. A lightweight LLM is trained to produce context-aware auxiliary prompts that guide a downstream task MLLM to generate responses in which the expressed confidence more accurately reflects predictive accuracy. Unlike conventional calibration techniques, Prompt4Trust specifically prioritizes aspects of calibration most critical for safe and trustworthy clinical decision-making. Beyond improvements driven by this clinically motivated calibration objective, our proposed method also improves task accuracy, achieving state-of-the-art medical visual question answering (VQA) performance on the PMC-VQA benchmark, which is composed of multiple-choice questions spanning diverse medical imaging modalities. Moreover, our framework trained with a small downstream task MLLM showed promising zero-shot generalization to larger MLLMs in our experiments, suggesting the potential for scalable calibration without the associated computational costs. This work demonstrates the potential of automated yet human-aligned prompt engineering for improving the the trustworthiness of MLLMs in safety critical settings. Our codebase can be found at https://github.com/xingbpshen/prompt4trust.
Chinese: Prompt4Trust是一种强化学习框架,旨在提升多模态大语言模型的置信度校准与任务准确性,尤其关注临床决策的可信度,并在医学视觉问答基准中实现了领先性能。
English: Prompt4Trust is a reinforcement learning framework that enhances multimodal large language models' confidence calibration and task accuracy, particularly for trustworthy clinical decision-making, achieving state-of-the-art performance in medical visual question answering.
Authors:Shuhan Ye, Yuanbin Qian, Chong Wang, Sunqi Lin, Jiazhen Xu, Jiangbo Qian, Yuqi Li
Abstract:
Recently, Spiking Neural Networks (SNNs) have demonstrated rich potential in computer vision domain due to their high biological plausibility, event-driven characteristic and energy-saving efficiency. Still, limited annotated event-based datasets and immature SNN architectures result in their performance inferior to that of Artificial Neural Networks (ANNs). To enhance the performance of SNNs on their optimal data format, DVS data, we explore using RGB data and well-performing ANNs to implement knowledge distillation. In this case, solving cross-modality and cross-architecture challenges is necessary. In this paper, we propose cross knowledge distillation (CKD), which not only leverages semantic similarity and sliding replacement to mitigate the cross-modality challenge, but also uses an indirect phased knowledge distillation to mitigate the cross-architecture challenge. We validated our method on main-stream neuromorphic datasets, including N-Caltech101 and CEP-DVS. The experimental results show that our method outperforms current State-of-the-Art methods. The code will be available at https://github.com/ShawnYE618/CKD
Chinese: 脉冲神经网络在计算机视觉领域潜力巨大,但因标注数据有限和架构不成熟而性能不及人工神经网络,为此提出跨知识蒸馏方法,通过语义相似性和间接分阶段蒸馏解决跨模态和跨架构问题,在主流神经形态数据集上表现优于现有最优方法。
English: Spiking Neural Networks (SNNs) show promise in computer vision but lag behind Artificial Neural Networks (ANNs) due to limited datasets and immature architectures, leading to the proposal of cross knowledge distillation (CKD) that addresses cross-modality and cross-architecture challenges and outperforms state-of-the-art methods on neuromorphic datasets.
Authors:Junyu Chen, Yihua Gao, Mingyuan Ge, Mingyong Li
Abstract:
Image-text matching is crucial for bridging the semantic gap between computer vision and natural language processing. However, existing methods still face challenges in handling high-order associations and semantic ambiguities among similar instances. These ambiguities arise from subtle differences between soft positive samples (semantically similar but incorrectly labeled) and soft negative samples (locally matched but globally inconsistent), creating matching uncertainties. Furthermore, current methods fail to fully utilize the neighborhood relationships among semantically similar instances within training batches, limiting the model's ability to learn high-order shared knowledge. This paper proposes the Ambiguity-Aware and High-order Relation learning framework (AAHR) to address these issues. AAHR constructs a unified representation space through dynamic clustering prototype contrastive learning, effectively mitigating the soft positive sample problem. The framework introduces global and local feature extraction mechanisms and an adaptive aggregation network, significantly enhancing full-grained semantic understanding capabilities. Additionally, AAHR employs intra-modal and inter-modal correlation matrices to investigate neighborhood relationships among sample instances thoroughly. It incorporates GNN to enhance semantic interactions between instances. Furthermore, AAHR integrates momentum contrastive learning to expand the negative sample set. These combined strategies significantly improve the model's ability to discriminate between features. Experimental results demonstrate that AAHR outperforms existing state-of-the-art methods on Flickr30K, MSCOCO, and ECCV Caption datasets, considerably improving the accuracy and efficiency of image-text matching. The code and model checkpoints for this research are available at https://github.com/Image-Text-Matching/AAHR .
中文摘要:AAHR框架通过动态聚类原型对比学习解决语义模糊问题,并采用多模态关联分析与图神经网络增强特征判别能力,在多个基准数据集上实现了最优的图像-文本匹配性能。
English Summary: The AAHR framework addresses image-text matching challenges by mitigating semantic ambiguities through dynamic clustering prototype contrastive learning and enhancing feature discrimination using multi-modal correlation analysis with GNN integration, achieving state-of-the-art performance on benchmark datasets.
Authors:Shiyi Mu, Zichong Gu, Hanqi Lyu, Yilin Gao, Shugong Xu
Abstract:
3D detection technology is widely used in the field of autonomous driving, with its application scenarios gradually expanding from enclosed highways to open conventional roads. For rare anomaly categories that appear on the road, 3D detection models trained on closed sets often misdetect or fail to detect anomaly objects. To address this risk, it is necessary to enhance the generalization ability of 3D detection models for targets of arbitrary shapes and to possess the capability to filter out anomalies. The generalization of 3D detection is limited by two factors: the coupled training of 2D and 3D, and the insufficient diversity in the scale distribution of training samples. This paper proposes a Stereo-based 3D Anomaly object Detection (S3AD) algorithm, which decouples the training strategy of 3D and 2D to release the generalization ability for arbitrary 3D foreground detection, and proposes an anomaly scoring algorithm based on foreground confidence prediction, achieving target-level anomaly scoring. In order to further verify and enhance the generalization of anomaly detection, we use a 3D rendering method to synthesize two augmented reality binocular stereo 3D detection datasets which named KITTI-AR. KITTI-AR extends upon KITTI by adding 97 new categories, totaling 6k pairs of stereo images. The KITTI-AR-ExD subset includes 39 common categories as extra training data to address the sparse sample distribution issue. Additionally, 58 rare categories form the KITTI-AR-OoD subset, which are not used in training to simulate zero-shot scenarios in real-world settings, solely for evaluating 3D anomaly detection. Finally, the performance of the algorithm and the dataset is verified in the experiments. (Code and dataset can be obtained at https://github.com/shiyi-mu/S3AD-Code).
Chinese: 本文提出S3AD算法,通过解耦2D和3D训练并采用新的评分方法,增强了自动驾驶中的3D异常检测能力,并在合成的KITTI-AR数据集上进行了验证。
English: This paper introduces the S3AD algorithm, which enhances 3D anomaly detection in autonomous driving by decoupling 2D and 3D training and using a novel scoring method, validated on the synthesized KITTI-AR dataset.
Authors:Dunsheng Huang, Dong Shen, Lei Lu, Ying Tan
Abstract:
Wavelet neural network (WNN), which learns an unknown nonlinear mapping from the data, has been widely used in signal processing, and time-series analysis. However, challenges in constructing accurate wavelet bases and high computational costs limit their application. This study introduces a constructive WNN that selects initial bases and trains functions by introducing new bases for predefined accuracy while reducing computational costs. For the first time, we analyze the frequency of unknown nonlinear functions and select appropriate initial wavelets based on their primary frequency components by estimating the energy of the spatial frequency component. This leads to a novel constructive framework consisting of a frequency estimator and a wavelet-basis increase mechanism to prioritize high-energy bases, significantly improving computational efficiency. The theoretical foundation defines the necessary time-frequency range for high-dimensional wavelets at a given accuracy. The framework's versatility is demonstrated through four examples: estimating unknown static mappings from offline data, combining two offline datasets, identifying time-varying mappings from time-series data, and capturing nonlinear dependencies in real time-series data. These examples showcase the framework's broad applicability and practicality. All the code will be released at https://github.com/dshuangdd/CWNN.
中文摘要:本研究提出了一种构造性小波神经网络,通过基于频率分析选择初始小波并实施基函数增加机制来提高计算效率,在多种数据处理任务中展现了广泛适用性。
English Summary: This study introduces a constructive wavelet neural network that improves computational efficiency by selecting initial wavelets based on frequency analysis and implementing a basis increase mechanism, demonstrating broad applicability across various data processing tasks.
Authors:Jonas Scholz, Richard E. Turner
Abstract:
Iterative generative models, like diffusion and flow-matching, create high-fidelity samples by progressively refining a noise vector into data. However, this process is notoriously slow, often requiring hundreds of function evaluations. We introduce the warm-start model, a simple, deterministic model that dramatically accelerates conditional generation by providing a better starting point. Instead of starting generation from an uninformed N(0, I) prior, our warm-start model predicts an informed prior N(mu, sigma), whose moments are conditioned on the input context. This "warm start" substantially reduces the distance the generative process must traverse, particularly when the conditioning information is strongly informative. On tasks like image inpainting, our method achieves results competitive with a 1000-step DDPM baseline using only 11 total function evaluations (1 for the warm start, 10 for generation). A simple conditional normalization trick makes our method compatible with any standard generative model and sampler without modification, allowing it to be combined with other efficient sampling techniques for further acceleration. Our implementation is available at https://github.com/jonas-scholz123/warm-start-model.
Chinese: 预热启动模型通过提供基于输入条件的有信息先验,显著减少了生成过程所需的函数评估次数,仅用11次评估即可达到与1000步基准相媲美的效果。
English: The warm-start model accelerates conditional generation by providing an informed prior that reduces the number of function evaluations needed, achieving competitive results with only 11 evaluations compared to 1000-step baselines.
Authors:Minhaj Uddin Ahmad, Akid Abrar, Sagar Dasgupta, Mizanur Rahman
Abstract:
We introduce OpenCAMS (Open-Source Connected and Automated Mobility Co-Simulation Platform), an open-source, synchronized, and extensible co-simulation framework that tightly couples three best-in-class simulation tools: (i) SUMO, (ii) CARLA, and (iii) OMNeT++. OpenCAMS is designed to support advanced research in transportation safety, mobility, and cybersecurity by combining the strengths of each simulation domain. Specifically, SUMO provides large-scale, microscopic traffic modeling; CARLA offers high-fidelity 3D perception, vehicle dynamics, and control simulation; and OMNeT++ enables modular, event-driven network communication, such as cellular vehicle-to-everything (C-V2X). OpenCAMS employs a time-synchronized, bidirectional coupling architecture that ensures coherent simulation progression across traffic, perception, and communication domains while preserving modularity and reproducibility. For example, CARLA can simulate and render a subset of vehicles that require detailed sensor emulation and control logic; SUMO orchestrates network-wide traffic flow, vehicle routing, and traffic signal management; and OMNeT++ dynamically maps communication nodes to both mobile entities (e.g., vehicles) and static entities (e.g., roadside units) to enable C-V2X communication. While these three simulators form the foundational core of OpenCAMS, the platform is designed to be expandable and future-proof, allowing additional simulators to be integrated on top of this core without requiring fundamental changes to the system architecture. The OpenCAMS platform is fully open-source and publicly available through its GitHub repository https://github.com/minhaj6/carla-sumo-omnetpp-cosim, providing the research community with an accessible, flexible, and collaborative environment for advancing next-generation intelligent transportation systems.
中文摘要:OpenCAMS是一个开源协同仿真平台,通过整合SUMO、CARLA和OMNeT++三大仿真工具,以时间同步的耦合架构支持交通、感知与通信领域的综合研究,同时保持模块化特性和可扩展性。
English Summary: OpenCAMS is an open-source co-simulation platform integrating SUMO, CARLA, and OMNeT++ to advance transportation research through synchronized traffic, perception, and communication simulations while maintaining modularity and reproducibility.
Authors:Qiyan Zhao, Xiaofeng Zhang, Yiheng Li, Yun Xing, Xiaosong Yuan, Feilong Tang, Sinan Fan, Xuhang Chen, Xuyao Zhang, Dahan Wang
Abstract:
Hallucinations pose a significant challenge in Large Vision Language Models (LVLMs), with misalignment between multimodal features identified as a key contributing factor. This paper reveals the negative impact of the long-term decay in Rotary Position Encoding (RoPE), used for positional modeling in LVLMs, on multimodal alignment. Concretely, under long-term decay, instruction tokens exhibit uneven perception of image tokens located at different positions within the two-dimensional space: prioritizing image tokens from the bottom-right region since in the one-dimensional sequence, these tokens are positionally closer to the instruction tokens. This biased perception leads to insufficient image-instruction interaction and suboptimal multimodal alignment. We refer to this phenomenon as image alignment bias. To enhance instruction's perception of image tokens at different spatial locations, we propose MCA-LLaVA, based on Manhattan distance, which extends the long-term decay to a two-dimensional, multi-directional spatial decay. MCA-LLaVA integrates the one-dimensional sequence order and two-dimensional spatial position of image tokens for positional modeling, mitigating hallucinations by alleviating image alignment bias. Experimental results of MCA-LLaVA across various hallucination and general benchmarks demonstrate its effectiveness and generality. The code can be accessed in https://github.com/ErikZ719/MCA-LLaVA.
中文: 本文发现旋转位置编码中的长期衰减会导致大型视觉语言模型产生感知偏差从而引发幻觉,并提出MCA-LLaVA模型通过将空间衰减扩展到二维来改善多模态对齐。
English: This paper identifies that the long-term decay in Rotary Position Encoding causes biased perception in Large Vision Language Models, leading to hallucinations, and proposes MCA-LLaVA to mitigate this by extending spatial decay to two dimensions for better multimodal alignment.
Authors:Yongwei Jiang, Yixiong Zou, Yuhua Li, Ruixuan Li
Abstract:
Few-Shot Class-Incremental Learning (FSCIL) faces dual challenges of data scarcity and incremental learning in real-world scenarios. While pool-based prompting methods have demonstrated success in traditional incremental learning, their effectiveness in FSCIL settings remains unexplored. This paper presents the first study of current prompt pool methods in FSCIL tasks, revealing an unanticipated performance degradation in incremental sessions. Through comprehensive analysis, we identify that this phenomenon stems from token-dimension saturation: with limited data, excessive prompts compete for task-relevant information, leading to model overfitting. Based on this finding, we propose LGSP-Prompt (Local-Global Spatial Prompting), which innovatively shifts pool-based prompt learning from the token dimension to the spatial dimension. LGSP-Prompt generates spatial prompts by synergistically combining local spatial features and global frequency-domain representations to highlight key patterns in input images. We construct two spatial prompt pools enabling dynamic prompt selection to maintain acquired knowledge while effectively learning novel sessions. Extensive experiments demonstrate that our approach achieves state-of-the-art performance across multiple FSCIL benchmarks, showing significant advantages in both base knowledge preservation and incremental learning. Our implementation is available at https://github.com/Jywsuperman/LGSP.
中文摘要:本文提出LGSP-Prompt方法,通过将提示学习从词元维度转向空间维度,结合局部空间特征和全局频域表示生成动态空间提示,有效解决了小样本类增量学习中的性能下降问题,在多个基准测试中实现了最先进的性能。
English Summary: This paper introduces LGSP-Prompt, a novel method that addresses performance degradation in Few-Shot Class-Incremental Learning by shifting prompt learning from token to spatial dimensions, achieving state-of-the-art results through dynamic spatial prompts combining local features and global frequency representations.
Authors:Han Ye, Yuqiang Jin, Jinyuan Liu, Tao Li, Wen-An Zhang, Minglei Fu
Abstract:
Accurate extrinsic calibration of multiple LiDARs is crucial for improving the foundational performance of three-dimensional (3D) map reconstruction systems. This paper presents a novel targetless extrinsic calibration framework for multi-LiDAR systems that does not rely on overlapping fields of view or precise initial parameter estimates. Unlike conventional calibration methods that require manual annotations or specific reference patterns, our approach introduces a unified optimization framework by integrating LiDAR bundle adjustment (LBA) optimization with robust iterative refinement. The proposed method constructs an accurate reference point cloud map via continuous scanning from the target LiDAR and sliding-window LiDAR bundle adjustment, while formulating extrinsic calibration as a joint LBA optimization problem. This method effectively mitigates cumulative mapping errors and achieves outlier-resistant parameter estimation through an adaptive weighting mechanism. Extensive evaluations in both the CARLA simulation environment and real-world scenarios demonstrate that our method outperforms state-of-the-art calibration techniques in both accuracy and robustness. Experimental results show that for non-overlapping sensor configurations, our framework achieves an average translational error of 5 mm and a rotational error of 0.2°, with an initial error tolerance of up to 0.4 m/30°. Moreover, the calibration process operates without specialized infrastructure or manual parameter tuning. The code is open source and available on GitHub (\underline{https://github.com/Silentbarber/DLBAcalib})
Chinese: 本文提出了一种新型多激光雷达无目标外参标定框架,通过激光雷达束调整优化实现了高精度和鲁棒性,在平移和旋转误差上分别达到5毫米和0.2度的优异表现,超越了现有标定方法。
English: This paper introduces a novel targetless extrinsic calibration framework for multi-LiDAR systems that achieves high accuracy and robustness through LiDAR bundle adjustment optimization, outperforming existing methods with average errors of 5 mm in translation and 0.2° in rotation.
Authors:Shuo Yang, Zijian Yu, Zhenzhe Ying, Yuqin Dai, Guoqing Wang, Jun Lan, Jinfeng Xu, Jinze Li, Edith C. H. Ngai
Abstract:
The rapid proliferation of multimodal misinformation presents significant challenges for automated fact-checking systems, especially when claims are ambiguous or lack sufficient context. We introduce RAMA, a novel retrieval-augmented multi-agent framework designed for verifying multimedia misinformation. RAMA incorporates three core innovations: (1) strategic query formulation that transforms multimodal claims into precise web search queries; (2) cross-verification evidence aggregation from diverse, authoritative sources; and (3) a multi-agent ensemble architecture that leverages the complementary strengths of multiple multimodal large language models and prompt variants. Extensive experiments demonstrate that RAMA achieves superior performance on benchmark datasets, particularly excelling in resolving ambiguous or improbable claims by grounding verification in retrieved factual evidence. Our findings underscore the necessity of integrating web-based evidence and multi-agent reasoning for trustworthy multimedia verification, paving the way for more reliable and scalable fact-checking solutions. RAMA will be publicly available at https://github.com/kalendsyang/RAMA.git.
中文:RAMA框架通过策略性查询构建、交叉验证证据整合及多智能体协同,利用检索到的事实证据显著提升了多媒体虚假信息的验证效果,尤其在处理模糊或可疑声明方面表现卓越。
English: The RAMA framework enhances multimedia misinformation verification by employing strategic query formulation, cross-verification evidence aggregation, and a multi-agent ensemble to achieve superior performance on ambiguous claims through retrieved factual evidence.
Authors:Ali Vosoughi, Ayoub Shahnazari, Yufeng Xi, Zeliang Zhang, Griffin Hess, Chenliang Xu, Niaz Abdolrahim
Abstract:
This work presents OPENXRD, an open-book pipeline designed for crystallography question answering, which integrates textual prompts with concise supporting content generated by GPT-4.5. Instead of using scanned textbooks, which may lead to copyright issues, OPENXRD generates compact, domain-specific references that help smaller models understand key concepts in X-ray diffraction (XRD). We evaluate OPENXRD on a well-defined set of 217 expert-level XRD questions by comparing different vision-language models, including GPT-4 and LLaVA-based frameworks such as Mistral, LLaMA, and QWEN, under both closed-book (without supporting material) and open-book (with supporting material) conditions. Our experimental results show significant accuracy improvements in models that use the GPT-4.5-generated summaries, particularly those with limited prior training in crystallography. OPENXRD uses knowledge from larger models to fill knowledge gaps in crystallography and shows that AI-generated texts can help smaller models reason more effectively in scientific tasks. While the current version of OPENXRD focuses on text-based inputs, we also explore future extensions such as adding real crystal diagrams or diffraction patterns to improve interpretation in specialized materials science contexts. Overall, OPENXRD shows that specialized open-book systems can be useful in materials science and provides a foundation for broader natural language processing (NLP) tools in critical scientific fields.
中文: OPENXRD是一种开放书式流程,通过GPT-4.5生成的摘要帮助较小模型在X射线衍射问题上显著提升回答准确率,既规避了版权问题,又为未来科学领域的自然语言处理工具奠定了基础。
English: OPENXRD is an open-book pipeline that enhances crystallography question answering by using GPT-4.5-generated summaries to help smaller models improve accuracy, particularly in X-ray diffraction tasks, while avoiding copyright issues and providing a foundation for future scientific NLP tools.
Authors:Zhimin Liao, Ping Wei, Ruijie Zhang, Shuaijia Chen, Haoxuan Wang, Ziyang Ren
Abstract:
Forecasting the evolution of 3D scenes and generating unseen scenarios via occupancy-based world models offers substantial potential for addressing corner cases in autonomous driving systems. While tokenization has revolutionized image and video generation, efficiently tokenizing complex 3D scenes remains a critical challenge for 3D world models. To address this, we propose $I^{2}$-World, an efficient framework for 4D occupancy forecasting. Our method decouples scene tokenization into intra-scene and inter-scene tokenizers. The intra-scene tokenizer employs a multi-scale residual quantization strategy to hierarchically compress 3D scenes while preserving spatial details. The inter-scene tokenizer residually aggregates temporal dependencies across timesteps. This dual design preserves the compactness of 3D tokenizers while retaining the dynamic expressiveness of 4D tokenizers. Unlike decoder-only GPT-style autoregressive models, $I^{2}$-World adopts an encoder-decoder architecture. The encoder aggregates spatial context from the current scene and predicts a transformation matrix to enable high-level control over scene generation. The decoder, conditioned on this matrix and historical tokens, ensures temporal consistency during generation. Experiments demonstrate that $I^{2}$-World achieves state-of-the-art performance, outperforming existing methods by 25.1\% in mIoU and 36.9\% in IoU for 4D occupancy forecasting while exhibiting exceptional computational efficiency: it requires merely 2.9 GB of training memory and achieves real-time inference at 37.0 FPS. Our code is available on https://github.com/lzzzzzm/II-World.
中文: I²-World提出了一种高效的4D占据预测框架,通过解耦场景标记化为场景内与场景间组件,以25.1%的mIoU提升和37.0 FPS实时推理速度实现了最先进性能。
English: I²-World introduces an efficient 4D occupancy forecasting framework that decouples scene tokenization into intra-scene and inter-scene components, achieving state-of-the-art performance with 25.1% higher mIoU and 37.0 FPS real-time inference.
Authors:Dewen Zhang, Tahir Hussain, Wangpeng An, Hayaru Shouno
Abstract:
Human pose estimation traditionally relies on architectures that encode keypoint priors, limiting their generalization to novel poses or unseen keypoints. Recent language-guided approaches like LocLLM reformulate keypoint localization as a vision-language task, enabling zero-shot generalization through textual descriptions. However, LocLLM's linear projector fails to capture complex spatial-textual interactions critical for high-precision localization. To address this, we propose PoseLLM, the first Large Language Model (LLM)-based pose estimation framework that replaces the linear projector with a nonlinear MLP vision-language connector. This lightweight two-layer MLP with GELU activation enables hierarchical cross-modal feature transformation, enhancing the fusion of visual patches and textual keypoint descriptions. Trained exclusively on COCO data, PoseLLM achieves 77.8 AP on the COCO validation set, outperforming LocLLM by +0.4 AP, while maintaining strong zero-shot generalization on Human-Art and MPII. Our work demonstrates that a simple yet powerful nonlinear connector significantly boosts localization accuracy without sacrificing generalization, advancing the state-of-the-art in language-guided pose estimation. Code is available at https://github.com/Ody-trek/PoseLLM.
中文: PoseLLM采用非线性MLP视觉语言连接器替代LocLLM的线性投影器,通过增强跨模态特征融合,在提升姿态估计精度的同时保持了优秀的零样本泛化能力。
English: PoseLLM introduces a nonlinear MLP vision-language connector to replace the linear projector in LocLLM, enhancing cross-modal feature fusion for improved pose estimation accuracy and zero-shot generalization.
Authors:Linlan Huang, Xusheng Cao, Haori Lu, Yifan Meng, Fei Yang, Xialei Liu
Abstract:
Continual learning aims to enable models to learn sequentially from continuously incoming data while retaining performance on previously learned tasks. With the Contrastive Language-Image Pre-trained model (CLIP) exhibiting strong capabilities across various downstream tasks, there has been growing interest in leveraging CLIP for continual learning in such scenarios. Most existing works overlook the inherent modality gap in CLIP, a key factor in its generalization and adaptability. In this paper, we analyze the variations in the modality gap during the fine-tuning of vision-language pre-trained models. Our observations reveal that the modality gap effectively reflects the extent to which pre-trained knowledge is preserved. Based on these insights, we propose a simple yet effective method, MG-CLIP, that improves CLIP's performance in class-incremental learning. Our approach leverages modality gap preservation to mitigate forgetting and modality gap compensation to enhance the capacity for new data, introducing a novel modality-gap-based perspective for continual learning. Extensive experiments on multiple benchmarks demonstrate that our method outperforms existing approaches without requiring additional replay data. Our code is available at https://github.com/linlany/MindtheGap.
中文: 本文提出MG-CLIP方法,通过利用CLIP中的模态差距来保留预训练知识并适应新数据,从而在无需额外回放数据的情况下提升持续学习性能。
English: This paper introduces MG-CLIP, a method that leverages the modality gap in CLIP to enhance continual learning by preserving pre-trained knowledge and adapting to new data, achieving superior performance without extra replay data.
Authors:Di Wen, Kunyu Peng, Kailun Yang, Yufan Chen, Ruiping Liu, Junwei Zheng, Alina Roitberg, Danda Pani Paudel, Luc Van Gool, Rainer Stiefelhagen
Abstract:
Human-Object Interaction (HOI) detection is crucial for robot-human assistance, enabling context-aware support. However, models trained on clean datasets degrade in real-world conditions due to unforeseen corruptions, leading to inaccurate prediction. To address this, we introduce the first robustness benchmark for HOI detection, evaluating model resilience under diverse challenges. Despite advances, current models struggle with environmental variability, occlusions, and noise. Our benchmark, RoHOI, includes 20 corruption types based on the HICO-DET and V-COCO datasets and a new robustness-focused metric. We systematically analyze existing models in the HOI field, revealing significant performance drops under corruptions. To improve robustness, we propose a Semantic-Aware Masking-based Progressive Learning (SAMPL) strategy to guide the model to be optimized based on holistic and partial cues, thus dynamically adjusting the model's optimization to enhance robust feature learning. Extensive experiments show that our approach outperforms state-of-the-art methods, setting a new standard for robust HOI detection. Benchmarks, datasets, and code will be made publicly available at https://github.com/Kratos-Wen/RoHOI.
中文: 本研究提出了首个用于人-物交互检测的鲁棒性基准RoHOI,并设计了一种基于语义感知掩码的渐进学习策略,有效提升了模型在现实干扰下的稳健性,性能优于现有方法。
English: The study introduces RoHOI, the first robustness benchmark for Human-Object Interaction detection, and proposes a Semantic-Aware Masking-based Progressive Learning strategy that significantly enhances model resilience against real-world corruptions, outperforming existing methods.
Authors:Gianluigi Silvestri, Luca Ambrogioni
Abstract:
Current state-of-the-art generative approaches frequently rely on a two-stage training procedure, where an autoencoder (often a VAE) first performs dimensionality reduction, followed by training a generative model on the learned latent space. While effective, this introduces computational overhead and increased sampling times. We challenge this paradigm by proposing Consistency Training of Variational AutoEncoders (CoVAE), a novel single-stage generative autoencoding framework that adopts techniques from consistency models to train a VAE architecture. The CoVAE encoder learns a progressive series of latent representations with increasing encoding noise levels, mirroring the forward processes of diffusion and flow matching models. This sequence of representations is regulated by a time dependent $β$ parameter that scales the KL loss. The decoder is trained using a consistency loss with variational regularization, which reduces to a conventional VAE loss at the earliest latent time. We show that CoVAE can generate high-quality samples in one or few steps without the use of a learned prior, significantly outperforming equivalent VAEs and other single-stage VAEs methods. Our approach provides a unified framework for autoencoding and diffusion-style generative modeling and provides a viable route for one-step generative high-performance autoencoding. Our code is publicly available at https://github.com/gisilvs/covae.
中文: 作者提出CoVAE这一单阶段生成式自编码框架,将一致性模型技术融入VAE架构,无需学习先验分布即可通过少量步骤生成高质量样本,性能显著优于现有方法。
English: The authors propose CoVAE, a single-stage generative autoencoding framework that integrates consistency model techniques into a VAE architecture, enabling high-quality sample generation in few steps without a learned prior and outperforming existing methods.
Authors:Yiyang Chen, Shanshan Zhao, Lunhao Duan, Changxing Ding, Dacheng Tao
Abstract:
Diffusion-based models, widely used in text-to-image generation, have proven effective in 2D representation learning. Recently, this framework has been extended to 3D self-supervised learning by constructing a conditional point generator for enhancing 3D representations. However, its performance remains constrained by the 3D diffusion model, which is trained on the available 3D datasets with limited size. We hypothesize that the robust capabilities of text-to-image diffusion models, particularly Stable Diffusion (SD), which is trained on large-scale datasets, can help overcome these limitations. To investigate this hypothesis, we propose PointSD, a framework that leverages the SD model for 3D self-supervised learning. By replacing the SD model's text encoder with a 3D encoder, we train a point-to-image diffusion model that allows point clouds to guide the denoising of rendered noisy images. With the trained point-to-image diffusion model, we use noise-free images as the input and point clouds as the condition to extract SD features. Next, we train a 3D backbone by aligning its features with these SD features, thereby facilitating direct semantic learning. Comprehensive experiments on downstream point cloud tasks and ablation studies demonstrate that the SD model can enhance point cloud self-supervised learning. Code is publicly available at https://github.com/wdttt/PointSD.
中文摘要:PointSD框架利用预训练的Stable Diffusion模型,通过训练点云到图像的扩散模型将三维点云特征与语义图像特征对齐,从而增强三维自监督学习。
English Summary: PointSD is a framework that leverages the pre-trained Stable Diffusion model to enhance 3D self-supervised learning by training a point-to-image diffusion model that aligns 3D point cloud features with semantic image features.
Authors:Anthony Miyaguchi, Conor Johnston, Aaryan Potdar
Abstract:
Large Language Models (LLMs) demonstrate strong conversational abilities. In this Working Paper, we study them in the context of debating in two ways: their ability to perform in a structured debate along with a dataset of arguments to use and their ability to evaluate utterances throughout the debate. We deploy six leading publicly available models from three providers for the Retrieval-Augmented Debate and Evaluation. The evaluation is performed by measuring four key metrics: Quality, Quantity, Manner, and Relation. Throughout this task, we found that although LLMs perform well in debates when given related arguments, they tend to be verbose in responses yet consistent in evaluation. The accompanying source code for this paper is located at https://github.com/dsgt-arc/touche-2025-rad.
中文: 本研究评估了六种主流大语言模型在结构化辩论中的表现,发现模型在获得相关论据时表现良好,但回答冗长,同时在质量、数量、方式和相关性四个维度的评估中保持一致性。
English: This study evaluates six leading LLMs in structured debates, finding they perform well with relevant arguments but tend to be verbose while maintaining consistent evaluation across quality, quantity, manner, and relation metrics.
Authors:Esraa Elelimy, Brett Daley, Andrew Patterson, Marlos C. Machado, Adam White, Martha White
Abstract:
Achieving fast and stable off-policy learning in deep reinforcement learning (RL) is challenging. Most existing methods rely on semi-gradient temporal-difference (TD) methods for their simplicity and efficiency, but are consequently susceptible to divergence. While more principled approaches like Gradient TD (GTD) methods have strong convergence guarantees, they have rarely been used in deep RL. Recent work introduced the generalized Projected Bellman Error ($\overline{\text{PBE}}$), enabling GTD methods to work efficiently with nonlinear function approximation. However, this work is limited to one-step methods, which are slow at credit assignment and require a large number of samples. In this paper, we extend the generalized $\overline{\text{PBE}}$ objective to support multistep credit assignment based on the $λ$-return and derive three gradient-based methods that optimize this new objective. We provide both a forward-view formulation compatible with experience replay and a backward-view formulation compatible with streaming algorithms. Finally, we evaluate the proposed algorithms and show that they outperform both PPO and StreamQ in MuJoCo and MinAtar environments, respectively. Code available at https://github.com/esraaelelimy/gtd\_algos
中文摘要:本文将广义投影贝尔曼误差扩展至基于λ回报的多步信用分配,提出了三种梯度优化方法,在MuJoCo和MinAtar环境中均优于现有算法。
English Summary: This paper extends the generalized Projected Bellman Error to multistep credit assignment using λ-return, developing three gradient-based methods that outperform existing algorithms in MuJoCo and MinAtar environments.
Authors:Frédéric A. Dreyer, Jan Ludwiczak, Karolis Martinkus, Brennan Abanades, Robert G. Alberstein, Pan Kessel, Pranav Rao, Jae Hyeon Lee, Richard Bonneau, Andrew M. Watkins, Franziska Seeger
Abstract:
We introduce Ibex, a pan-immunoglobulin structure prediction model that achieves state-of-the-art accuracy in modeling the variable domains of antibodies, nanobodies, and T-cell receptors. Unlike previous approaches, Ibex explicitly distinguishes between bound and unbound protein conformations by training on labeled apo and holo structural pairs, enabling accurate prediction of both states at inference time. Using a comprehensive private dataset of high-resolution antibody structures, we demonstrate superior out-of-distribution performance compared to existing specialized and general protein structure prediction tools. Ibex combines the accuracy of cutting-edge models with significantly reduced computational requirements, providing a robust foundation for accelerating large molecule design and therapeutic development.
中文: Ibex是一种先进的泛免疫球蛋白结构预测模型,通过区分结合与未结合构象,能精确预测抗体、纳米抗体及T细胞受体的可变域结构,在显著降低计算需求的同时提供卓越性能。
English: Ibex is a state-of-the-art pan-immunoglobulin structure prediction model that accurately models variable domains of antibodies, nanobodies, and T-cell receptors by distinguishing between bound and unbound conformations, offering superior performance with reduced computational needs.
Authors:Zhengxiao He, Huayu Li, Geng Yuan, William D. S. Killgore, Stuart F. Quan, Chen X. Chen, Ao Li
Abstract:
Methods: We developed a self-supervised deep learning model that extracts meaningful patterns from multi-modal signals (Electroencephalography (EEG), Electrocardiography (ECG), and respiratory signals). The model was trained on data from 4,398 participants. Projection scores were derived by contrasting embeddings from individuals with and without CVD outcomes. External validation was conducted in an independent cohort with 1,093 participants. The source code is available on https://github.com/miraclehetech/sleep-ssl. Results: The projection scores revealed distinct and clinically meaningful patterns across modalities. ECG-derived features were predictive of both prevalent and incident cardiac conditions, particularly CVD mortality. EEG-derived features were predictive of incident hypertension and CVD mortality. Respiratory signals added complementary predictive value. Combining these projection scores with the Framingham Risk Score consistently improved predictive performance, achieving area under the curve values ranging from 0.607 to 0.965 across different outcomes. Findings were robustly replicated and validated in the external testing cohort. Conclusion: Our findings demonstrate that the proposed framework can generate individualized CVD risk scores directly from PSG data. The resulting projection scores have the potential to be integrated into clinical practice, enhancing risk assessment and supporting personalized care.
中文: 本研究开发了一种自监督深度学习模型,能从多模态睡眠信号中提取具有临床意义的特征,结合传统风险评分可显著提升心血管疾病预测效能,并在外部验证中表现出稳健性能。
English: A self-supervised deep learning model was developed to extract clinically meaningful patterns from multi-modal sleep signals, which when combined with traditional risk scores significantly improved cardiovascular disease prediction and demonstrated robust external validation.
Authors:Hanene F. Z. Brachemi Meftah, Wassim Hamidouche, Sid Ahmed Fezza, Olivier Déforges
Abstract:
Recent years have witnessed remarkable progress in developing Vision-Language Models (VLMs) capable of processing both textual and visual inputs. These models have demonstrated impressive performance, leading to their widespread adoption in various applications. However, this widespread raises serious concerns regarding user privacy, particularly when models inadvertently process or expose private visual information. In this work, we frame the preservation of privacy in VLMs as an adversarial attack problem. We propose a novel attack strategy that selectively conceals information within designated Region Of Interests (ROIs) in an image, effectively preventing VLMs from accessing sensitive content while preserving the semantic integrity of the remaining image. Unlike conventional adversarial attacks that often disrupt the entire image, our method maintains high coherence in unmasked areas. Experimental results across three state-of-the-art VLMs namely LLaVA, Instruct-BLIP, and BLIP2-T5 demonstrate up to 98% reduction in detecting targeted ROIs, while maintaining global image semantics intact, as confirmed by high similarity scores between clean and adversarial outputs. We believe that this work contributes to a more privacy conscious use of multimodal models and offers a practical tool for further research, with the source code publicly available at: https://github.com/hbrachemi/Vlm_defense-attack.
Chinese: 本研究提出了一种新颖的对抗性攻击方法,通过选择性遮蔽图像中的敏感区域来保护用户隐私免受视觉语言模型的侵害,同时保持图像整体语义完整性,在三种先进VLM上实现了目标检测率最高降低98%的效果。
English: This study introduces a novel adversarial attack method that selectively conceals sensitive regions in images to protect user privacy from Vision-Language Models while maintaining overall image semantics, achieving up to 98% reduction in targeted detection across three advanced VLMs.
Authors:Chenyu Wang, Cai Zhou, Sharut Gupta, Zongyu Lin, Stefanie Jegelka, Stephen Bates, Tommi Jaakkola
Abstract:
Diffusion models can be improved with additional guidance towards more effective representations of input. Indeed, prior empirical work has already shown that aligning internal representations of the diffusion model with those of pre-trained models improves generation quality. In this paper, we present a systematic framework for incorporating representation guidance into diffusion models. We provide alternative decompositions of denoising models along with their associated training criteria, where the decompositions determine when and how the auxiliary representations are incorporated. Guided by our theoretical insights, we introduce two new strategies for enhancing representation alignment in diffusion models. First, we pair examples with target representations either derived from themselves or arisen from different synthetic modalities, and subsequently learn a joint model over the multimodal pairs. Second, we design an optimal training curriculum that balances representation learning and data generation. Our experiments across image, protein sequence, and molecule generation tasks demonstrate superior performance as well as accelerated training. In particular, on the class-conditional ImageNet $256\times 256$ benchmark, our guidance results in $23.3$ times faster training than the original SiT-XL as well as four times speedup over the state-of-the-art method REPA. The code is available at https://github.com/ChenyuWang-Monica/REED.
Chinese: 本文提出了一种系统框架,通过引入表征指导来增强扩散模型,在多个任务中提升了生成质量并加速训练,如在ImageNet上实现了23.3倍的训练加速。
English: This paper introduces a systematic framework to enhance diffusion models by incorporating representation guidance, which accelerates training and improves generation quality across various tasks, as demonstrated by a 23.3 times faster training speed on ImageNet.
Authors:Mahdiyar Molahasani, Azadeh Motamedi, Michael Greenspan, Il-Min Kim, Ali Etemad
Abstract:
We introduce Projection-based Reduction of Implicit Spurious bias in vision-language Models (PRISM), a new data-free and task-agnostic solution for bias mitigation in VLMs like CLIP. VLMs often inherit and amplify biases in their training data, leading to skewed predictions. PRISM is designed to debias VLMs without relying on predefined bias categories or additional external data. It operates in two stages: first, an LLM is prompted with simple class prompts to generate scene descriptions that contain spurious correlations. Next, PRISM uses our novel contrastive-style debiasing loss to learn a projection that maps the embeddings onto a latent space that minimizes spurious correlations while preserving the alignment between image and text embeddings.Extensive experiments demonstrate that PRISM outperforms current debiasing methods on the commonly used Waterbirds and CelebA datasets We make our code public at: https://github.com/MahdiyarMM/PRISM.
Chinese: PRISM是一种无需外部数据且任务无关的新方法,通过使用大语言模型生成包含虚假相关性的场景描述,并应用对比式去偏损失学习投影映射,在保持图像-文本对齐的同时有效减少视觉语言模型中的偏见。
English: PRISM is a novel data-free and task-agnostic method that mitigates bias in vision-language models by using an LLM to generate scene descriptions with spurious correlations and applying a contrastive debiasing loss to learn a projection that reduces these biases while maintaining image-text alignment.
Authors:Xiaowen Zhang, Zhenyu Bi, Patrick Lachance, Xuan Wang, Tiziana Di Matteo, Rupert A. C. Croft
Abstract:
As cosmological simulations and their associated software become increasingly complex, physicists face the challenge of searching through vast amounts of literature and user manuals to extract simulation parameters from dense academic papers, each using different models and formats. Translating these parameters into executable scripts remains a time-consuming and error-prone process. To improve efficiency in physics research and accelerate the cosmological simulation process, we introduce SimAgents, a multi-agent system designed to automate both parameter configuration from the literature and preliminary analysis for cosmology research. SimAgents is powered by specialized LLM agents capable of physics reasoning, simulation software validation, and tool execution. These agents collaborate through structured communication, ensuring that extracted parameters are physically meaningful, internally consistent, and software-compliant. We also construct a cosmological parameter extraction evaluation dataset by collecting over 40 simulations in published papers from Arxiv and leading journals that cover diverse simulation types. Experiments on the dataset demonstrate a strong performance of SimAgents, highlighting its effectiveness and potential to accelerate scientific research for physicists. Our demonstration video is available at: https://youtu.be/w1zLpm_CaWA. The complete system and dataset are publicly available at https://github.com/xwzhang98/SimAgents.
中文摘要:SimAgents是一个多智能体系统,能够自动从学术文献中提取宇宙学模拟参数并生成可执行脚本,通过确保物理一致性和软件兼容性,显著提升了物理研究的效率。
English Summary: SimAgents is a multi-agent system that automates the extraction and validation of cosmological simulation parameters from academic literature and generates executable scripts, significantly improving research efficiency by ensuring physical consistency and software compliance.
Authors:Tomasz Szandala, Fatima Ezzeddine, Natalia Rusin, Silvia Giordano, Omran Ayoub
Abstract:
Artificial Intelligence-generated content has become increasingly popular, yet its malicious use, particularly the deepfakes, poses a serious threat to public trust and discourse. While deepfake detection methods achieve high predictive performance, they often exhibit biases across demographic attributes such as ethnicity and gender. In this work, we tackle the challenge of fair deepfake detection, aiming to mitigate these biases while maintaining robust detection capabilities. To this end, we propose a novel post-processing approach, referred to as Fairness-Oriented Final Layer Input Prioritising (Fair-FLIP), that reweights a trained model's final-layer inputs to reduce subgroup disparities, prioritising those with low variability while demoting highly variable ones. Experimental results comparing Fair-FLIP to both the baseline (without fairness-oriented de-biasing) and state-of-the-art approaches show that Fair-FLIP can enhance fairness metrics by up to 30% while maintaining baseline accuracy, with only a negligible reduction of 0.25%.
Code is available on Github: https://github.com/szandala/fair-deepfake-detection-toolbox
中文: 本研究提出Fair-FLIP后处理方法,通过重新加权模型最终层输入,在保持基线准确率的同时将深度伪造检测的 demographic 偏差降低达30%,性能损失仅0.25%。
English: The study introduces Fair-FLIP, a post-processing method that reduces demographic biases in deepfake detection by up to 30% while preserving baseline accuracy with minimal performance loss.
Authors:Sergio Mares, Ariel Espinoza Weinberger, Nilah M. Ioannidis
Abstract:
Personalized vaccines and T-cell immunotherapies depend critically on identifying peptide-MHC class I (pMHC-I) interactions capable of eliciting potent immune responses. However, current benchmarks and models inherit biases present in mass-spectrometry and binding-assay datasets, limiting discovery of novel peptide ligands. To address this issue, we introduce a structure-guided benchmark of pMHC-I peptides designed using diffusion models conditioned on crystal structure interaction distances. Spanning twenty high-priority HLA alleles, this benchmark is independent of previously characterized peptides yet reproduces canonical anchor residue preferences, indicating structural generalization without experimental dataset bias. Using this resource, we demonstrate that state-of-the-art sequence-based predictors perform poorly at recognizing the binding potential of these structurally stable designs, indicating allele-specific limitations invisible in conventional evaluations. Our geometry-aware design pipeline yields peptides with high predicted structural integrity and higher residue diversity than existing datasets, representing a key resource for unbiased model training and evaluation. Our code, and data are available at: https://github.com/sermare/struct-mhc-dev.
中文: 本研究通过基于晶体结构距离的扩散模型,构建了不受实验数据偏差影响的pMHC-I肽段基准数据集,揭示了现有预测模型的局限性,为无偏见的免疫疗法开发提供了关键资源。
English: This study introduces a structure-guided benchmark for peptide-MHC class I interactions using diffusion models to overcome biases in existing datasets, revealing limitations in current predictors and providing a resource for unbiased immunotherapy development.
Authors:Wenliang Shan, Michael Fu, Rui Yang, Chakkrit Tantithamthavorn
Abstract:
Safety alignment is critical for LLM-powered systems. While recent LLM-powered guardrail approaches such as LlamaGuard achieve high detection accuracy of unsafe inputs written in English (e.g., ``How to create a bomb?''), they struggle with multilingual unsafe inputs. This limitation leaves LLM systems vulnerable to unsafe and jailbreak prompts written in low-resource languages such as those in Southeast Asia. This paper introduces SEALGuard, a multilingual guardrail designed to improve the safety alignment across diverse languages. It aims to address the multilingual safety alignment gap of existing guardrails and ensure effective filtering of unsafe and jailbreak prompts in LLM-powered systems. We adapt a general-purpose multilingual language model into a multilingual guardrail using low-rank adaptation (LoRA). We construct SEALSBench, a large-scale multilingual safety alignment dataset containing over 260,000 prompts in ten languages, including safe, unsafe, and jailbreak cases. We evaluate SEALGuard against state-of-the-art guardrails such as LlamaGuard on this benchmark. Our findings show that multilingual unsafe and jailbreak prompts substantially degrade the performance of the state-of-the-art LlamaGuard, which experiences a drop in Defense Success Rate (DSR) by 9% and 18%, respectively, compared to its performance on English-only prompts. In contrast, SEALGuard outperforms existing guardrails in detecting multilingual unsafe and jailbreak prompts, improving DSR by 48% over LlamaGuard and achieving the best DSR, precision, and F1-score. Our ablation study further reveals the contributions of adaptation strategies and model size to the overall performance of SEALGuard. We release our pre-trained model and benchmark at https://github.com/awsm-research/SEALGuard to support further research.
Chinese: SEALGuard 是一种多语言防护机制,显著提升了检测多种语言中不安全及越狱提示的能力,相比现有方法如 LlamaGuard,其防御成功率提高了48%,并在精确率和 F1 分数上表现最佳。
English: SEALGuard is a multilingual guardrail that significantly enhances the detection of unsafe and jailbreak prompts across diverse languages, outperforming existing methods like LlamaGuard by improving the Defense Success Rate by 48% and achieving top precision and F1-scores.
Authors:Yaowenqi Liu, BingXu Meng, Rui Pan, Jerry Huang, Tong Zhang
Abstract:
The field of AI research is advancing at an unprecedented pace, enabling automated hypothesis generation and experimental design across diverse domains such as biology, mathematics, and artificial intelligence. Despite these advancements, there remains a significant gap in the availability of scalable advising systems capable of providing high-quality, well-reasoned feedback to refine proposed hypotheses and experimental designs. To address this challenge, we explore key factors that underlie the development of robust advising systems, including model size, context length, confidence estimation, and structured reasoning processes. Our findings reveal that a relatively small model, when equipped with a well-compressed literature database and a structured reasoning framework, can outperform powerful general-purpose language models such as Deepseek-R1 in terms of acceptance rates for self-ranked top-30% submissions to ICLR 2025. Moreover, when limited to high-confidence predictions, our system achieves an acceptance rate exceeding 90% on the ICLR 2025 test set, underscoring its potential to significantly enhance the quality and efficiency of hypothesis generation and experimental design. The code is released at https://github.com/HowardLiu0830/GUIDE-Research-Idea-Evaluation.
Chinese: 人工智能研究飞速发展,但可扩展的假设与实验优化建议系统仍显不足;我们的研究表明,配备压缩文献库和结构化推理的小型模型能超越大型模型,在ICLR 2025的高置信度预测中实现超过90%的接受率。
English: AI research is rapidly advancing, yet scalable advising systems for refining hypotheses and experiments remain limited; our study shows that a compact model with a compressed literature database and structured reasoning can outperform larger models, achieving over 90% acceptance rates on high-confidence predictions for ICLR 2025.
Authors:Yaowenqi Liu, Bingxu Meng, Rui Pan, Yuxing Liu, Jerry Huang, Jiaxuan You, Tong Zhang
Abstract:
The field of AI research is advancing at an unprecedented pace, enabling automated hypothesis generation and experimental design across diverse domains such as biology, mathematics, and artificial intelligence. Despite these advancements, there remains a significant gap in the availability of scalable advising systems capable of providing high-quality, well-reasoned feedback to refine proposed hypotheses and experimental designs. To address this challenge, we explore key factors that underlie the development of robust advising systems, including model size, context length, confidence estimation, and structured reasoning processes. Our findings reveal that a relatively small model, when equipped with a well-compressed literature database and a structured reasoning framework, can outperform powerful general-purpose language models such as Deepseek-R1 in terms of acceptance rates for self-ranked top-30% submissions to ICLR 2025. Moreover, when limited to high-confidence predictions, our system achieves an acceptance rate exceeding 90% on the ICLR 2025 test set, underscoring its potential to significantly enhance the quality and efficiency of hypothesis generation and experimental design. The code is released at https://github.com/HowardLiu0830/GUIDE-Research-Idea-Evaluation.
Chinese: 人工智能研究飞速发展,但可扩展的假设与实验优化建议系统仍显不足;我们的研究表明,配备压缩文献库和结构化推理的小型模型能超越大型模型,在ICLR 2025的高置信度预测中实现超过90%的接受率。
English: AI research is rapidly advancing, yet scalable advising systems for refining hypotheses and experiments remain limited; our study shows that a compact model with a compressed literature database and structured reasoning can outperform larger models, achieving over 90% acceptance rates on high-confidence predictions for ICLR 2025.
Authors:Awais Manzoor, M. Atif Qureshi, Etain Kidney, Luca Longo
Abstract:
Retention campaigns in customer relationship management often rely on churn prediction models evaluated using traditional metrics such as AUC and F1-score. However, these metrics fail to reflect financial outcomes and may mislead strategic decisions. We introduce e-Profits, a novel business-aligned evaluation metric that quantifies model performance based on customer-specific value, retention probability, and intervention costs. Unlike existing profit-based metrics such as Expected Maximum Profit, which assume fixed population-level parameters, e-Profits uses Kaplan-Meier survival analysis to estimate personalised retention rates and supports granular, per customer evaluation. We benchmark six classifiers across two telecom datasets (IBM Telco and Maven Telecom) and demonstrate that e-Profits reshapes model rankings compared to traditional metrics, revealing financial advantages in models previously overlooked by AUC or F1-score. The metric also enables segment-level insight into which models maximise return on investment for high-value customers. e-Profits is designed as an understandable, post hoc tool to support model evaluation in business contexts, particularly for marketing and analytics teams prioritising profit-driven decisions. All source code is available at: https://github.com/matifq/eprofits.
Chinese: 本研究提出了e-Profits这一与业务对齐的评估指标,它基于客户价值和挽留成本评估客户流失预测模型,相比AUC等传统指标能揭示被忽视模型的财务优势。
English: The study introduces e-Profits, a business-aligned evaluation metric that assesses churn prediction models based on customer value and retention costs, outperforming traditional metrics like AUC by revealing financial advantages in overlooked models.
Authors:Zhufeng Lu, Chentao Jia, Ming Hu, Xiaofei Xie, Mingsong Chen
Abstract:
As a promising privacy-aware collaborative model training paradigm, Federated Learning (FL) is becoming popular in the design of distributed recommender systems. However, Federated Recommender Systems (FedRecs) greatly suffer from two major problems: i) extremely high communication overhead due to massive item embeddings involved in recommendation systems, and ii) intolerably low training efficiency caused by the entanglement of both heterogeneous network environments and client devices. Although existing methods attempt to employ various compression techniques to reduce communication overhead, due to the parameter errors introduced by model compression, they inevitably suffer from model performance degradation. To simultaneously address the above problems, this paper presents a communication-efficient FedRec framework named FedRAS, which adopts an action-sharing strategy to cluster the gradients of item embedding into a specific number of model updating actions for communication rather than directly compressing the item embeddings. In this way, the cloud server can use the limited actions from clients to update all the items. Since gradient values are significantly smaller than item embeddings, constraining the directions of gradients (i.e., the action space) introduces smaller errors compared to compressing the entire item embedding matrix into a reduced space. To accommodate heterogeneous devices and network environments, FedRAS incorporates an adaptive clustering mechanism that dynamically adjusts the number of actions. Comprehensive experiments on well-known datasets demonstrate that FedRAS can reduce the size of communication payloads by up to 96.88%, while not sacrificing recommendation performance within various heterogeneous scenarios. We have open-sourced FedRAS at https://github.com/mastlab-T3S/FedRAS.
中文: 联邦推荐系统面临高通信开销和低训练效率的问题,而FedRAS框架通过共享聚类后的梯度动作而非压缩嵌入,在保证性能的同时将通信负载减少高达96.88%。
English: Federated Recommender Systems (FedRecs) face high communication costs and low training efficiency, but the proposed FedRAS framework reduces payload size by up to 96.88% without performance loss by sharing clustered gradient actions instead of compressing embeddings.
Authors:Kun Jing, Luoyu Chen, Jungang Xu, Jianwei Tai, Yiyu Wang, Shuaimin Li
Abstract:
Neural architecture search (NAS) is a promising approach for automatically designing neural network architectures. However, the architecture estimation of NAS is computationally expensive and time-consuming because of training multiple architectures from scratch. Although existing zero-shot NAS methods use training-free proxies to accelerate the architecture estimation, their effectiveness, stability, and generality are still lacking. We present a novel training-free estimation proxy called weighted response correlation (WRCor). WRCor utilizes correlation coefficient matrices of responses across different input samples to calculate the proxy scores of estimated architectures, which can measure their expressivity and generalizability. Experimental results on proxy evaluation demonstrate that WRCor and its voting proxies are more efficient estimation strategies than existing proxies. We also apply them with different search strategies in architecture search. Experimental results on architecture search show that our zero-shot NAS algorithm outperforms most existing NAS algorithms in different search spaces. Our NAS algorithm can discover an architecture with a 22.1% test error on the ImageNet-1k dataset within 4 GPU hours. All codes are publicly available at https://github.com/kunjing96/ZSNAS-WRCor.git.
中文: 本文提出了一种名为加权响应相关性(WRCor)的新型免训练代理方法,用于神经架构搜索,它能有效评估架构的表达能力和泛化性,在代理评估和架构搜索中均优于现有方法,并在极少的GPU时间内于ImageNet-1k数据集上取得了优异结果。
English: The paper introduces a novel training-free proxy called weighted response correlation (WRCor) for neural architecture search, which efficiently estimates architecture expressivity and generalizability, outperforming existing methods in both proxy evaluation and architecture search while achieving competitive results on ImageNet-1k within minimal GPU time.
Authors:Hangjie Yuan, Weihua Chen, Jun Cen, Hu Yu, Jingyun Liang, Shuning Chang, Zhihui Lin, Tao Feng, Pengwei Liu, Jiazheng Xing, Hao Luo, Jiasheng Tang, Fan Wang, Yi Yang
Abstract:
Autoregressive large language models (LLMs) have unified a vast range of language tasks, inspiring preliminary efforts in autoregressive video generation. Existing autoregressive video generators either diverge from standard LLM architectures, depend on bulky external text encoders, or incur prohibitive latency due to next-token decoding. In this paper, we introduce Lumos-1, an autoregressive video generator that retains the LLM architecture with minimal architectural modifications. To inject spatiotemporal correlations in LLMs, we identify the efficacy of incorporating 3D RoPE and diagnose its imbalanced frequency spectrum ranges. Therefore, we propose MM-RoPE, a RoPE scheme that preserves the original textual RoPE while providing comprehensive frequency spectra and scaled 3D positions for modeling multimodal spatiotemporal data. Moreover, Lumos-1 resorts to a token dependency strategy that obeys intra-frame bidirectionality and inter-frame temporal causality. Based on this dependency strategy, we identify the issue of frame-wise loss imbalance caused by spatial information redundancy and solve it by proposing Autoregressive Discrete Diffusion Forcing (AR-DF). AR-DF introduces temporal tube masking during training with a compatible inference-time masking policy to avoid quality degradation. By using memory-efficient training techniques, we pre-train Lumos-1 on only 48 GPUs, achieving performance comparable to EMU3 on GenEval, COSMOS-Video2World on VBench-I2V, and OpenSoraPlan on VBench-T2V. Code and models are available at https://github.com/alibaba-damo-academy/Lumos.
中文摘要:Lumos-1是一种保持标准LLM架构的自回归视频生成器,通过MM-RoPE实现时空建模,并采用AR-DF解决帧间损失不平衡问题,仅用48个GPU进行高效训练即达到与主流模型相当的性能。
English Summary: Lumos-1 is an autoregressive video generator that maintains the standard LLM architecture with minimal changes, incorporating MM-RoPE for spatiotemporal modeling and AR-DF to address frame-wise loss imbalance, achieving competitive performance with efficient training on just 48 GPUs.
Authors:Chenyang Song, Weilin Zhao, Xu Han, Chaojun Xiao, Yingfa Chen, Yuxuan Li, Zhiyuan Liu, Maosong Sun
Abstract:
To alleviate the computational burden of large language models (LLMs), architectures with activation sparsity, represented by mixture-of-experts (MoE), have attracted increasing attention. However, the non-differentiable and inflexible routing of vanilla MoE hurts model performance. Moreover, while each token activates only a few parameters, these sparsely-activated architectures exhibit low chunk-level sparsity, indicating that the union of multiple consecutive tokens activates a large ratio of parameters. Such a sparsity pattern is unfriendly for acceleration under low-resource conditions (e.g., end-side devices) and incompatible with mainstream acceleration techniques (e.g., speculative decoding). To address these challenges, we introduce a novel MoE architecture, BlockFFN, as well as its efficient training and deployment techniques. Specifically, we use a router integrating ReLU activation and RMSNorm for differentiable and flexible routing. Next, to promote both token-level sparsity (TLS) and chunk-level sparsity (CLS), CLS-aware training objectives are designed, making BlockFFN more acceleration-friendly. Finally, we implement efficient acceleration kernels, combining activation sparsity and speculative decoding for the first time. The experimental results demonstrate the superior performance of BlockFFN over other MoE baselines, achieving over 80% TLS and 70% 8-token CLS. Our kernels achieve up to 3.67$\times$ speedup on real end-side devices than dense models. All codes and checkpoints are available publicly (https://github.com/thunlp/BlockFFN).
中文: BlockFFN提出了一种新型MoE架构,通过可微分路由和块级稀疏训练目标,在实现卓越性能的同时具备加速友好特性,其高效内核在终端设备上取得了显著加速效果。
English: BlockFFN introduces a novel MoE architecture with differentiable routing and chunk-level sparsity training objectives, achieving superior performance and acceleration-friendliness while enabling efficient kernel implementation for significant speedup on end-side devices.
Authors:Rei Tamaru, Pei Li, Bin Ran
Abstract:
Digital Twins (DT) have the potential to transform traffic management and operations by creating dynamic, virtual representations of transportation systems that sense conditions, analyze operations, and support decision-making. A key component for DT of the transportation system is dynamic roadway geometry sensing. However, existing approaches often rely on static maps or costly sensors, limiting scalability and adaptability. Additionally, large-scale DTs that collect and analyze data from multiple sources face challenges in privacy, communication, and computational efficiency. To address these challenges, we introduce Geo-ORBIT (Geometrical Operational Roadway Blueprint with Integrated Twin), a unified framework that combines real-time lane detection, DT synchronization, and federated meta-learning. At the core of Geo-ORBIT is GeoLane, a lightweight lane detection model that learns lane geometries from vehicle trajectory data using roadside cameras. We extend this model through Meta-GeoLane, which learns to personalize detection parameters for local entities, and FedMeta-GeoLane, a federated learning strategy that ensures scalable and privacy-preserving adaptation across roadside deployments. Our system is integrated with CARLA and SUMO to create a high-fidelity DT that renders highway scenarios and captures traffic flows in real-time. Extensive experiments across diverse urban scenes show that FedMeta-GeoLane consistently outperforms baseline and meta-learning approaches, achieving lower geometric error and stronger generalization to unseen locations while drastically reducing communication overhead. This work lays the foundation for flexible, context-aware infrastructure modeling in DTs. The framework is publicly available at https://github.com/raynbowy23/FedMeta-GeoLane.git.
中文摘要:Geo-ORBIT框架通过整合实时车道检测、数字孪生同步和联邦元学习,解决了交通数字孪生在可扩展性、隐私保护和计算效率方面的挑战,大量实验证明其性能优于现有方法。
English Summary: The Geo-ORBIT framework introduces a unified approach combining real-time lane detection, digital twin synchronization, and federated meta-learning to overcome scalability, privacy, and efficiency challenges in transportation digital twins, demonstrating superior performance through extensive experiments.
Authors:Tianlong Ai, Tianzhu Liu, Haochen Jiang, Yanfeng Gu
Abstract:
Hierarchical land cover and land use (LCLU) classification aims to assign pixel-wise labels with multiple levels of semantic granularity to remote sensing (RS) imagery. However, existing deep learning-based methods face two major challenges: 1) They predominantly adopt a flat classification paradigm, which limits their ability to generate end-to-end multi-granularity hierarchical predictions aligned with tree-structured hierarchies used in practice. 2) Most cross-domain studies focus on performance degradation caused by sensor or scene variations, with limited attention to transferring LCLU models to cross-domain tasks with heterogeneous hierarchies (e.g., LCLU to crop classification). These limitations hinder the flexibility and generalization of LCLU models in practical applications. To address these challenges, we propose HieraRS, a novel hierarchical interpretation paradigm that enables multi-granularity predictions and supports the efficient transfer of LCLU models to cross-domain tasks with heterogeneous tree-structured hierarchies. We introduce the Bidirectional Hierarchical Consistency Constraint Mechanism (BHCCM), which can be seamlessly integrated into mainstream flat classification models to generate hierarchical predictions, while improving both semantic consistency and classification accuracy. Furthermore, we present TransLU, a dual-branch cross-domain transfer framework comprising two key components: Cross-Domain Knowledge Sharing (CDKS) and Cross-Domain Semantic Alignment (CDSA). TransLU supports dynamic category expansion and facilitates the effective adaptation of LCLU models to heterogeneous hierarchies. In addition, we construct MM-5B, a large-scale multi-modal hierarchical land use dataset featuring pixel-wise annotations. The code and MM-5B dataset will be released at: https://github.com/AI-Tianlong/HieraRS.
中文摘要:该摘要提出了HieraRS这一新颖的层次化解释范式,通过双向层次一致性约束机制和跨域迁移框架,解决了现有遥感影像分类方法在多层次语义粒度预测和跨异构层次结构迁移方面的局限性。
English Summary: The abstract introduces HieraRS, a novel hierarchical interpretation paradigm that addresses limitations in existing deep learning methods for land cover and land use classification by enabling multi-granularity predictions and supporting cross-domain transfer through innovative mechanisms like BHCCM and TransLU.
Authors:Yuqiang Lin, Sam Lockyer, Mingxuan Sui, Li Gan, Florian Stanek, Markus Zarbock, Wenbin Li, Adrian Evans, Nic Zhang
Abstract:
The multi-camera vehicle tracking (MCVT) framework holds significant potential for smart city applications, including anomaly detection, traffic density estimation, and suspect vehicle tracking. However, current publicly available datasets exhibit limitations, such as overly simplistic scenarios, low-resolution footage, and insufficiently diverse conditions, creating a considerable gap between academic research and real-world scenario. To fill this gap, we introduce RoundaboutHD, a comprehensive, high-resolution multi-camera vehicle tracking benchmark dataset specifically designed to represent real-world roundabout scenarios. RoundaboutHD provides a total of 40 minutes of labelled video footage captured by four non-overlapping, high-resolution (4K resolution, 15 fps) cameras. In total, 512 unique vehicle identities are annotated across different camera views, offering rich cross-camera association data. RoundaboutHD offers temporal consistency video footage and enhanced challenges, including increased occlusions and nonlinear movement inside the roundabout. In addition to the full MCVT dataset, several subsets are also available for object detection, single camera tracking, and image-based vehicle re-identification (ReID) tasks. Vehicle model information and camera modelling/ geometry information are also included to support further analysis. We provide baseline results for vehicle detection, single-camera tracking, image-based vehicle re-identification, and multi-camera tracking. The dataset and the evaluation code are publicly available at: https://github.com/siri-rouser/RoundaboutHD.git
中文: RoundaboutHD数据集通过提供真实环岛场景的高分辨率连续视频和全面标注,弥补了现有多摄像头车辆追踪数据集的不足,并为多种追踪任务提供了基准结果。
English: The RoundaboutHD dataset addresses the limitations of existing multi-camera vehicle tracking benchmarks by providing high-resolution, temporally consistent footage from real-world roundabouts, complete with comprehensive annotations and baseline results for various tracking tasks.
Authors:Dominik Schweisgut, Anne Benoit, Yves Robert, Henning Meyerhenke
Abstract:
Large data and computing centers consume a significant share of the world's energy consumption. A prominent subset of the workloads in such centers are workflows with interdependent tasks, usually represented as directed acyclic graphs (DAGs). To reduce the carbon emissions resulting from executing such workflows in centers with a mixed (renewable and non-renewable) energy supply, it is advisable to move task executions to time intervals with sufficient green energy when possible. To this end, we formalize the above problem as a scheduling problem with a given mapping and ordering of the tasks. We show that this problem can be solved in polynomial time in the uniprocessor case. For at least two processors, however, the problem becomes NP-hard. Hence, we propose a heuristic framework called CaWoSched that combines several greedy approaches with local search. To assess the 16 heuristics resulting from different combinations, we also devise a simple baseline algorithm and an exact ILP-based solution. Our experimental results show that our heuristics provide significant savings in carbon emissions compared to the baseline.
中文: 为降低混合能源数据中心碳排放,本研究针对单处理器提出多项式时间调度方案,对多处理器则开发了启发式框架CaWoSched,实验证明其减排效果显著优于基准算法。
English: To reduce carbon emissions from data centers using mixed energy sources, this study introduces a polynomial-time scheduling solution for uniprocessors and an NP-hard heuristic framework, CaWoSched, for multiprocessors, which significantly cuts emissions compared to baselines.
Authors:Kongwu Huang, Shiyi Mu, Jun Jiang, Yuan Gao, Shugong Xu
Abstract:
Scaling laws have achieved success in LLM and foundation models. To explore their potential in ISAC research, we propose Great-X. This single-engine multimodal data twin platform reconstructs the ray-tracing computation of Sionna within Unreal Engine and is deeply integrated with autonomous driving tools. This enables efficient and synchronized simulation of multimodal data, including CSI, RGB, Radar, and LiDAR. Based on this platform, we construct an open-source, large-scale, low-altitude UAV multimodal synaesthesia dataset named Great-MSD, and propose a baseline CSI-based UAV 3D localization algorithm, demonstrating its feasibility and generalizability across different CSI simulation engines. The related code and dataset will be made available at: https://github.com/hkw-xg/Great-MCD.
中文: Great-X 是一个集成光线追踪与自动驾驶工具的多模态数据孪生平台,可高效生成同步仿真数据;基于此构建的Great-MSD数据集与CSI无人机定位算法,验证了跨仿真引擎的可行性与泛化能力。
English: Great-X is a multimodal data twin platform that integrates ray-tracing simulation with autonomous driving tools to efficiently generate synchronized data, while the derived Great-MSD dataset and CSI-based localization algorithm demonstrate cross-engine feasibility for UAV applications.
Authors:Xingguang Ji, Yahui Liu, Qi Wang, Jingyuan Zhang, Yang Yue, Rui Shi, Chenxi Sun, Fuzheng Zhang, Guorui Zhou, Kun Gai
Abstract:
We introduce our Leanabell-Prover-V2, a 7B large language models (LLMs) that can produce formal theorem proofs in Lean 4, with verifier-integrated Long Chain-of-Thoughts (CoT). Following our previous work Leanabell-Prover-V1, we continual to choose to posttrain existing strong prover models for further performance improvement. In our V2 version, we mainly upgrade the Reinforcement Learning (RL) with feedback provided by the Lean 4 verifier. Crucially, verifier feedback, such as indicating success or detailing specific errors, allows the LLM to become ``self-aware'' of the correctness of its own reasoning process and learn to reflexively correct errors. Leanabell-Prover-V2 directly optimizes LLM reasoning trajectories with multi-turn verifier interactions, together with feedback token masking for stable RL training and a simple reward strategy. Experiments show that Leanabell-Prover-V2 improves performance by 3.2% (pass@128) with Kimina-Prover-Preview-Distill-7B and 2.0% (pass@128) with DeepSeek-Prover-V2-7B on the MiniF2F test set. The source codes, curated data and models are available at: https://github.com/Leanabell-LM/Leanabell-Prover-V2.
Chinese: Leanabell-Prover-V2 是一个 70 亿参数的大型语言模型,通过集成 Lean 4 验证器的反馈实现自我纠错和强化学习优化,在 MiniF2F 测试集上最高提升 3.2% 的定理证明性能。
English: Leanabell-Prover-V2 is a 7B LLM that enhances theorem proving in Lean 4 by integrating verifier feedback for self-aware error correction and improved reasoning through upgraded RL training, achieving performance gains of up to 3.2% on the MiniF2F test set.
Authors:Shuang Cui, Jinglin Xu, Yi Li, Xiongxin Tang, Jiangmeng Li, Jiahuan Zhou, Fanjiang Xu, Fuchun Sun, Hui Xiong
Abstract:
Vision-language models (VLMs) such as CLIP achieve strong zero-shot recognition but degrade significantly under \textit{temporally evolving distribution shifts} common in real-world scenarios (e.g., gradual illumination or seasonal changes). Existing continual test-time adaptation (CTTA) methods are typically built around sudden and severe distribution shifts and neglect temporal continuity, leading to three core defects: limited memory cache restricts long-range distribution modeling, causing catastrophic forgetting; entropy-based confidence becomes unreliable under temporal drift, worsening error accumulation; and static visual representations misalign with evolving inputs. We formalize this practical problem as \textit{Continual-Temporal Test-Time Adaptation (CT-TTA)}, where test distributions evolve gradually over time. To address it, we propose \textit{BayesTTA}, a Bayesian adaptation framework that enforces temporally consistent predictions and dynamically aligns visual representations. Specifically, BayesTTA incrementally estimates class-conditional Gaussian mixture distributions without storing raw data, adaptively selects covariance structures through statistical hypothesis testing, and performs calibrated inference using Gaussian discriminant analysis (GDA). These calibrated predictions supervise self-paced adaptation of normalization layers, ensuring efficient and stable representation alignment. We establish a comprehensive CT-TTA benchmark across four temporally evolving datasets and further evaluate generalization on ten standard TTA datasets. Extensive experiments show that BayesTTA consistently outperforms state-of-the-art methods, achieving significant gains while maintaining efficiency. Code is available at \href{https://github.com/cuishuang99/BayesTTA}{https://github.com/cuishuang99/BayesTTA}.
Chinese: 视觉语言模型在处理现实世界中逐渐变化的分布偏移时表现不佳,因此我们提出了BayesTTA这一贝叶斯框架,通过确保时间一致性和动态表征对齐,显著超越了现有方法的性能。
English: Vision-language models struggle with gradual real-world distribution shifts, so we propose BayesTTA, a Bayesian framework that ensures temporal consistency and dynamic representation alignment to significantly outperform existing methods.
Authors:Junyu Chen, Yihua Gao, Mingyong Li
Abstract:
Image-text matching (ITM) aims to address the fundamental challenge of aligning visual and textual modalities, which inherently differ in their representations, continuous, high-dimensional image features vs. discrete, structured text. We propose a novel framework that bridges the modality gap by leveraging multimodal large language models (MLLMs) as visual semantic parsers. By generating rich Visual Semantic Descriptions (VSD), MLLMs provide semantic anchor that facilitate cross-modal alignment. Our approach combines: (1) Instance-level alignment by fusing visual features with VSD to enhance the linguistic expressiveness of image representations, and (2) Prototype-level alignment through VSD clustering to ensure category-level consistency. These modules can be seamlessly integrated into existing ITM models. Extensive experiments on Flickr30K and MSCOCO demonstrate substantial performance improvements. The approach also exhibits remarkable zero-shot generalization to cross-domain tasks, including news and remote sensing ITM. The code and model checkpoints are available at https://github.com/Image-Text-Matching/VSD.
中文摘要:本研究提出了一种新颖框架,利用多模态大语言模型作为视觉语义解析器来弥合图像-文本匹配中的模态差异,在标准基准测试中取得显著性能提升,并展现出跨领域的强大零样本泛化能力。
English Summary: This study introduces a novel framework that uses multimodal large language models as visual semantic parsers to bridge the modality gap in image-text matching, achieving significant performance improvements on standard benchmarks and demonstrating strong zero-shot generalization across domains.
Authors:Enyu Liu, En Yu, Sijia Chen, Wenbing Tao
Abstract:
3D Semantic Scene Completion (SSC) has gained increasing attention due to its pivotal role in 3D perception. Recent advancements have primarily focused on refining voxel-level features to construct 3D scenes. However, treating voxels as the basic interaction units inherently limits the utilization of class-level information, which is proven critical for enhancing the granularity of completion results. To address this, we propose \textbf{D}isentangling Instance and Scene Contexts (DISC), a novel dual-stream paradigm that enhances learning for both instance and scene categories through separated optimization. Specifically, we replace voxel queries with discriminative class queries, which incorporate class-specific geometric and semantic priors. Additionally, we exploit the intrinsic properties of classes to design specialized decoding modules, facilitating targeted interactions and efficient class-level information flow. Experimental results demonstrate that DISC achieves state-of-the-art (SOTA) performance on both SemanticKITTI and SSCBench-KITTI-360 benchmarks, with mIoU scores of 17.35 and 20.55, respectively. Remarkably, DISC even outperforms multi-frame SOTA methods using only single-frame input and significantly improves instance category performance, surpassing both single-frame and multi-frame SOTA instance mIoU by 17.9\% and 11.9\%, respectively, on the SemanticKITTI hidden test. The code is available at https://github.com/Enyu-Liu/DISC.
中文: 提出的DISC方法通过双流范式使用类别查询来分离实例与场景优化,从而增强3D语义场景补全,在多个基准测试中取得最优性能,并在实例类别上实现显著提升。
English: The proposed DISC method introduces a dual-stream paradigm using class queries to enhance 3D semantic scene completion by separating instance and scene optimization, achieving state-of-the-art results on major benchmarks with significant improvements in instance category performance.
Authors:Inye Na, Nejung Rue, Jiwon Chung, Hyunjin Park
Abstract:
Medical image retrieval is a valuable field for supporting clinical decision-making, yet current methods primarily support 2D images and require fully annotated queries, limiting clinical flexibility. To address this, we propose RadiomicsRetrieval, a 3D content-based retrieval framework bridging handcrafted radiomics descriptors with deep learning-based embeddings at the tumor level. Unlike existing 2D approaches, RadiomicsRetrieval fully exploits volumetric data to leverage richer spatial context in medical images. We employ a promptable segmentation model (e.g., SAM) to derive tumor-specific image embeddings, which are aligned with radiomics features extracted from the same tumor via contrastive learning. These representations are further enriched by anatomical positional embedding (APE). As a result, RadiomicsRetrieval enables flexible querying based on shape, location, or partial feature sets. Extensive experiments on both lung CT and brain MRI public datasets demonstrate that radiomics features significantly enhance retrieval specificity, while APE provides global anatomical context essential for location-based searches. Notably, our framework requires only minimal user prompts (e.g., a single point), minimizing segmentation overhead and supporting diverse clinical scenarios. The capability to query using either image embeddings or selected radiomics attributes highlights its adaptability, potentially benefiting diagnosis, treatment planning, and research on large-scale medical imaging repositories. Our code is available at https://github.com/nainye/RadiomicsRetrieval.
中文摘要:RadiomicsRetrieval提出了一种三维医学图像检索框架,通过对比学习将影像组学特征与深度学习嵌入相结合,仅需少量标记即可实现基于肿瘤特征的灵活检索,在肺部和脑部影像数据上显著优于现有二维方法。
English Summary: RadiomicsRetrieval introduces a 3D medical image retrieval framework that combines radiomics features with deep learning embeddings through contrastive learning, enabling flexible tumor-level queries using minimal prompts while outperforming existing 2D methods across lung and brain imaging datasets.
Authors:Shibo Sun, Xue Li, Donglin Di, Mingjie Wei, Lanshun Nie, Wei-Nan Zhang, Dechen Zhan, Yang Song, Lei Fan
Abstract:
While large language models (LLMs) have advanced procedural planning for embodied AI systems through strong reasoning abilities, the integration of multimodal inputs and counterfactual reasoning remains underexplored. To tackle these challenges, we introduce LLaPa, a vision-language model framework designed for multimodal procedural planning. LLaPa generates executable action sequences from textual task descriptions and visual environmental images using vision-language models (VLMs). Furthermore, we enhance LLaPa with two auxiliary modules to improve procedural planning. The first module, the Task-Environment Reranker (TER), leverages task-oriented segmentation to create a task-sensitive feature space, aligning textual descriptions with visual environments and emphasizing critical regions for procedural execution. The second module, the Counterfactual Activities Retriever (CAR), identifies and emphasizes potential counterfactual conditions, enhancing the model's reasoning capability in counterfactual scenarios. Extensive experiments on ActPlan-1K and ALFRED benchmarks demonstrate that LLaPa generates higher-quality plans with superior LCS and correctness, outperforming advanced models. The code and models are available https://github.com/sunshibo1234/LLaPa.
中文: LLaPa是一种视觉语言模型框架,通过整合任务环境对齐和反事实推理来增强多模态程序规划,利用文本和视觉输入生成可执行的动作序列,在基准测试中表现出卓越性能。
English: LLaPa is a vision-language model framework that enhances multimodal procedural planning by integrating task-environment alignment and counterfactual reasoning, achieving superior performance on benchmarks through executable action sequences generated from text and visual inputs.
Authors:Heng Li, Qingcai Chen, Xiangping Wu
Abstract:
Document image dewarping remains a challenging task in the deep learning era. While existing methods have improved by leveraging text line awareness, they typically focus only on a single horizontal dimension. In this paper, we propose a fine-grained deformation perception model that focuses on Dual Dimensions of document horizontal-vertical-lines to improve document Dewarping called D2Dewarp. It can perceive distortion trends in different directions across document details. To combine the horizontal and vertical granularity features, an effective fusion module based on X and Y coordinate is designed to facilitate interaction and constraint between the two dimensions for feature complementarity. Due to the lack of annotated line features in current public dewarping datasets, we also propose an automatic fine-grained annotation method using public document texture images and an automatic rendering engine to build a new large-scale distortion training dataset. The code and dataset will be publicly released. On public Chinese and English benchmarks, both quantitative and qualitative results show that our method achieves better rectification results compared with the state-of-the-art methods. The dataset will be publicly available at https://github.com/xiaomore/DocDewarpHV
中文: 本文提出D2Dewarp双维度文档校正模型,通过基于坐标的融合模块整合水平与垂直线特征,并利用自动标注的新数据集在基准测试中实现了更优的校正效果。
English: This paper introduces D2Dewarp, a dual-dimensional document dewarping model that leverages horizontal and vertical line features through a coordinate-based fusion module, achieving superior rectification results on benchmarks by utilizing a novel automatically annotated dataset.
Authors:David Schlangen, Sherzod Hakimov, Jonathan Jordan, Philipp Sadler
Abstract:
There are currently two main paradigms for evaluating large language models (LLMs), reference-based evaluation and preference-based evaluation. The first, carried over from the evaluation of machine learning models in general, relies on pre-defined task instances, for which reference task executions are available. The second, best exemplified by the LM-arena, relies on (often self-selected) users bringing their own intents to a site that routes these to several models in parallel, among whose responses the user then selects their most preferred one. The former paradigm hence excels at control over what is tested, while the latter comes with higher ecological validity, testing actual use cases interactively. Recently, a third complementary paradigm has emerged that combines some of the strengths of these approaches, offering control over multi-turn, reference-free, repeatable interactions, while stressing goal-directedness: dialogue game based evaluation. While the utility of this approach has been shown by several projects, its adoption has been held back by the lack of a mature, easily re-usable implementation. In this paper, we present clembench, which has been in continuous development since 2023 and has in its latest release been optimized for ease of general use. We describe how it can be used to benchmark one's own models (using a provided set of benchmark game instances in English), as well as how easily the benchmark itself can be extended with new, tailor-made targeted tests.
中文: 本文介绍了clembench这一成熟工具,它通过对话游戏评估法结合了基于参考评估的控制性和基于偏好评估的生态效度,为大语言模型提供可定制、可重复的基准测试框架。
English: The paper introduces clembench, a mature tool for dialogue game-based evaluation of large language models that combines the controlled testing of reference-based methods with the ecological validity of preference-based approaches, enabling customizable and repeatable benchmarking.
Authors:Zhanxin Gao, Beier Zhu, Liang Yao, Jian Yang, Ying Tai
Abstract:
Subject-consistent generation (SCG)-aiming to maintain a consistent subject identity across diverse scenes-remains a challenge for text-to-image (T2I) models. Existing training-free SCG methods often achieve consistency at the cost of layout and pose diversity, hindering expressive visual storytelling. To address the limitation, we propose subject-Consistent and pose-Diverse T2I framework, dubbed as CoDi, that enables consistent subject generation with diverse pose and layout. Motivated by the progressive nature of diffusion, where coarse structures emerge early and fine details are refined later, CoDi adopts a two-stage strategy: Identity Transport (IT) and Identity Refinement (IR). IT operates in the early denoising steps, using optimal transport to transfer identity features to each target image in a pose-aware manner. This promotes subject consistency while preserving pose diversity. IR is applied in the later denoising steps, selecting the most salient identity features to further refine subject details. Extensive qualitative and quantitative results on subject consistency, pose diversity, and prompt fidelity demonstrate that CoDi achieves both better visual perception and stronger performance across all metrics. The code is provided in https://github.com/NJU-PCALab/CoDi.
中文摘要:CoDi框架通过身份特征传输和细节优化两阶段策略,在保持主体一致性的同时实现了姿态与布局的多样性,全面提升了文本到图像生成的视觉效果与性能指标。
English Summary: The proposed CoDi framework overcomes the trade-off between subject consistency and pose diversity in text-to-image generation by implementing a two-stage diffusion process that transfers identity features early and refines details later.
Authors:Anthony Miyaguchi, Imran Afrulbasha, Aleksandar Pramov
Abstract:
Information Retrieval (IR) models are often trained on static datasets, making them vulnerable to performance degradation as web content evolves. The DS@GT competition team participated in the Longitudinal Evaluation of Model Performance (LongEval) lab at CLEF 2025, which evaluates IR systems across temporally distributed web snapshots. Our analysis of the Qwant web dataset includes exploratory data analysis with topic modeling over time. The two-phase retrieval system employs sparse keyword searches, utilizing query expansion and document reranking. Our best system achieves an average NDCG@10 of 0.296 across the entire training and test dataset, with an overall best score of 0.395 on 2023-05. The accompanying source code for this paper is at https://github.com/dsgt-arc/longeval-2025
中文: DS@GT团队开发了包含查询扩展和重排序的两阶段检索系统,以应对网络内容演变导致的性能下降,系统平均NDCG@10达0.296,并在2023年5月取得0.395的最高分。
English: The DS@GT team developed a two-phase retrieval system using query expansion and reranking to address performance degradation in evolving web content, achieving an average NDCG@10 of 0.296 with a peak score of 0.395 in May 2023.
Authors:Shishuai Hu, Zehui Liao, Liangli Zhen, Huazhu Fu, Yong Xia
Abstract:
In-context learning (ICL) is emerging as a promising technique for achieving universal medical image segmentation, where a variety of objects of interest across imaging modalities can be segmented using a single model. Nevertheless, its performance is highly sensitive to the alignment between the query image and in-context image-mask pairs. In a clinical scenario, the scarcity of annotated medical images makes it challenging to select optimal in-context pairs, and fine-tuning foundation ICL models on contextual data is infeasible due to computational costs and the risk of catastrophic forgetting. To address this challenge, we propose Cycle Context Verification (CCV), a novel framework that enhances ICL-based medical image segmentation by enabling self-verification of predictions and accordingly enhancing contextual alignment. Specifically, CCV employs a cyclic pipeline in which the model initially generates a segmentation mask for the query image. Subsequently, the roles of the query and an in-context pair are swapped, allowing the model to validate its prediction by predicting the mask of the original in-context image. The accuracy of this secondary prediction serves as an implicit measure of the initial query segmentation. A query-specific prompt is introduced to alter the query image and updated to improve the measure, thereby enhancing the alignment between the query and in-context pairs. We evaluated CCV on seven medical image segmentation datasets using two ICL foundation models, demonstrating its superiority over existing methods. Our results highlight CCV's ability to enhance ICL-based segmentation, making it a robust solution for universal medical image segmentation. The code will be available at https://github.com/ShishuaiHu/CCV.
中文:提出的循环上下文验证(CCV)框架通过循环角色交换机制实现预测的自我验证,无需微调即可提升上下文对齐效果,在多个医学图像分割数据集上展现出优越性能,增强了基于上下文学习的通用医学图像分割能力。
English: The proposed Cycle Context Verification (CCV) framework enhances in-context learning for universal medical image segmentation by enabling self-verification of predictions through a cyclic role-swapping mechanism, improving contextual alignment without requiring fine-tuning and demonstrating superior performance across multiple datasets.
Authors:Keisuke Ueda, Wataru Hirota, Takuto Asakura, Takahiro Omi, Kosuke Takahashi, Kosuke Arima, Tatsuya Ishigaki
Abstract:
Large language models (LLMs) are increasingly used to support creative tasks such as research idea generation. While recent work has shown that structured dialogues between LLMs can improve the novelty and feasibility of generated ideas, the optimal design of such interactions remains unclear. In this study, we conduct a comprehensive analysis of multi-agent LLM dialogues for scientific ideation. We compare different configurations of agent roles, number of agents, and dialogue depth to understand how these factors influence the novelty and feasibility of generated ideas. Our experimental setup includes settings where one agent generates ideas and another critiques them, enabling iterative improvement. Our results show that enlarging the agent cohort, deepening the interaction depth, and broadening agent persona heterogeneity each enrich the diversity of generated ideas. Moreover, specifically increasing critic-side diversity within the ideation-critique-revision loop further boosts the feasibility of the final proposals. Our findings offer practical guidelines for building effective multi-agent LLM systems for scientific ideation. Our code is available at https://github.com/g6000/MultiAgent-Research-Ideator.
中文: 多智能体大语言模型对话通过增加智能体数量、交互深度和角色多样性来提升科学创意的创新性与可行性,其中批评方多样性能进一步优化最终方案的实用性。
English: Multi-agent LLM dialogues enhance the novelty and feasibility of scientific ideas by increasing agent numbers, interaction depth, and persona diversity, with critic-side diversity particularly improving proposal viability.
Authors:Jihao Gu, Fei Wang, Kun Li, Yanyan Wei, Zhiliang Wu, Dan Guo
Abstract:
In this paper, we present MM-Gesture, the solution developed by our team HFUT-VUT, which ranked 1st in the micro-gesture classification track of the 3rd MiGA Challenge at IJCAI 2025, achieving superior performance compared to previous state-of-the-art methods. MM-Gesture is a multimodal fusion framework designed specifically for recognizing subtle and short-duration micro-gestures (MGs), integrating complementary cues from joint, limb, RGB video, Taylor-series video, optical-flow video, and depth video modalities. Utilizing PoseConv3D and Video Swin Transformer architectures with a novel modality-weighted ensemble strategy, our method further enhances RGB modality performance through transfer learning pre-trained on the larger MA-52 dataset. Extensive experiments on the iMiGUE benchmark, including ablation studies across different modalities, validate the effectiveness of our proposed approach, achieving a top-1 accuracy of 73.213%. Code is available at: https://github.com/momiji-bit/MM-Gesture.
中文总结:我们团队提出的MM-Gesture多模态融合框架在IJCAI 2025微手势识别挑战赛中夺冠,通过整合多种视觉模态在iMiGUE基准上实现了73.213%的准确率。
English Summary: Our team's MM-Gesture multimodal framework achieved first place in the IJCAI 2025 micro-gesture challenge by integrating multiple visual modalities and achieving 73.213% accuracy on the iMiGUE benchmark.
Authors:Jia-Xuan Jiang, Jiashuai Liu, Hongtao Wu, Yifeng Wu, Zhong Wang, Qi Bi, Yefeng Zheng
Abstract:
Deep learning has shown remarkable performance in integrating multimodal data for survival prediction. However, existing multimodal methods mainly focus on single cancer types and overlook the challenge of generalization across cancers. In this work, we are the first to reveal that multimodal prognosis models often generalize worse than unimodal ones in cross-cancer scenarios, despite the critical need for such robustness in clinical practice. To address this, we propose a new task: Cross-Cancer Single Domain Generalization for Multimodal Prognosis, which evaluates whether models trained on a single cancer type can generalize to unseen cancers. We identify two key challenges: degraded features from weaker modalities and ineffective multimodal integration. To tackle these, we introduce two plug-and-play modules: Sparse Dirac Information Rebalancer (SDIR) and Cancer-aware Distribution Entanglement (CADE). SDIR mitigates the dominance of strong features by applying Bernoulli-based sparsification and Dirac-inspired stabilization to enhance weaker modality signals. CADE, designed to synthesize the target domain distribution, fuses local morphological cues and global gene expression in latent space. Experiments on a four-cancer-type benchmark demonstrate superior generalization, laying the foundation for practical, robust cross-cancer multimodal prognosis. Code is available at https://github.com/HopkinsKwong/MCCSDG
中文: 本研究针对多模态生存预测提出跨癌症泛化新任务,通过可插拔模块SDIR和CADE分别实现模态特征重平衡与目标域分布合成,在四类癌症基准测试中展现出卓越的泛化能力,为临床实践奠定坚实基础。
English: This study introduces a novel cross-cancer generalization task for multimodal survival prediction and proposes two plug-and-play modules—SDIR and CADE—that significantly enhance model robustness by rebalancing modality contributions and synthesizing target domain distributions, validated through superior performance on a four-cancer benchmark.
Authors:Jia-Xuan Jiang, Jiashuai Liu, Hongtao Wu, Yifeng Wu, Zhong Wang, Qi Bi, Yefeng Zheng
Abstract:
Deep learning has shown remarkable performance in integrating multimodal data for survival prediction. However, existing multimodal methods mainly focus on single cancer types and overlook the challenge of generalization across cancers. In this work, we are the first to reveal that multimodal prognosis models often generalize worse than unimodal ones in cross-cancer scenarios, despite the critical need for such robustness in clinical practice. To address this, we propose a new task: Cross-Cancer Single Domain Generalization for Multimodal Prognosis, which evaluates whether models trained on a single cancer type can generalize to unseen cancers. We identify two key challenges: degraded features from weaker modalities and ineffective multimodal integration. To tackle these, we introduce two plug-and-play modules: Sparse Dirac Information Rebalancer (SDIR) and Cancer-aware Distribution Entanglement (CADE). SDIR mitigates the dominance of strong features by applying Bernoulli-based sparsification and Dirac-inspired stabilization to enhance weaker modality signals. CADE, designed to synthesize the target domain distribution, fuses local morphological cues and global gene expression in latent space. Experiments on a four-cancer-type benchmark demonstrate superior generalization, laying the foundation for practical, robust cross-cancer multimodal prognosis. Code is available at https://github.com/HopkinsKwong/MCCSDG
中文: 本研究针对多模态生存预测提出跨癌症泛化新任务,通过可插拔模块SDIR和CADE分别实现模态特征重平衡与目标域分布合成,在四类癌症基准测试中展现出卓越的泛化能力,为临床实践奠定坚实基础。
English: This study introduces a novel cross-cancer generalization task for multimodal survival prediction and proposes two plug-and-play modules—SDIR and CADE—that significantly enhance model robustness by rebalancing modality contributions and synthesizing target domain distributions, validated through superior performance on a four-cancer benchmark.
Authors:Hiroshi Yoshihara, Taiki Yamaguchi, Yuichi Inoue
Abstract:
Enhancing the mathematical reasoning of Large Language Models (LLMs) is a pivotal challenge in advancing AI capabilities. While Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are the dominant training paradigms, a systematic methodology for combining them to maximize both accuracy and efficiency remains largely unexplored. This paper introduces a practical and effective training recipe that strategically integrates extended SFT with RL from online inference (GRPO). We posit that these methods play complementary, not competing, roles: a prolonged SFT phase first pushes the model's accuracy to its limits, after which a GRPO phase dramatically improves token efficiency while preserving this peak performance. Our experiments reveal that extending SFT for as many as 10 epochs is crucial for performance breakthroughs, and that the primary role of GRPO in this framework is to optimize solution length. The efficacy of our recipe is rigorously validated through top-tier performance on challenging benchmarks, including a high rank among over 2,200 teams in the strictly leak-free AI Mathematical Olympiad (AIMO). This work provides the community with a battle-tested blueprint for developing state-of-the-art mathematical reasoners that are both exceptionally accurate and practically efficient. To ensure full reproducibility and empower future research, we will open-source our entire framework, including all code, model checkpoints, and training configurations at https://github.com/analokmaus/kaggle-aimo2-fast-math-r1.
中文摘要:本文提出了一种结合延长监督微调与强化学习的混合训练方法,显著提升大语言模型的数学推理能力,在保持顶尖准确率的同时优化计算效率,并在权威基准测试中取得优异表现。
English Summary: This paper introduces a hybrid training method combining extended supervised fine-tuning with reinforcement learning to enhance LLMs' mathematical reasoning, achieving top performance in benchmarks while optimizing efficiency.
Authors:Jason Kahei Tam, Murilo Gustineli, Anthony Miyaguchi
Abstract:
Accurate identification of fungi species presents a unique challenge in computer vision due to fine-grained inter-species variation and high intra-species variation. This paper presents our approach for the FungiCLEF 2025 competition, which focuses on few-shot fine-grained visual categorization (FGVC) using the FungiTastic Few-Shot dataset. Our team (DS@GT) experimented with multiple vision transformer models, data augmentation, weighted sampling, and incorporating textual information. We also explored generative AI models for zero-shot classification using structured prompting but found them to significantly underperform relative to vision-based models. Our final model outperformed both competition baselines and highlighted the effectiveness of domain specific pretraining and balanced sampling strategies. Our approach ranked 35/74 on the private test set in post-completion evaluation, this suggests additional work can be done on metadata selection and domain-adapted multi-modal learning. Our code is available at https://github.com/dsgt-arc/fungiclef-2025.
中文摘要:本文介绍了DS@GT团队在FungiCLEF 2025竞赛中采用视觉Transformer结合数据增强与文本信息的方法,通过领域特定预训练提升了真菌细粒度分类性能,同时指出了在元数据选择和多模态学习方面的改进空间。
English Summary: This paper details DS@GT's FungiCLEF 2025 competition approach using vision transformers with data augmentation and textual information, achieving improved performance through domain-specific pretraining while identifying potential enhancements in metadata selection and multimodal learning.
Authors:Anthony Miyaguchi, Murilo Gustineli, Adrian Cheung
Abstract:
The BirdCLEF+ 2025 challenge requires classifying 206 species, including birds, mammals, insects, and amphibians, from soundscape recordings under a strict 90-minute CPU-only inference deadline, making many state-of-the-art deep learning approaches impractical. To address this constraint, the DS@GT BirdCLEF team explored two strategies. First, we establish competitive baselines by optimizing pre-trained models from the Bioacoustics Model Zoo for CPU inference. Using TFLite, we achieved a nearly 10x inference speedup for the Perch model, enabling it to run in approximately 16 minutes and achieve a final ROC-AUC score of 0.729 on the public leaderboard post-competition and 0.711 on the private leaderboard. The best model from the zoo was BirdSetEfficientNetB1, with a public score of 0.810 and a private score of 0.778. Second, we introduce a novel, lightweight pipeline named Spectrogram Token Skip-Gram (STSG) that treats bioacoustics as a sequence modeling task. This method converts audio into discrete "spectrogram tokens" by clustering Mel-spectrograms using Faiss K-means and then learns high-quality contextual embeddings for these tokens in an unsupervised manner with a Word2Vec skip-gram model. For classification, embeddings within a 5-second window are averaged and passed to a linear model. With a projected inference time of 6 minutes for a 700-minute test set, the STSG approach achieved a final ROC-AUC public score of 0.559 and a private score of 0.520, demonstrating the viability of fast tokenization approaches with static embeddings for bioacoustic classification. Supporting code for this paper can be found at https://github.com/dsgt-arc/birdclef-2025.
中文:BirdCLEF+ 2025挑战赛要求在90分钟CPU推理限制下对206个物种进行分类,DS@GT团队为此优化了预训练生物声学模型的速度,并提出了将生物声学视为序列建模任务的创新谱图标记跳元模型方法。
English: The BirdCLEF+ 2025 challenge required classifying 206 species under a strict 90-minute CPU inference constraint, prompting the DS@GT team to optimize pre-trained bioacoustic models for speed and introduce a novel Spectrogram Token Skip-Gram pipeline that treats bioacoustics as a sequence modeling task.
Authors:Chan Young Park, Jillian Fisher, Marius Memmel, Dipika Khullar, Seoho Yun, Abhishek Gupta, Yejin Choi
Abstract:
Large language models (LLMs) have shown promise in robotic procedural planning, yet their human-centric reasoning often omits the low-level, grounded details needed for robotic execution. Vision-language models (VLMs) offer a path toward more perceptually grounded plans, but current methods either rely on expensive, large-scale models or are constrained to narrow simulation settings. We introduce SelfReVision, a lightweight and scalable self-improvement framework for vision-language procedural planning. SelfReVision enables small VLMs to iteratively critique, revise, and verify their own plans-without external supervision or teacher models-drawing inspiration from chain-of-thought prompting and self-instruct paradigms. Through this self-distillation loop, models generate higher-quality, execution-ready plans that can be used both at inference and for continued fine-tuning. Using models varying from 3B to 72B, our results show that SelfReVision not only boosts performance over weak base VLMs but also outperforms models 100X the size, yielding improved control in downstream embodied tasks.
中文摘要:SelfReVision是一种自改进框架,能让小型视觉语言模型通过迭代式自我评估和修订,自主完善机器人程序规划,无需外部监督即可实现超越超大规模模型的性能表现。
English Summary: SelfReVision is a self-improving framework that enables small vision-language models to autonomously refine their robotic procedural plans through iterative critique and revision, achieving superior performance over much larger models without external supervision.
Authors:Duygu Nur Yaldiz, Yavuz Faruk Bakman, Sungmin Kang, Alperen ÃziÅ, Hayrettin Eren Yildiz, Mitash Ashish Shah, Zhiqi Huang, Anoop Kumar, Alfy Samuel, Daben Liu, Sai Praneeth Karimireddy, Salman Avestimehr
Abstract:
Generative Large Language Models (LLMs)inevitably produce untruthful responses. Accurately predicting the truthfulness of these outputs is critical, especially in high-stakes settings. To accelerate research in this domain and make truthfulness prediction methods more accessible, we introduce TruthTorchLM an open-source, comprehensive Python library featuring over 30 truthfulness prediction methods, which we refer to as Truth Methods. Unlike existing toolkits such as Guardrails, which focus solely on document-grounded verification, or LM-Polygraph, which is limited to uncertainty-based methods, TruthTorchLM offers a broad and extensible collection of techniques. These methods span diverse tradeoffs in computational cost, access level (e.g., black-box vs white-box), grounding document requirements, and supervision type (self-supervised or supervised). TruthTorchLM is seamlessly compatible with both HuggingFace and LiteLLM, enabling support for locally hosted and API-based models. It also provides a unified interface for generation, evaluation, calibration, and long-form truthfulness prediction, along with a flexible framework for extending the library with new methods. We conduct an evaluation of representative truth methods on three datasets, TriviaQA, GSM8K, and FactScore-Bio. The code is available at https://github.com/Ybakman/TruthTorchLM
Chinese: 为解决生成式大语言模型产生不实回答的问题,TruthTorchLM作为一个开源Python库被推出,提供30多种多样化的真实性预测方法,并在TriviaQA和GSM8K等数据集上进行了评估。
English: To address the challenge of untruthful responses from generative large language models, TruthTorchLM is introduced as an open-source Python library offering over 30 diverse truthfulness prediction methods, with evaluations conducted on datasets like TriviaQA and GSM8K.
Authors:Xiwen Chen, Peijie Qiu, Wenhui Zhu, Hao Wang, Huayu Li, Xuanzhao Dong, Xiaotong Sun, Xiaobing Yu, Yalin Wang, Abolfazl Razi, Aristeidis Sotiras
Abstract:
While multiple instance learning (MIL) has shown to be a promising approach for histopathological whole slide image (WSI) analysis, its reliance on permutation invariance significantly limits its capacity to effectively uncover semantic correlations between instances within WSIs. Based on our empirical and theoretical investigations, we argue that approaches that are not permutation-invariant but better capture spatial correlations between instances can offer more effective solutions. In light of these findings, we propose a novel alternative to existing MIL for WSI analysis by learning to restore the order of instances from their randomly shuffled arrangement. We term this task as cracking an instance jigsaw puzzle problem, where semantic correlations between instances are uncovered. To tackle the instance jigsaw puzzles, we propose a novel Siamese network solution, which is theoretically justified by optimal transport theory. We validate the proposed method on WSI classification and survival prediction tasks, where the proposed method outperforms the recent state-of-the-art MIL competitors. The code is available at https://github.com/xiwenc1/MIL-JigsawPuzzles.
中文: 本研究提出了一种新颖的多示例学习方法,通过解决实例拼图难题来更好地捕捉组织病理学全切片图像中的空间相关性,在分类和生存预测任务中优于现有最先进方法。
English: This study introduces a novel multiple instance learning approach for histopathological whole slide image analysis by solving instance jigsaw puzzles to better capture spatial correlations, which outperforms current state-of-the-art methods in classification and survival prediction tasks.
Authors:Pinaki Prasad Guha Neogi, Ahmad Mohammadshirazi, Rajiv Ramnath
Abstract:
Traffic accidents are rare, yet high-impact events that require long-context multimodal reasoning for accurate risk forecasting. In this paper, we introduce ALCo-FM, a unified adaptive long-context foundation model that computes a volatility pre-score to dynamically select context windows for input data and encodes and fuses these multimodal data via shallow cross attention. Following a local GAT layer and a BigBird-style sparse global transformer over H3 hexagonal grids, coupled with Monte Carlo dropout for confidence, the model yields superior, well-calibrated predictions. Trained on data from 15 US cities with a class-weighted loss to counter label imbalance, and fine-tuned with minimal data on held-out cities, ALCo-FM achieves 0.94 accuracy, 0.92 F1, and an ECE of 0.04, outperforming more than 20 state-of-the-art baselines in large-scale urban risk prediction. Code and dataset are available at: https://github.com/PinakiPrasad12/ALCo-FM
中文:ALCo-FM是一种自适应长上下文基础模型,能动态选择多模态数据并通过交叉注意力融合,以最少微调在城市风险预测中实现卓越准确性和校准度。
English: ALCo-FM is an adaptive long-context foundation model that dynamically selects multimodal data and fuses them through cross attention, achieving superior accuracy and calibration in urban risk prediction with minimal fine-tuning.
Authors:Ilia Azizi, Juraj Bodik, Jakob Heiss, Bin Yu
Abstract:
Accurate uncertainty quantification is critical for reliable predictive modeling, especially in regression tasks. Existing methods typically address either aleatoric uncertainty from measurement noise or epistemic uncertainty from limited data, but not necessarily both in a balanced way. We propose CLEAR, a calibration method with two distinct parameters, $γ_1$ and $γ_2$, to combine the two uncertainty components for improved conditional coverage. CLEAR is compatible with any pair of aleatoric and epistemic estimators; we show how it can be used with (i) quantile regression for aleatoric uncertainty and (ii) ensembles drawn from the Predictability-Computability-Stability (PCS) framework for epistemic uncertainty. Across 17 diverse real-world datasets, CLEAR achieves an average improvement of 28.2% and 17.4% in the interval width compared to the two individually calibrated baselines while maintaining nominal coverage. This improvement can be particularly evident in scenarios dominated by either high epistemic or high aleatoric uncertainty.
现有方法通常单独处理偶然或认知不确定性,而CLEAR提出了一种双参数校准技术,有效结合两者,在多种数据集中显著提升了预测区间的覆盖范围和宽度。
Existing methods often handle either aleatoric or epistemic uncertainty separately, but CLEAR introduces a dual-parameter calibration technique to effectively combine both, significantly enhancing predictive interval coverage and width across diverse datasets.
Authors:Ilia Azizi, Juraj Bodik, Jakob Heiss, Bin Yu
Abstract:
Existing methods typically address either aleatoric uncertainty due to measurement noise or epistemic uncertainty resulting from limited data, but not both in a balanced manner. We propose CLEAR, a calibration method with two distinct parameters, $γ_1$ and $γ_2$, to combine the two uncertainty components and improve the conditional coverage of predictive intervals for regression tasks. CLEAR is compatible with any pair of aleatoric and epistemic estimators; we show how it can be used with (i) quantile regression for aleatoric uncertainty and (ii) ensembles drawn from the Predictability-Computability-Stability (PCS) framework for epistemic uncertainty. Across 17 diverse real-world datasets, CLEAR achieves an average improvement of 28.2% and 17.4% in the interval width compared to the two individually calibrated baselines while maintaining nominal coverage. Similar improvements are observed when applying CLEAR to Deep Ensembles (epistemic) and Simultaneous Quantile Regression (aleatoric). The benefits are especially evident in scenarios dominated by high aleatoric or epistemic uncertainty.
现有方法通常单独处理偶然或认知不确定性,而CLEAR提出了一种双参数校准技术,有效结合两者,在多种数据集中显著提升了预测区间的覆盖范围和宽度。
Existing methods often handle either aleatoric or epistemic uncertainty separately, but CLEAR introduces a dual-parameter calibration technique to effectively combine both, significantly enhancing predictive interval coverage and width across diverse datasets.
Authors:Pouria Mahdavinia, Mehrdad Mahdavi
Abstract:
Fine-tuning large foundation models presents significant memory challenges due to stateful optimizers like AdamW, often requiring several times more GPU memory than inference. While memory-efficient methods like parameter-efficient fine-tuning (e.g., LoRA) and optimizer state compression exist, recent approaches like GaLore bridge these by using low-rank gradient projections and subspace moment accumulation. However, such methods may struggle with fixed subspaces or computationally costly offline resampling (e.g., requiring full-matrix SVDs). We propose Momentum Factorized SGD (MoFaSGD), which maintains a dynamically updated low-rank SVD representation of the first-order momentum, closely approximating its full-rank counterpart throughout training. This factorization enables a memory-efficient fine-tuning method that adaptively updates the optimization subspace at each iteration. Crucially, MoFaSGD leverages the computed low-rank momentum factors to perform efficient spectrally normalized updates, offering an alternative to subspace moment accumulation. We establish theoretical convergence guarantees for MoFaSGD, proving it achieves an optimal rate for non-convex stochastic optimization under standard assumptions. Empirically, we demonstrate MoFaSGD's effectiveness on large language model alignment benchmarks, achieving a competitive trade-off between memory reduction (comparable to LoRA) and performance compared to state-of-the-art low-rank optimization methods. Our implementation is available at https://github.com/pmahdavi/MoFaSGD.
Chinese: MoFaSGD通过动态更新动量的低秩SVD表示,提出了一种内存高效的微调方法,在保持与LoRA相当的内存缩减同时,实现了最优收敛和具有竞争力的性能表现。
English: MoFaSGD introduces a memory-efficient fine-tuning method by dynamically updating a low-rank SVD representation of momentum, achieving optimal convergence and competitive performance with reduced memory usage comparable to LoRA.
Authors:Evgenii Rudakov, Jonathan Shock, Otto Lappi, Benjamin Ultan Cowley
Abstract:
This paper introduces a SSSUMO, semi-supervised deep learning approach for submovement decomposition that achieves state-of-the-art accuracy and speed. While submovement analysis offers valuable insights into motor control, existing methods struggle with reconstruction accuracy, computational cost, and validation, due to the difficulty of obtaining hand-labeled data. We address these challenges using a semi-supervised learning framework. This framework learns from synthetic data, initially generated from minimum-jerk principles and then iteratively refined through adaptation to unlabeled human movement data. Our fully convolutional architecture with differentiable reconstruction significantly surpasses existing methods on both synthetic and diverse human motion datasets, demonstrating robustness even in high-noise conditions. Crucially, the model operates in real-time (less than a millisecond per input second), a substantial improvement over optimization-based techniques. This enhanced performance facilitates new applications in human-computer interaction, rehabilitation medicine, and motor control studies. We demonstrate the model's effectiveness across diverse human-performed tasks such as steering, rotation, pointing, object moving, handwriting, and mouse-controlled gaming, showing notable improvements particularly on challenging datasets where traditional methods largely fail. Training and benchmarking source code, along with pre-trained model weights, are made publicly available at https://github.com/dolphin-in-a-coma/sssumo.
中文: 本文提出的SSSUMO是一种半监督深度学习框架,用于子运动分解,实现了顶尖的精度和实时处理能力,显著提升了人机交互和运动控制研究中的应用潜力。
English: This paper presents SSSUMO, a semi-supervised deep learning method for submovement decomposition that achieves top-tier accuracy and real-time processing, enhancing applications in human-computer interaction and motor control studies.
Authors:Aldan Creo, Raul Castro Fernandez, Manuel Cebrian
Abstract:
As large language models (LLMs) become increasingly deployed, understanding the complexity and evolution of jailbreaking strategies is critical for AI safety.
We present a mass-scale empirical analysis of jailbreak complexity across over 2 million real-world conversations from diverse platforms, including dedicated jailbreaking communities and general-purpose chatbots. Using a range of complexity metrics spanning probabilistic measures, lexical diversity, compression ratios, and cognitive load indicators, we find that jailbreak attempts do not exhibit significantly higher complexity than normal conversations. This pattern holds consistently across specialized jailbreaking communities and general user populations, suggesting practical bounds on attack sophistication. Temporal analysis reveals that while user attack toxicity and complexity remains stable over time, assistant response toxicity has decreased, indicating improving safety mechanisms. The absence of power-law scaling in complexity distributions further points to natural limits on jailbreak development.
Our findings challenge the prevailing narrative of an escalating arms race between attackers and defenders, instead suggesting that LLM safety evolution is bounded by human ingenuity constraints while defensive measures continue advancing. Our results highlight critical information hazards in academic jailbreak disclosure, as sophisticated attacks exceeding current complexity baselines could disrupt the observed equilibrium and enable widespread harm before defensive adaptation.
中文摘要:研究发现越狱攻击的复杂度并未超出正常对话,攻击模式稳定而AI防御持续提升,这对“攻防军备竞赛升级”的普遍认知提出挑战,同时警示公开复杂攻击方法可能打破现有平衡。
English Summary: The study finds jailbreak attempts show no greater complexity than normal conversations, with stable attack patterns and improving AI defenses, challenging the notion of an escalating arms race while warning against disclosing sophisticated methods that could disrupt this equilibrium.
Authors:Helen Qu, Sang Michael Xie
Abstract:
CLIP and large multimodal models (LMMs) have better accuracy on examples involving concepts that are highly represented in the training data. However, the role of concept combinations in the training data on compositional generalization is largely unclear -- for instance, how does accuracy vary when a common object appears in an uncommon pairing with another object? In this paper, we investigate how word co-occurrence statistics in the pretraining dataset (a proxy for co-occurrence of visual concepts) impacts CLIP/LMM performance. To disentangle the effects of word co-occurrence frequencies from single-word frequencies, we measure co-occurrence with pointwise mutual information (PMI), which normalizes the joint probability of two words co-occurring by the probability of co-occurring independently. Using synthetically generated images with a variety of concept pairs, we show a strong correlation between PMI in the CLIP pretraining data and zero-shot accuracy in CLIP models trained on LAION-400M (r=0.97 and 14% accuracy gap between images in the top and bottom 5% of PMI values), demonstrating that even accuracy on common concepts is affected by the combination of concepts in the image. Leveraging this finding, we reproduce this effect in natural images by editing them to contain pairs with varying PMI, resulting in a correlation of r=0.75. Finally, we demonstrate that this behavior in CLIP transfers to LMMs built on top of CLIP (r=0.70 for TextVQA, r=0.62 for VQAv2). Our findings highlight the need for algorithms and architectures that improve compositional generalization in multimodal models without scaling the training data combinatorially. Our code is available at https://github.com/helenqu/multimodal-pretraining-pmi.
Chinese: 本研究发现,CLIP及大型多模态模型的准确性受训练数据中概念共现频率的显著影响,通过点间互信息衡量,即使常见物体在异常配对时也会影响模型表现。
English: This study reveals that CLIP and large multimodal models' accuracy is strongly influenced by the co-occurrence frequency of concepts in their training data, as measured by pointwise mutual information, which affects performance even on common objects when paired unusually.
Authors:Haochen Wang, Xiangtai Li, Zilong Huang, Anran Wang, Jiacong Wang, Tao Zhang, Jiani Zheng, Sule Bai, Zijian Kang, Jiashi Feng, Zhuochen Wang, Zhaoxiang Zhang
Abstract:
Models like OpenAI-o3 pioneer visual grounded reasoning by dynamically referencing visual regions, just like human "thinking with images". However, no benchmark exists to evaluate these capabilities holistically. To bridge this gap, we propose TreeBench (Traceable Evidence Evaluation Benchmark), a diagnostic benchmark built on three principles: (1) focused visual perception of subtle targets in complex scenes, (2) traceable evidence via bounding box evaluation, and (3) second-order reasoning to test object interactions and spatial hierarchies beyond simple object localization. Prioritizing images with dense objects, we initially sample 1K high-quality images from SA-1B, and incorporate eight LMM experts to manually annotate questions, candidate options, and answers for each image. After three stages of quality control, TreeBench consists of 405 challenging visual question-answering pairs, even the most advanced models struggle with this benchmark, where none of them reach 60% accuracy, e.g., OpenAI-o3 scores only 54.87. Furthermore, we introduce TreeVGR (Traceable Evidence Enhanced Visual Grounded Reasoning), a training paradigm to supervise localization and reasoning jointly with reinforcement learning, enabling accurate localizations and explainable reasoning pathways. Initialized from Qwen2.5-VL-7B, it improves V* Bench (+16.8), MME-RealWorld (+12.6), and TreeBench (+13.4), proving traceability is key to advancing vision-grounded reasoning. The code is available at https://github.com/Haochen-Wang409/TreeVGR.
中文摘要:TreeBench是一个基于可追溯证据和复杂推理的诊断性基准,用于全面评估视觉接地推理能力,即使先进模型如OpenAI-o3在其挑战性任务中表现不佳,而提出的TreeVGR训练范式通过联合监督定位与推理,显著提升了多项基准的性能。
English Summary: TreeBench is a diagnostic benchmark designed to evaluate visual grounded reasoning by focusing on traceable evidence and complex reasoning, revealing that even advanced models like OpenAI-o3 struggle with its challenging tasks, while the proposed TreeVGR training paradigm significantly improves performance by integrating localization and reasoning.
Authors:Mingkai Jia, Wei Yin, Xiaotao Hu, Jiaxin Guo, Xiaoyang Guo, Qian Zhang, Xiao-Xiao Long, Ping Tan
Abstract:
Vector Quantized Variational Autoencoders (VQ-VAEs) are fundamental models that compress continuous visual data into discrete tokens. Existing methods have tried to improve the quantization strategy for better reconstruction quality, however, there still exists a large gap between VQ-VAEs and VAEs. To narrow this gap, we propose MGVQ, a novel method to augment the representation capability of discrete codebooks, facilitating easier optimization for codebooks and minimizing information loss, thereby enhancing reconstruction quality. Specifically, we propose to retain the latent dimension to preserve encoded features and incorporate a set of sub-codebooks for quantization. Furthermore, we construct comprehensive zero-shot benchmarks featuring resolutions of 512p and 2k to evaluate the reconstruction performance of existing methods rigorously. MGVQ achieves the state-of-the-art performance on both ImageNet and 8 zero-shot benchmarks across all VQ-VAEs. Notably, compared with SD-VAE, we outperform them on ImageNet significantly, with rFID 0.49 v.s. 0.91, and achieve superior PSNR on all zero-shot benchmarks. These results highlight the superiority of MGVQ in reconstruction and pave the way for preserving fidelity in HD image processing tasks. Code will be publicly available at https://github.com/MKJia/MGVQ.
中文: MGVQ通过保留潜在维度和引入多子码本增强离散码本表示能力,显著减少信息损失,在ImageNet和零样本基准测试中实现了最优重建性能。
English: MGVQ enhances VQ-VAE performance by preserving latent dimensions and using multiple sub-codebooks to reduce information loss, achieving state-of-the-art reconstruction quality on ImageNet and zero-shot benchmarks.
Authors:Shivam Duggal, Sanghyun Byun, William T. Freeman, Antonio Torralba, Phillip Isola
Abstract:
According to Algorithmic Information Theory (AIT) -- Intelligent representations compress data into the shortest possible program that can reconstruct its content, exhibiting low Kolmogorov Complexity (KC). In contrast, most visual representation learning systems use fixed-length representations for all inputs, ignoring variations in complexity or familiarity. Recent adaptive tokenization methods address this by allocating variable-length representations but typically require test-time search over multiple encodings to find the most predictive one. Inspired by Kolmogorov Complexity principles, we propose a single-pass adaptive tokenizer, KARL, which predicts the appropriate number of tokens for an image in a single forward pass, halting once its approximate KC is reached. The token count serves as a proxy for the minimum description length. KARL's training procedure closely resembles the Upside-Down Reinforcement Learning paradigm, as it learns to conditionally predict token halting based on a desired reconstruction quality. KARL matches the performance of recent adaptive tokenizers while operating in a single pass. We present scaling laws for KARL, analyzing the role of encoder/decoder size, continuous vs. discrete tokenization and more. Additionally, we offer a conceptual study drawing an analogy between Adaptive Image Tokenization and Algorithmic Information Theory, examining the predicted image complexity (KC) across axes such as structure vs. noise and in- vs. out-of-distribution familiarity -- revealing alignment with human intuition.
中文:受算法信息理论启发,KARL是一种单次前向处理的自适应分词器,能根据图像的近似柯氏复杂度动态预测最佳分词数量,在保持与多轮处理模型相当性能的同时,其复杂度预测结果与人类直觉高度吻合。
English: Inspired by Algorithmic Information Theory, KARL is a single-pass adaptive tokenizer that dynamically predicts the optimal number of tokens for images based on their approximate Kolmogorov complexity, matching the performance of multi-pass methods while aligning predicted complexity with human intuition.
Authors:Sizhe Chen, Yizhu Wang, Nicholas Carlini, Chawin Sitawarin, David Wagner
Abstract:
When large language model (LLM) systems interact with external data to perform complex tasks, a new attack, namely prompt injection, becomes a significant threat. By injecting instructions into the data accessed by the system, the attacker is able to override the initial user task with an arbitrary task directed by the attacker. To secure the system, test-time defenses, e.g., defensive prompting, have been proposed for system developers to attain security only when needed in a flexible manner. However, they are much less effective than training-time defenses that change the model parameters. Motivated by this, we propose DefensiveToken, a test-time defense with prompt injection robustness comparable to training-time alternatives. DefensiveTokens are newly inserted as special tokens, whose embeddings are optimized for security. In security-sensitive cases, system developers can append a few DefensiveTokens before the LLM input to achieve security with a minimal utility drop. In scenarios where security is less of a concern, developers can simply skip DefensiveTokens; the LLM system remains the same as there is no defense, generating high-quality responses. Thus, DefensiveTokens, if released alongside the model, allow a flexible switch between the state-of-the-art (SOTA) utility and almost-SOTA security at test time. The code is available at https://github.com/Sizhe-Chen/DefensiveToken.
中文: DefensiveToken是一种测试时防御方法,通过插入特殊优化的令牌实现与训练时防御相当的抗提示注入能力,可根据需要在最佳效用和增强安全性之间灵活切换。
English: DefensiveToken is a test-time defense method that uses specially optimized tokens to provide prompt injection robustness comparable to training-time defenses, allowing flexible switching between maximum utility and enhanced security as needed.
Authors:Karthik Garimella, Austin Ebel, Brandon Reagen
Abstract:
Fully Homomorphic Encryption (FHE) is an encryption scheme that allows for computation to be performed directly on encrypted data, effectively closing the loop on secure and outsourced computing. Data is encrypted not only during rest and transit, but also during processing. However, FHE provides a limited instruction set: SIMD addition, SIMD multiplication, and cyclic rotation of 1-D vectors. This restriction makes performing multi-dimensional tensor operations challenging. Practitioners must pack these tensors into 1-D vectors and map tensor operations onto this one-dimensional layout rather than their traditional nested structure. And while prior systems have made significant strides in automating this process, they often hide critical packing decisions behind layers of abstraction, making debugging, optimizing, and building on top of these systems difficult.
In this work, we approach multi-dimensional tensor operations in FHE through Einstein summation (einsum) notation. Einsum notation explicitly encodes dimensional structure and operations in its syntax, naturally exposing how tensors should be packed and transformed. We decompose einsum expressions into a fixed set of FHE-friendly operations. We implement our design and present EinHops, a minimalist system that factors einsum expressions into a fixed sequence of FHE operations. EinHops enables developers to perform encrypted tensor operations using FHE while maintaining full visibility into the underlying packing strategy. We evaluate EinHops on a range of tensor operations from a simple transpose to complex multi-dimensional contractions. We show that the explicit nature of einsum notation allows us to build an FHE tensor system that is simple, general, and interpretable. We open-source EinHops at the following repository: https://github.com/baahl-nyu/einhops.
Chinese: 全同态加密(FHE)允许在加密数据上直接计算,但仅限于一维操作,因此作者开发了EinHops系统,利用爱因斯坦求和符号来简化和明确FHE中的多维张量操作。
English: Fully Homomorphic Encryption (FHE) enables computation on encrypted data but is limited to one-dimensional operations, so the authors introduce EinHops, a system using Einstein summation notation to simplify and clarify multi-dimensional tensor operations in FHE.
Authors:Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, Sifei Liu, Hongxu Yin, Yao Lu, Song Han
Abstract:
We introduce a full-stack framework that scales up reasoning in vision-language models (VLMs) to long videos, leveraging reinforcement learning. We address the unique challenges of long video reasoning by integrating three critical components: (1) a large-scale dataset, LongVideo-Reason, comprising 104K long video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs; (2) a two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) a training infrastructure for long video RL, named Multi-modal Reinforcement Sequence Parallelism (MR-SP), which incorporates sequence parallelism and a vLLM-based engine tailored for long video, using cached video embeddings for efficient rollout and prefilling. In our experiments, LongVILA-R1-7B achieves strong performance on video benchmarks, reaching 65.1% and 71.1% accuracy on VideoMME without and with subtitles, respectively, and consistently outperforming LongVILA-7B across multiple benchmarks. Moreover, LongVILA-R1-7B supports processing up to 8,192 video frames per video, and configurable FPS settings. Notably, our MR-SP system achieves up to 2.1x speedup on long video RL training. In addition, we release our training system for public availability that supports RL training on various modalities (video, text, and audio), various models (VILA and Qwen series), and even image and video generation models. On a single A100 node (8 GPUs), it supports RL training on hour-long videos (e.g., 3,600 frames).
中文: 本文提出了一种全栈框架,通过强化学习提升视觉语言模型在长视频中的推理能力,包含专用数据集、两阶段训练流程和高效基础设施,实现了卓越性能并最高提速2.1倍。
English: This paper introduces a full-stack framework that enhances vision-language models' reasoning for long videos through reinforcement learning, featuring a specialized dataset, two-stage training pipeline, and efficient infrastructure, achieving superior performance and up to 2.1x training speedup.
Authors:Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, Sifei Liu, Hongxu Yin, Yao Lu, Song Han
Abstract:
We introduce a full-stack framework that scales up reasoning in vision-language models (VLMs) to long videos, leveraging reinforcement learning. We address the unique challenges of long video reasoning by integrating three critical components: (1) a large-scale dataset, LongVideo-Reason, comprising 104K long video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs; (2) a two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) a training infrastructure for long video RL, named Multi-modal Reinforcement Sequence Parallelism (MR-SP), which incorporates sequence parallelism and a vLLM-based engine tailored for long video, using cached video embeddings for efficient rollout and prefilling. In our experiments, LongVILA-R1-7B achieves strong performance on video benchmarks, reaching 65.1% and 71.1% accuracy on VideoMME without and with subtitles, respectively, and consistently outperforming LongVILA-7B across multiple benchmarks. Moreover, LongVILA-R1-7B supports processing up to 8,192 video frames per video, and configurable FPS settings. Notably, our MR-SP system achieves up to 2.1x speedup on long video RL training. In addition, we release our training system for public availability that supports RL training on various modalities (video, text, and audio), various models (VILA and Qwen series), and even image and video generation models. On a single A100 node (8 GPUs), it supports RL training on hour-long videos (e.g., 3,600 frames).
中文: 本文提出了一种全栈框架,通过强化学习提升视觉语言模型在长视频中的推理能力,包含专用数据集、两阶段训练流程和高效基础设施,实现了卓越性能并最高提速2.1倍。
English: This paper introduces a full-stack framework that enhances vision-language models' reasoning for long videos through reinforcement learning, featuring a specialized dataset, two-stage training pipeline, and efficient infrastructure, achieving superior performance and up to 2.1x training speedup.
Authors:Yuxin Bai, Cecelia Shuai, Ashwin De Silva, Siyu Yu, Pratik Chaudhari, Joshua T. Vogelstein
Abstract:
In most real-world applications of artificial intelligence, the distributions of the data and the goals of the learners tend to change over time. The Probably Approximately Correct (PAC) learning framework, which underpins most machine learning algorithms, fails to account for dynamic data distributions and evolving objectives, often resulting in suboptimal performance. Prospective learning is a recently introduced mathematical framework that overcomes some of these limitations. We build on this framework to present preliminary results that improve the algorithm and numerical results, and extend prospective learning to sequential decision-making scenarios, specifically foraging. Code is available at: https://github.com/neurodata/prolearn2.
中文摘要:本研究改进了前瞻性学习框架,以应对人工智能中动态数据和不断变化的目标,优化了算法并将其扩展到如觅食等顺序决策场景中。
English Summary: The study enhances the prospective learning framework to address dynamic data and evolving goals in AI, improving algorithms and extending its application to sequential decision-making like foraging.
Authors:Sizhen Bian, Mengxi Liu, Vitor Fortes Rey, Daniel Geissler, Paul Lukowicz
Abstract:
Human Activity Recognition (HAR) on resource-constrained wearable devices demands inference models that harmonize accuracy with computational efficiency. This paper introduces TinierHAR, an ultra-lightweight deep learning architecture that synergizes residual depthwise separable convolutions, gated recurrent units (GRUs), and temporal aggregation to achieve SOTA efficiency without compromising performance. Evaluated across 14 public HAR datasets, TinierHAR reduces Parameters by 2.7x (vs. TinyHAR) and 43.3x (vs. DeepConvLSTM), and MACs by 6.4x and 58.6x, respectively, while maintaining the averaged F1-scores. Beyond quantitative gains, this work provides the first systematic ablation study dissecting the contributions of spatial-temporal components across proposed TinierHAR, prior SOTA TinyHAR, and the classical DeepConvLSTM, offering actionable insights for designing efficient HAR systems. We finally discussed the findings and suggested principled design guidelines for future efficient HAR. To catalyze edge-HAR research, we open-source all materials in this work for future benchmarking\footnote{https://github.com/zhaxidele/TinierHAR}
中文:本文提出TinierHAR超轻量深度学习模型,通过时空组件的创新融合,在保持性能的同时实现了最优计算效率,并在多个数据集上得到验证。
English: This paper presents TinierHAR, an ultra-lightweight deep learning model that achieves state-of-the-art computational efficiency while maintaining performance through innovative integration of spatial-temporal components, validated across multiple datasets.
Authors:Guoxin Zang, Xue Li, Donglin Di, Lanshun Nie, Dechen Zhan, Yang Song, Lei Fan
Abstract:
While Vision-Language Models (VLMs) have shown promising progress in general multimodal tasks, they often struggle in industrial anomaly detection and reasoning, particularly in delivering interpretable explanations and generalizing to unseen categories. This limitation stems from the inherently domain-specific nature of anomaly detection, which hinders the applicability of existing VLMs in industrial scenarios that require precise, structured, and context-aware analysis. To address these challenges, we propose SAGE, a VLM-based framework that enhances anomaly reasoning through Self-Guided Fact Enhancement (SFE) and Entropy-aware Direct Preference Optimization (E-DPO). SFE integrates domain-specific knowledge into visual reasoning via fact extraction and fusion, while E-DPO aligns model outputs with expert preferences using entropy-aware optimization. Additionally, we introduce AD-PL, a preference-optimized dataset tailored for industrial anomaly reasoning, consisting of 28,415 question-answering instances with expert-ranked responses. To evaluate anomaly reasoning models, we develop Multiscale Logical Evaluation (MLE), a quantitative framework analyzing model logic and consistency. SAGE demonstrates superior performance on industrial anomaly datasets under zero-shot and one-shot settings. The code, model and dataset are available at https://github.com/amoreZgx1n/SAGE.
中文:SAGE框架通过自引导事实增强整合领域知识,并利用熵感知直接偏好优化使模型输出符合专家偏好,有效解决了视觉语言模型在工业异常检测中的泛化与解释性不足问题,在零样本和单样本场景下表现优异。
English: The SAGE framework addresses the limitations of Vision-Language Models in industrial anomaly detection by integrating domain-specific knowledge through Self-Guided Fact Enhancement and aligning outputs with expert preferences via Entropy-aware Direct Preference Optimization, demonstrating superior performance in zero-shot and one-shot settings.
Authors:Suman Adhya, Debarshi Kumar Sanyal
Abstract:
The explosive growth of textual data over time presents a significant challenge in uncovering evolving themes and trends. Existing dynamic topic modeling techniques, while powerful, often exist in fragmented pipelines that lack robust support for interpretation and user-friendly exploration. We introduce DTECT (Dynamic Topic Explorer & Context Tracker), an end-to-end system that bridges the gap between raw textual data and meaningful temporal insights. DTECT provides a unified workflow that supports data preprocessing, multiple model architectures, and dedicated evaluation metrics to analyze the topic quality of temporal topic models. It significantly enhances interpretability by introducing LLM-driven automatic topic labeling, trend analysis via temporally salient words, interactive visualizations with document-level summarization, and a natural language chat interface for intuitive data querying. By integrating these features into a single, cohesive platform, DTECT empowers users to more effectively track and understand thematic dynamics. DTECT is open-source and available at https://github.com/AdhyaSuman/DTECT.
中文: DTECT是一个端到端系统,集成了数据预处理、动态主题建模及交互式功能(如大语言模型驱动的自动标注和自然语言查询),帮助用户有效追踪文本数据中的主题演变。
English: DTECT is an end-to-end system that integrates data preprocessing, dynamic topic modeling, and interactive features like LLM-driven labeling and natural language querying to help users effectively track evolving themes in textual data.
Authors:Jinhong Wang, Tajamul Ashraf, Zongyan Han, Jorma Laaksonen, Rao Mohammad Anwer
Abstract:
Multimodal Large Language Models (MLLMs) have significantly advanced AI-assisted medical diagnosis, but they often generate factually inconsistent responses that deviate from established medical knowledge. Retrieval-Augmented Generation (RAG) enhances factual accuracy by integrating external sources, but it presents two key challenges. First, insufficient retrieval can miss critical information, whereas excessive retrieval can introduce irrelevant or misleading content, disrupting model output. Second, even when the model initially provides correct answers, over-reliance on retrieved data can lead to factual errors. To address these issues, we introduce the Multimodal Intelligent Retrieval and Augmentation (MIRA) framework, designed to optimize factual accuracy in MLLM. MIRA consists of two key components: (1) a calibrated Rethinking and Rearrangement module that dynamically adjusts the number of retrieved contexts to manage factual risk, and (2) A medical RAG framework integrating image embeddings and a medical knowledge base with a query-rewrite module for efficient multimodal reasoning. This enables the model to effectively integrate both its inherent knowledge and external references. Our evaluation of publicly available medical VQA and report generation benchmarks demonstrates that MIRA substantially enhances factual accuracy and overall performance, achieving new state-of-the-art results. Code is released at https://github.com/mbzuai-oryx/MIRA.
中文: MIRA框架通过动态优化检索过程和整合多模态医学知识,解决了多模态大语言模型的事实不一致性问题,实现了最先进的诊断准确性。
English: The MIRA framework addresses factual inaccuracies in Multimodal Large Language Models by dynamically optimizing retrieval processes and integrating multimodal medical knowledge, achieving state-of-the-art diagnostic accuracy.
Authors:Hao Ban, Gokul Ram Subramani, Kaiyi Ji
Abstract:
Multi-task learning (MTL) enables a joint model to capture commonalities across multiple tasks, reducing computation costs and improving data efficiency. However, a major challenge in MTL optimization is task conflicts, where the task gradients differ in direction or magnitude, limiting model performance compared to single-task counterparts. Sharpness-aware minimization (SAM) minimizes task loss while simultaneously reducing the sharpness of the loss landscape. Our empirical observations show that SAM effectively mitigates task conflicts in MTL. Motivated by these findings, we explore integrating SAM into MTL but face two key challenges. While both the average loss gradient and individual task gradients-referred to as global and local information-contribute to SAM, how to combine them remains unclear. Moreover, directly computing each task gradient introduces significant computational and memory overheads. To address these challenges, we propose SAMO, a lightweight \textbf{S}harpness-\textbf{A}ware \textbf{M}ulti-task \textbf{O}ptimization approach, that leverages a joint global-local perturbation. The local perturbations are approximated using only forward passes and are layerwise normalized to improve efficiency. Extensive experiments on a suite of multi-task benchmarks demonstrate both the effectiveness and efficiency of our method. Code is available at https://github.com/OptMN-Lab/SAMO.
中文: 多任务学习面临任务冲突限制性能,但提出的SAMO方法通过轻量级方式有效结合全局与局部梯度信息,高效缓解了这些问题。
English: Multi-task learning faces task conflicts that limit performance, but the proposed SAMO method effectively combines global and local gradient information with a lightweight approach to mitigate these issues efficiently.
Authors:Pierre Marza, Leo Fillioux, Sofiène Boutaj, Kunal Mahatha, Christian Desrosiers, Pablo Piantanida, Jose Dolz, Stergios Christodoulidis, Maria Vakalopoulou
Abstract:
Progress in a research field can be hard to assess, in particular when many concurrent methods are proposed in a short period of time. This is the case in digital pathology, where many foundation models have been released recently to serve as feature extractors for tile-level images, being used in a variety of downstream tasks, both for tile- and slide-level problems. Benchmarking available methods then becomes paramount to get a clearer view of the research landscape. In particular, in critical domains such as healthcare, a benchmark should not only focus on evaluating downstream performance, but also provide insights about the main differences between methods, and importantly, further consider uncertainty and robustness to ensure a reliable usage of proposed models. For these reasons, we introduce THUNDER, a tile-level benchmark for digital pathology foundation models, allowing for efficient comparison of many models on diverse datasets with a series of downstream tasks, studying their feature spaces and assessing the robustness and uncertainty of predictions informed by their embeddings. THUNDER is a fast, easy-to-use, dynamic benchmark that can already support a large variety of state-of-the-art foundation, as well as local user-defined models for direct tile-based comparison. In this paper, we provide a comprehensive comparison of 23 foundation models on 16 different datasets covering diverse tasks, feature analysis, and robustness. The code for THUNDER is publicly available at https://github.com/MICS-Lab/thunder.
中文: THUNDER 是一个用于数字病理学的动态基准,能够跨多种数据集高效比较基础模型,评估其性能、特征空间、鲁棒性和不确定性,以推动医疗领域模型评估的可靠性。
English: THUNDER is a dynamic benchmark for digital pathology that enables efficient comparison of foundation models across diverse datasets, assessing performance, feature spaces, robustness, and uncertainty to advance reliable model evaluation in healthcare.
Authors:Yuchen Zhu, Cheng Shi, Dingyou Wang, Jiajin Tang, Zhengxuan Wei, Yu Wu, Guanbin Li, Sibei Yang
Abstract:
Class-incremental/Continual image segmentation (CIS) aims to train an image segmenter in stages, where the set of available categories differs at each stage. To leverage the built-in objectness of query-based transformers, which mitigates catastrophic forgetting of mask proposals, current methods often decouple mask generation from the continual learning process. This study, however, identifies two key issues with decoupled frameworks: loss of plasticity and heavy reliance on input data order. To address these, we conduct an in-depth investigation of the built-in objectness and find that highly aggregated image features provide a shortcut for queries to generate masks through simple feature alignment. Based on this, we propose SimCIS, a simple yet powerful baseline for CIS. Its core idea is to directly select image features for query assignment, ensuring "perfect alignment" to preserve objectness, while simultaneously allowing queries to select new classes to promote plasticity. To further combat catastrophic forgetting of categories, we introduce cross-stage consistency in selection and an innovative "visual query"-based replay mechanism. Experiments demonstrate that SimCIS consistently outperforms state-of-the-art methods across various segmentation tasks, settings, splits, and input data orders. All models and codes will be made publicly available at https://github.com/SooLab/SimCIS.
中文: 本文提出SimCIS这一新颖的类增量图像分割框架,通过直接特征对齐保持对象性,结合跨阶段一致性和视觉查询回放机制增强可塑性,在多种设置下均优于现有方法。
English: This paper introduces SimCIS, a novel framework for class-incremental image segmentation that preserves objectness through direct feature alignment and enhances plasticity with cross-stage consistency and a visual query replay mechanism, outperforming existing methods across various settings.
Authors:Anwoy Chatterjee, H S V N S Kowndinya Renduchintala, Sumit Bhatia, Tanmoy Chakraborty
Abstract:
Instruction Tuning has emerged as a pivotal post-training paradigm that enables pre-trained language models to better follow user instructions. Despite its significance, little attention has been given to optimizing the loss function used. A fundamental, yet often overlooked, question is whether the conventional auto-regressive objective - where loss is computed only on response tokens, excluding prompt tokens - is truly optimal for instruction tuning. In this work, we systematically investigate the impact of differentially weighting prompt and response tokens in instruction tuning loss, and propose Weighted Instruction Tuning (WIT) as a better alternative to conventional instruction tuning. Through extensive experiments on five language models of different families and scale, three finetuning datasets of different sizes, and five diverse evaluation benchmarks, we show that the standard instruction tuning loss often yields suboptimal performance and limited robustness to input prompt variations. We find that a low-to-moderate weight for prompt tokens coupled with a moderate-to-high weight for response tokens yields the best-performing models across settings and also serve as better starting points for the subsequent preference alignment training. These findings highlight the need to reconsider instruction tuning loss and offer actionable insights for developing more robust and generalizable models. Our code is open-sourced at https://github.com/kowndinya-renduchintala/WIT.
中文: 本研究提出加权指令调优(WIT),通过实验证明在损失函数中对提示词和响应词进行差异化加权,能显著提升模型性能与鲁棒性,优于传统指令调优方法。
English: This research introduces Weighted Instruction Tuning (WIT), demonstrating that differentially weighting prompt and response tokens in the loss function outperforms conventional instruction tuning by enhancing model performance and robustness across diverse settings.
Authors:Shoutao Guo, Xiang Li, Mengge Liu, Wei Chen, Yang Feng
Abstract:
Streaming speech translation (StreamST) requires determining appropriate timing, known as policy, to generate translations while continuously receiving source speech inputs, balancing low latency with high translation quality. However, existing StreamST methods typically operate on sentence-level speech segments, referred to as simultaneous speech translation (SimulST). In practice, they require collaboration with segmentation models to accomplish StreamST, where the truncated speech segments constrain SimulST models to make policy decisions and generate translations based on limited contextual information. Moreover, SimulST models struggle to learn effective policies due to the complexity of speech inputs and cross-lingual generation. To address these challenges, we propose StreamUni, which achieves StreamST through a unified Large Speech-Language Model (LSLM). Specifically, StreamUni incorporates speech Chain-of-Thought (CoT) in guiding the LSLM to generate multi-stage outputs. Leveraging these multi-stage outputs, StreamUni simultaneously accomplishes speech segmentation, policy decision, and translation generation, completing StreamST without requiring massive policy-specific training. Additionally, we propose a streaming CoT training method that enhances low-latency policy decisions and generation capabilities using limited CoT data. Experiments demonstrate that our approach achieves state-of-the-art performance on StreamST tasks.
中文:StreamUni通过统一的大型语音语言模型,利用语音思维链实现流式语音翻译,同时完成语音分割、策略决策和翻译生成,无需大量策略训练,取得了领先性能。
English: StreamUni introduces a unified Large Speech-Language Model that utilizes speech Chain-of-Thought to perform streaming speech translation by simultaneously handling segmentation, policy decisions, and translation generation without extensive policy training, achieving state-of-the-art results.
Authors:Jiaxin Huang, Ziwen Li, Hanlve Zhang, Runnan Chen, Xiao He, Yandong Guo, Wenping Wang, Tongliang Liu, Mingming Gong
Abstract:
The integration of language and 3D perception is critical for embodied AI and robotic systems to perceive, understand, and interact with the physical world. Spatial reasoning, a key capability for understanding spatial relationships between objects, remains underexplored in current 3D vision-language research. Existing datasets often mix semantic cues (e.g., object name) with spatial context, leading models to rely on superficial shortcuts rather than genuinely interpreting spatial relationships. To address this gap, we introduce S\textsc{urprise}3D, a novel dataset designed to evaluate language-guided spatial reasoning segmentation in complex 3D scenes. S\textsc{urprise}3D consists of more than 200k vision language pairs across 900+ detailed indoor scenes from ScanNet++ v2, including more than 2.8k unique object classes. The dataset contains 89k+ human-annotated spatial queries deliberately crafted without object name, thereby mitigating shortcut biases in spatial understanding. These queries comprehensively cover various spatial reasoning skills, such as relative position, narrative perspective, parametric perspective, and absolute distance reasoning. Initial benchmarks demonstrate significant challenges for current state-of-the-art expert 3D visual grounding methods and 3D-LLMs, underscoring the necessity of our dataset and the accompanying 3D Spatial Reasoning Segmentation (3D-SRS) benchmark suite. S\textsc{urprise}3D and 3D-SRS aim to facilitate advancements in spatially aware AI, paving the way for effective embodied interaction and robotic planning. The code and datasets can be found in https://github.com/liziwennba/SUPRISE.
中文: S\textsc{urprise}3D数据集通过提供超过20万个排除物体名称的人类标注空间查询视觉语言对,弥补了3D视觉语言研究中的空白,挑战现有模型真正理解空间关系,推动空间感知人工智能在具身交互中的发展。
English: The S\textsc{urprise}3D dataset addresses the gap in 3D vision-language research by providing over 200k vision-language pairs with human-annotated spatial queries that exclude object names, challenging current models to genuinely interpret spatial relationships and advancing spatially aware AI for embodied interaction.
Authors:Mélanie Roschewitz, Raghav Mehta, Fabio de Sousa Ribeiro, Ben Glocker
Abstract:
We conduct an extensive study on the state of calibration under real-world dataset shift for image classification. Our work provides important insights on the choice of post-hoc and in-training calibration techniques, and yields practical guidelines for all practitioners interested in robust calibration under shift. We compare various post-hoc calibration methods, and their interactions with common in-training calibration strategies (e.g., label smoothing), across a wide range of natural shifts, on eight different classification tasks across several imaging domains. We find that: (i) simultaneously applying entropy regularisation and label smoothing yield the best calibrated raw probabilities under dataset shift, (ii) post-hoc calibrators exposed to a small amount of semantic out-of-distribution data (unrelated to the task) are most robust under shift, (iii) recent calibration methods specifically aimed at increasing calibration under shifts do not necessarily offer significant improvements over simpler post-hoc calibration methods, (iv) improving calibration under shifts often comes at the cost of worsening in-distribution calibration. Importantly, these findings hold for randomly initialised classifiers, as well as for those finetuned from foundation models, the latter being consistently better calibrated compared to models trained from scratch. Finally, we conduct an in-depth analysis of ensembling effects, finding that (i) applying calibration prior to ensembling (instead of after) is more effective for calibration under shifts, (ii) for ensembles, OOD exposure deteriorates the ID-shifted calibration trade-off, (iii) ensembling remains one of the most effective methods to improve calibration robustness and, combined with finetuning from foundation models, yields best calibration results overall.
中文摘要:本研究通过图像分类在真实数据集偏移下的校准分析发现,熵正则化与标签平滑结合能产生最佳校准概率,使用分布外数据的后处理校准方法更具鲁棒性,而集成学习与基础模型微调相结合可实现最优校准效果。
English Summary: This comprehensive study on image classification calibration under real-world dataset shift reveals that combining entropy regularization with label smoothing produces the best-calibrated probabilities, while post-hoc methods using out-of-distribution data show superior robustness, with ensemble methods and foundation model fine-tuning delivering optimal calibration performance.
Authors:Dren Fazlija, Monty-Maximilian Zühlke, Johanna Schrader, Arkadij Orlov, Clara Stein, Iyiola E. Olatunji, Daniel Kudenko
Abstract:
Unrestricted adversarial attacks aim to fool computer vision models without being constrained by $\ell_p$-norm bounds to remain imperceptible to humans, for example, by changing an object's color. This allows attackers to circumvent traditional, norm-bounded defense strategies such as adversarial training or certified defense strategies. However, due to their unrestricted nature, there are also no guarantees of norm-based imperceptibility, necessitating human evaluations to verify just how authentic these adversarial examples look. While some related work assesses this vital quality of adversarial attacks, none provide statistically significant insights. This issue necessitates a unified framework that supports and streamlines such an assessment for evaluating and comparing unrestricted attacks. To close this gap, we introduce SCOOTER - an open-source, statistically powered framework for evaluating unrestricted adversarial examples. Our contributions are: $(i)$ best-practice guidelines for crowd-study power, compensation, and Likert equivalence bounds to measure imperceptibility; $(ii)$ the first large-scale human vs. model comparison across 346 human participants showing that three color-space attacks and three diffusion-based attacks fail to produce imperceptible images. Furthermore, we found that GPT-4o can serve as a preliminary test for imperceptibility, but it only consistently detects adversarial examples for four out of six tested attacks; $(iii)$ open-source software tools, including a browser-based task template to collect annotations and analysis scripts in Python and R; $(iv)$ an ImageNet-derived benchmark dataset containing 3K real images, 7K adversarial examples, and over 34K human ratings. Our findings demonstrate that automated vision systems do not align with human perception, reinforcing the need for a ground-truth SCOOTER benchmark.
中文: 无限制对抗攻击通过改变颜色等特征规避传统防御,但缺乏人类不可感知性的保证,为此开发了SCOOTER框架,该框架通过大规模人类研究评估此类攻击,并揭示自动化视觉系统与人类感知之间存在偏差。
English: Unrestricted adversarial attacks bypass traditional defenses by altering features like color, but lack guarantees of human imperceptibility, leading to the development of SCOOTER, a framework that evaluates such attacks through large-scale human studies and reveals a misalignment between automated vision systems and human perception.
Authors:Peizhang Shao, Linrui Xu, Jinxi Wang, Wei Zhou, Xingyu Wu
Abstract:
This paper establishes the first comprehensive review of Large Language Models (LLMs) applied within the legal domain. It pioneers an innovative dual lens taxonomy that integrates legal reasoning frameworks and professional ontologies to systematically unify historical research and contemporary breakthroughs. Transformer-based LLMs, which exhibit emergent capabilities such as contextual reasoning and generative argumentation, surmount traditional limitations by dynamically capturing legal semantics and unifying evidence reasoning. Significant progress is documented in task generalization, reasoning formalization, workflow integration, and addressing core challenges in text processing, knowledge integration, and evaluation rigor via technical innovations like sparse attention mechanisms and mixture-of-experts architectures. However, widespread adoption of LLM introduces critical challenges: hallucination, explainability deficits, jurisdictional adaptation difficulties, and ethical asymmetry. This review proposes a novel taxonomy that maps legal roles to NLP subtasks and computationally implements the Toulmin argumentation framework, thus systematizing advances in reasoning, retrieval, prediction, and dispute resolution. It identifies key frontiers including low-resource systems, multimodal evidence integration, and dynamic rebuttal handling. Ultimately, this work provides both a technical roadmap for researchers and a conceptual framework for practitioners navigating the algorithmic future, laying a robust foundation for the next era of legal artificial intelligence. We have created a GitHub repository to index the relevant papers: https://github.com/Kilimajaro/LLMs_Meet_Law.
中文摘要:本文首次对大型语言模型在法律领域的应用进行全面综述,通过整合法律推理框架与专业本体提出创新分类法,系统梳理技术进展并应对幻觉、可解释性等核心挑战。
English Summary: This paper presents the first comprehensive review of Large Language Models in legal applications, introducing a novel taxonomy that integrates legal reasoning with professional frameworks to systematize advances while addressing challenges like hallucination and ethical concerns.
Authors:David Pujol-Perich, Sergio Escalera, Albert Clapés
Abstract:
Video Temporal Grounding (VTG) involves Moment Retrieval (MR) and Highlight Detection (HD) based on textual queries. For this, most methods rely solely on final-layer features of frozen large pre-trained backbones, limiting their adaptability to new domains. While full fine-tuning is often impractical, parameter-efficient fine-tuning -- and particularly side-tuning (ST) -- has emerged as an effective alternative. However, prior ST approaches this problem from a frame-level refinement perspective, overlooking the inherent sparse nature of MR. To address this, we propose the Sparse-Dense Side-Tuner (SDST), the first anchor-free ST architecture for VTG. We also introduce the Reference-based Deformable Self-Attention, a novel mechanism that enhances the context modeling of the deformable attention -- a key limitation of existing anchor-free methods. Additionally, we present the first effective integration of InternVideo2 backbone into an ST framework, showing its profound implications in performance. Overall, our method significantly improves existing ST methods, achieving highly competitive or SOTA results on QVHighlights, TACoS, and Charades-STA, while reducing up to a 73% the parameter count w.r.t. the existing SOTA methods. The code is publicly accessible at https://github.com/davidpujol/SDST.
中文: 提出的稀疏-密集侧调器(SDST)采用无锚点侧调架构和新型注意力机制,在极大减少参数的同时显著提升了视频时序定位性能。
English: The proposed Sparse-Dense Side-Tuner (SDST) introduces an anchor-free side-tuning architecture with novel attention mechanisms that significantly enhance video temporal grounding performance while drastically reducing parameter requirements.
Authors:Zhijin Dong
Abstract:
Post-training alignment of large language models (LLMs) is a critical challenge, as not all tokens contribute equally to model performance. This paper introduces a selective alignment strategy that prioritizes high-impact tokens within preference pairs, leveraging token-level log-probability differences between the current policy and a reference model. By focusing on these informative tokens, our approach reduces computational overhead and enhances alignment fidelity. We further explore the role of reference model quality, demonstrating that stronger reference models significantly improve token selection accuracy and overall optimization effectiveness. Comprehensive experiments on benchmarks such as Arena-Hard and MT-Bench validate the superiority of our Selective-DPO method over standard DPO and distillation-based baselines. Our findings highlight the importance of token-level optimization and reference model selection in advancing preference alignment for LLMs. The code is available at https://github.com/Dongzhijin/SDPO.
中文摘要:本文提出一种针对大语言模型的选择性对齐策略,通过利用当前策略与参考模型之间的词元级对数概率差异来优化高影响力词元,在降低计算成本的同时借助更优的参考模型提升对齐效果。
English Summary: This paper proposes a selective alignment strategy for large language models that optimizes high-impact tokens using token-level log-probability differences, reducing computational costs while improving alignment performance through better reference model selection.
Authors:Ethan Dack, Chengliang Dai
Abstract:
Recent works have revisited the infamous task ``Name That Dataset'', demonstrating that non-medical datasets contain underlying biases and that the dataset origin task can be solved with high accuracy. In this work, we revisit the same task applied to popular open-source chest X-ray datasets. Medical images are naturally more difficult to release for open-source due to their sensitive nature, which has led to certain open-source datasets being extremely popular for research purposes. By performing the same task, we wish to explore whether dataset bias also exists in these datasets. To extend our work, we apply simple transformations to the datasets, repeat the same task, and perform an analysis to identify and explain any detected biases. Given the importance of AI applications in medical imaging, it's vital to establish whether modern methods are taking shortcuts or are focused on the relevant pathology. We implement a range of different network architectures on the datasets: NIH, CheXpert, MIMIC-CXR and PadChest. We hope this work will encourage more explainable research being performed in medical imaging and the creation of more open-source datasets in the medical domain. Our code can be found here: https://github.com/eedack01/x_ray_ds_bias.
中文: 本研究通过应用“数据集识别”任务及变换方法,探究了常用开源胸部X光数据集中的偏差问题,旨在验证AI模型是否依赖捷径而非病理特征,并在NIH、CheXpert、MIMIC-CXR和PadChest数据集上采用了多种网络架构进行分析。
English: This study investigates dataset bias in popular open-source chest X-ray datasets by applying the "Name That Dataset" task and transformations, aiming to determine if AI models rely on shortcuts rather than pathology, using multiple network architectures on NIH, CheXpert, MIMIC-CXR, and PadChest datasets.
Authors:Wei Shang, Dongwei Ren, Wanying Zhang, Pengfei Zhu, Qinghua Hu, Wangmeng Zuo
Abstract:
Local motion blur in digital images originates from the relative motion between dynamic objects and static imaging systems during exposure. Existing deblurring methods face significant challenges in addressing this problem due to their inefficient allocation of computational resources and inadequate handling of spatially varying blur patterns. To overcome these limitations, we first propose a trainable mask predictor that identifies blurred regions in the image. During training, we employ blur masks to exclude sharp regions. For inference optimization, we implement structural reparameterization by converting $3\times 3$ convolutions to computationally efficient $1\times 1$ convolutions, enabling pixel-level pruning of sharp areas to reduce computation. Second, we develop an intra-frame motion analyzer that translates relative pixel displacements into motion trajectories, establishing adaptive guidance for region-specific blur restoration. Our method is trained end-to-end using a combination of reconstruction loss, reblur loss, and mask loss guided by annotated blur masks. Extensive experiments demonstrate superior performance over state-of-the-art methods on both local and global blur datasets while reducing FLOPs by 49\% compared to SOTA models (e.g., LMD-ViT). The source code is available at https://github.com/shangwei5/M2AENet.
Chinese: 本文提出了一种高效去模糊方法,通过可训练的掩码预测器识别模糊区域和帧内运动分析器指导恢复,在降低49%计算成本的同时实现了卓越性能。
English: This paper introduces an efficient deblurring method that uses a trainable mask predictor to identify blurred regions and an intra-frame motion analyzer to guide restoration, achieving superior performance while reducing computational costs by 49%.
Authors:Feng Liu, Lingna Gu, Chen Shi, Xiaolan Fu
Abstract:
Dynamic Facial Expression Recognition(DFER) is a rapidly evolving field of research that focuses on the recognition of time-series facial expressions. While previous research on DFER has concentrated on feature learning from a deep learning perspective, we put forward an AU-enhanced Dynamic Facial Expression Recognition architecture, namely AU-DFER, that incorporates AU-expression knowledge to enhance the effectiveness of deep learning modeling. In particular, the contribution of the Action Units(AUs) to different expressions is quantified, and a weight matrix is designed to incorporate a priori knowledge. Subsequently, the knowledge is integrated with the learning outcomes of a conventional deep learning network through the introduction of AU loss. The design is incorporated into the existing optimal model for dynamic expression recognition for the purpose of validation. Experiments are conducted on three recent mainstream open-source approaches to DFER on the principal datasets in this field. The results demonstrate that the proposed architecture outperforms the state-of-the-art(SOTA) methods without the need for additional arithmetic and generally produces improved results. Furthermore, we investigate the potential of AU loss function redesign to address data label imbalance issues in established dynamic expression datasets. To the best of our knowledge, this is the first attempt to integrate quantified AU-expression knowledge into various DFER models. We also devise strategies to tackle label imbalance, or minor class problems. Our findings suggest that employing a diverse strategy of loss function design can enhance the effectiveness of DFER. This underscores the criticality of addressing data imbalance challenges in mainstream datasets within this domain. The source code is available at https://github.com/Cross-Innovation-Lab/AU-DFER.
中文摘要:AU-DFER架构通过设计权重矩阵和AU损失函数,将量化的动作单元知识融入深度学习模型,在提升动态表情识别性能的同时有效解决了数据标签不平衡问题,其表现优于现有最优方法。
English Summary: The AU-DFER architecture enhances dynamic facial expression recognition by integrating quantified Action Unit knowledge with deep learning through a designed weight matrix and AU loss, demonstrating superior performance over existing methods while addressing data imbalance issues.
Authors:Domenik Eichhorn, Nick Poser, Maximilian Schweikart, Ina Schaefer
Abstract:
Hybrid solvers for combinatorial optimization problems combine the advantages of classical and quantum computing to overcome difficult computational challenges. Although their theoretical performance seems promising, their practical applicability is challenging due to the lack of a technological stack that can seamlessly integrate quantum solutions with existing classical optimization frameworks. We tackle this challenge by introducing the ProvideQ toolbox, a software tool that enables users to easily adapt and configure hybrid solvers via Meta-Solver strategies. A Meta-Solver strategy implements decomposition techniques, which splits problems into classical and quantum subroutines. The ProvideQ toolbox enables the interactive creation of such decompositions via a Meta-Solver configuration tool. It combines well-established classical optimization techniques with quantum circuits that are seamlessly executable on multiple backends. This paper introduces the technical details of the ProvideQ toolbox, explains its architecture, and demonstrates possible applications for several real-world use cases. Our proof of concept shows that Meta-Solver strategies already enable the application of quantum subroutines today, however, more sophisticated hardware is required to make their performance competitive.
中文: ProvideQ工具箱通过元求解器策略融合经典与量子计算,为组合优化问题提供实用混合解决方案,尽管当前硬件性能仍有待提升。
English: The ProvideQ toolbox introduces Meta-Solver strategies to bridge classical and quantum computing for combinatorial optimization, enabling practical hybrid solutions despite current hardware limitations.
Authors:Federico Del Pup, Riccardo Brun, Filippo Iotti, Edoardo Paccagnella, Mattia Pezzato, Sabrina Bertozzo, Andrea Zanola, Louis Fabrice Tshimanga, Henning Müller, Manfredo Atzori
Abstract:
Electroencephalography (EEG) is establishing itself as an important, low-cost, noninvasive diagnostic tool for the early detection of Parkinson's Disease (PD). In this context, EEG-based Deep Learning (DL) models have shown promising results due to their ability to discover highly nonlinear patterns within the signal. However, current state-of-the-art DL models suffer from poor generalizability caused by high inter-subject variability. This high variability underscores the need for enhancing model generalizability by developing new architectures better tailored to EEG data. This paper introduces TransformEEG, a hybrid Convolutional-Transformer designed for Parkinson's disease detection using EEG data. Unlike transformer models based on the EEGNet structure, TransformEEG incorporates a depthwise convolutional tokenizer. This tokenizer is specialized in generating tokens composed by channel-specific features, which enables more effective feature mixing within the self-attention layers of the transformer encoder. To evaluate the proposed model, four public datasets comprising 290 subjects (140 PD patients, 150 healthy controls) were harmonized and aggregated. A 10-outer, 10-inner Nested-Leave-N-Subjects-Out (N-LNSO) cross-validation was performed to provide an unbiased comparison against seven other consolidated EEG deep learning models. TransformEEG achieved the highest balanced accuracy's median (78.45%) as well as the lowest interquartile range (6.37%) across all the N-LNSO partitions. When combined with data augmentation and threshold correction, median accuracy increased to 80.10%, with an interquartile range of 5.74%. In conclusion, TransformEEG produces more consistent and less skewed results. It demonstrates a substantial reduction in variability and more reliable PD detection using EEG data compared to the other investigated models.
中文: 脑电图结合深度学习为帕金森病早期检测提供了有前景的无创方法,但现有模型因受试者间高变异性面临泛化挑战。TransformEEG这一新型混合卷积-Transformer架构通过生成通道特定令牌改进特征融合,在多个数据集上相比现有模型实现了更高的准确率和稳定性。
English: Electroencephalography (EEG) combined with deep learning offers a promising non-invasive method for early Parkinson's Disease detection, though current models face generalizability challenges due to high inter-subject variability. TransformEEG, a novel hybrid Convolutional-Transformer architecture, addresses this by generating channel-specific tokens for improved feature mixing, achieving superior accuracy and consistency across multiple datasets compared to existing models.
Authors:Marc Lafon, Yannis Karmim, Julio Silva-Rodríguez, Paul Couairon, Clément Rambour, Raphaël Fournier-Sniehotta, Ismail Ben Ayed, Jose Dolz, Nicolas Thome
Abstract:
Reliable Uncertainty Quantification (UQ) and failure prediction remain open challenges for Vision-Language Models (VLMs). We introduce ViLU, a new Vision-Language Uncertainty quantification framework that contextualizes uncertainty estimates by leveraging all task-relevant textual representations. ViLU constructs an uncertainty-aware multi-modal representation by integrating the visual embedding, the predicted textual embedding, and an image-conditioned textual representation via cross-attention. Unlike traditional UQ methods based on loss prediction, ViLU trains an uncertainty predictor as a binary classifier to distinguish correct from incorrect predictions using a weighted binary cross-entropy loss, making it loss-agnostic. In particular, our proposed approach is well-suited for post-hoc settings, where only vision and text embeddings are available without direct access to the model itself. Extensive experiments on diverse datasets show the significant gains of our method compared to state-of-the-art failure prediction methods. We apply our method to standard classification datasets, such as ImageNet-1k, as well as large-scale image-caption datasets like CC12M and LAION-400M. Ablation studies highlight the critical role of our architecture and training in achieving effective uncertainty quantification. Our code is publicly available and can be found here: https://github.com/ykrmm/ViLU.
中文: ViLU是一种新颖的视觉语言不确定性量化框架,通过整合多模态表示并训练一个与损失无关的二元分类器作为不确定性预测器,显著提升了故障预测能力,在多个数据集上展现出卓越性能。
English: ViLU is a novel vision-language uncertainty quantification framework that improves failure prediction by integrating multimodal representations and training an uncertainty predictor as a loss-agnostic binary classifier, demonstrating superior performance across diverse datasets.
Authors:Ruixiang Chen, Guolei Sun, Yawei Li, Jie Qin, Luca Benini
Abstract:
This paper presents enhancements to the SAM2 framework for video object tracking task, addressing challenges such as occlusions, background clutter, and target reappearance. We introduce a hierarchical motion estimation strategy, combining lightweight linear prediction with selective non-linear refinement to improve tracking accuracy without requiring additional training. In addition, we optimize the memory bank by distinguishing long-term and short-term memory frames, enabling more reliable tracking under long-term occlusions and appearance changes. Experimental results show consistent improvements across different model scales. Our method achieves state-of-the-art performance on LaSOT and LaSOText with the large model, achieving 9.6% and 7.2% relative improvements in AUC over the original SAM2, and demonstrates even larger relative gains on smaller models, highlighting the effectiveness of our trainless, low-overhead improvements for boosting long-term tracking performance. The code is available at https://github.com/LouisFinner/HiM2SAM.
Chinese: 本文通过引入分层运动估计策略和优化记忆库,改进了SAM2框架的视频目标跟踪能力,在无需额外训练的情况下,在多个基准测试中实现了最先进的性能。
English: This paper enhances the SAM2 framework for video object tracking by introducing a hierarchical motion estimation strategy and an optimized memory bank, achieving state-of-the-art performance on benchmarks without requiring additional training.
Authors:Kuiyuan Sun, Yuxuan Zhang, Jichao Zhang, Jiaming Liu, Wei Wang, Niculae Sebe, Yao Zhao
Abstract:
While diffusion-based methods have shown impressive capabilities in capturing diverse and complex hairstyles, their ability to generate consistent and high-quality multi-view outputs -- crucial for real-world applications such as digital humans and virtual avatars -- remains underexplored. In this paper, we propose Stable-Hair v2, a novel diffusion-based multi-view hair transfer framework. To the best of our knowledge, this is the first work to leverage multi-view diffusion models for robust, high-fidelity, and view-consistent hair transfer across multiple perspectives. We introduce a comprehensive multi-view training data generation pipeline comprising a diffusion-based Bald Converter, a data-augment inpainting model, and a face-finetuned multi-view diffusion model to generate high-quality triplet data, including bald images, reference hairstyles, and view-aligned source-bald pairs. Our multi-view hair transfer model integrates polar-azimuth embeddings for pose conditioning and temporal attention layers to ensure smooth transitions between views. To optimize this model, we design a novel multi-stage training strategy consisting of pose-controllable latent IdentityNet training, hair extractor training, and temporal attention training. Extensive experiments demonstrate that our method accurately transfers detailed and realistic hairstyles to source subjects while achieving seamless and consistent results across views, significantly outperforming existing methods and establishing a new benchmark in multi-view hair transfer. Code is publicly available at https://github.com/sunkymepro/StableHairV2.
中文摘要:Stable-Hair v2提出了一种基于扩散模型的新型多视角发型迁移框架,通过极坐标嵌入和时序注意力层实现跨视角的连贯高质量发型转换,其性能显著优于现有方法。
English Summary: Stable-Hair v2 introduces a novel diffusion-based framework that achieves robust, high-fidelity multi-view hair transfer through innovative pose conditioning and temporal attention mechanisms, significantly outperforming existing methods.
Authors:Chunyan Wang, Dong Zhang, Jinhui Tang
Abstract:
Weakly-supervised semantic segmentation aims to assign category labels to each pixel using weak annotations, significantly reducing manual annotation costs. Although existing methods have achieved remarkable progress in well-lit scenarios, their performance significantly degrades in low-light environments due to two fundamental limitations: severe image quality degradation (e.g., low contrast, noise, and color distortion) and the inherent constraints of weak supervision. These factors collectively lead to unreliable class activation maps and semantically ambiguous pseudo-labels, ultimately compromising the model's ability to learn discriminative feature representations. To address these problems, we propose Diffusion-Guided Knowledge Distillation for Weakly-Supervised Low-light Semantic Segmentation (DGKD-WLSS), a novel framework that synergistically combines Diffusion-Guided Knowledge Distillation (DGKD) with Depth-Guided Feature Fusion (DGF2). DGKD aligns normal-light and low-light features via diffusion-based denoising and knowledge distillation, while DGF2 integrates depth maps as illumination-invariant geometric priors to enhance structural feature learning. Extensive experiments demonstrate the effectiveness of DGKD-WLSS, which achieves state-of-the-art performance in weakly supervised semantic segmentation tasks under low-light conditions. The source codes have been released at:https://github.com/ChunyanWang1/DGKD-WLSS.
中文: 提出的DGKD-WLSS框架通过结合扩散引导的知识蒸馏与深度引导特征融合,有效提升了弱光环境下弱监督语义分割的性能,达到了当前最优水平。
English: The proposed DGKD-WLSS framework enhances weakly-supervised semantic segmentation in low-light conditions by combining diffusion-guided knowledge distillation with depth-guided feature fusion, achieving state-of-the-art performance.
Authors:Joelle Hanna, Linus Scheibenreif, Damian Borth
Abstract:
Remote sensing data is commonly used for tasks such as flood mapping, wildfire detection, or land-use studies. For each task, scientists carefully choose appropriate modalities or leverage data from purpose-built instruments. Recent work on remote sensing foundation models pre-trains computer vision models on large amounts of remote sensing data. These large-scale models tend to focus on specific modalities, often optical RGB or multispectral data. For many important applications, this introduces a mismatch between the application modalities and the pre-training data. Moreover, the large size of foundation models makes them expensive and difficult to fine-tune on typically small datasets for each task. We address this mismatch with MAPEX, a remote sensing foundation model based on mixture-of-modality experts. MAPEX is pre-trained on multi-modal remote sensing data with a novel modality-conditioned token routing mechanism that elicits modality-specific experts. To apply the model on a specific task, we propose a modality aware pruning technique, which only retains experts specialized for the task modalities. This yields efficient modality-specific models while simplifying fine-tuning and deployment for the modalities of interest. We experimentally validate MAPEX on diverse remote sensing datasets and show strong performance compared to fully supervised training and state-of-the-art remote sensing foundation models. Code is available at https://github.com/HSG-AIML/MAPEX.
中文摘要:MAPEX是一种基于模态专家混合的遥感基础模型,通过新颖的模态条件令牌路由机制处理多模态数据,并采用模态感知剪枝技术生成高效的任务专用模型,在多种遥感数据集上表现出优越性能。
English Summary: MAPEX is a remote sensing foundation model that uses a mixture-of-modality experts approach to address the modality mismatch between pre-training data and application needs, enabling efficient task-specific adaptation through modality-aware pruning.
Authors:Cunhang Fan, Sheng Zhang, Jingjing Zhang, Enrui Liu, Xinhui Li, Gangming Zhao, Zhao Lv
Abstract:
Decoding speech from brain signals is a challenging research problem. Although existing technologies have made progress in reconstructing the mel spectrograms of auditory stimuli at the word or letter level, there remain core challenges in the precise reconstruction of minute-level continuous imagined speech: traditional models struggle to balance the efficiency of temporal dependency modeling and information retention in long-sequence decoding. To address this issue, this paper proposes the Dynamic Multiscale Fusion Network (DMF2Mel), which consists of four core components: the Dynamic Contrastive Feature Aggregation Module (DC-FAM), the Hierarchical Attention-Guided Multi-Scale Network (HAMS-Net), the SplineMap attention mechanism, and the bidirectional state space module (convMamba). Specifically, the DC-FAM separates speech-related "foreground features" from noisy "background features" through local convolution and global attention mechanisms, effectively suppressing interference and enhancing the representation of transient signals. HAMS-Net, based on the U-Net framework,achieves cross-scale fusion of high-level semantics and low-level details. The SplineMap attention mechanism integrates the Adaptive Gated Kolmogorov-Arnold Network (AGKAN) to combine global context modeling with spline-based local fitting. The convMamba captures long-range temporal dependencies with linear complexity and enhances nonlinear dynamic modeling capabilities. Results on the SparrKULee dataset show that DMF2Mel achieves a Pearson correlation coefficient of 0.074 in mel spectrogram reconstruction for known subjects (a 48% improvement over the baseline) and 0.048 for unknown subjects (a 35% improvement over the baseline).Code is available at: https://github.com/fchest/DMF2Mel.
Chinese: 本文提出动态多尺度融合网络(DMF2Mel)来解决从脑信号中重建连续想象语音的难题,在SparrKULee数据集上对已知和未知被试的梅尔频谱图重建均实现了显著提升。
English: This paper introduces the Dynamic Multiscale Fusion Network (DMF2Mel) to address challenges in reconstructing continuous imagined speech from brain signals, achieving significant improvements in mel spectrogram reconstruction for both known and unknown subjects on the SparrKULee dataset.
Authors:Shuaijin Wan
Abstract:
Human motion is a continuous physical process in 3D space, governed by complex dynamic and kinematic constraints. Existing methods typically represent the human pose as an abstract graph structure, neglecting the intrinsic physical dependencies between joints, which increases learning difficulty and makes the model prone to generating unrealistic motions. In this paper, we propose GGMotion, a group graph dynamics-kinematics network that models human topology in groups to better leverage dynamics and kinematics priors. To preserve the geometric equivariance in 3D space, we propose a novel radial field for the graph network that captures more comprehensive spatio-temporal dependencies by aggregating joint features through spatial and temporal edges. Inter-group and intra-group interaction modules are employed to capture the dependencies of joints at different scales. Combined with equivariant multilayer perceptrons (MLP), joint position features are updated in each group through parallelized dynamics-kinematics propagation to improve physical plausibility. Meanwhile, we introduce an auxiliary loss to supervise motion priors during training. Extensive experiments on three standard benchmarks, including Human3.6M, CMU-Mocap, and 3DPW, demonstrate the effectiveness and superiority of our approach, achieving a significant performance margin in short-term motion prediction. The code is available at https://github.com/inkcat520/GGMotion.git.
中文:提出的GGMotion模型通过分组建模人体关节,更好地融合动力学和运动学先验,采用创新的径向场图网络和交互模块增强时空依赖性与物理合理性,在多个标准数据集上取得了优越的预测性能。
English: The proposed GGMotion model improves human motion prediction by grouping joints to better incorporate physical dynamics and kinematics, using a novel graph network with radial fields and interaction modules to enhance spatial-temporal dependencies and physical realism, achieving superior results on standard benchmarks.
Authors:Jiaxu Wan, Xu Wang, Mengwei Xie, Xinyuan Chang, Xinran Liu, Zheng Pan, Mu Xu, Ding Yuan
Abstract:
Autonomous vehicles rely on global standard-definition (SD) maps for road-level route planning and online local high-definition (HD) maps for lane-level navigation. However, recent work concentrates on construct online HD maps, often overlooking the association of global SD maps with online HD maps for hybrid navigation, making challenges in utilizing online HD maps in the real world. Observing the lack of the capability of autonomous vehicles in navigation, we introduce \textbf{O}nline \textbf{M}ap \textbf{A}ssociation, the first benchmark for the association of hybrid navigation-oriented online maps, which enhances the planning capabilities of autonomous vehicles. Based on existing datasets, the OMA contains 480k of roads and 260k of lane paths and provides the corresponding metrics to evaluate the performance of the model. Additionally, we propose a novel framework, named Map Association Transformer, as the baseline method, using path-aware attention and spatial attention mechanisms to enable the understanding of geometric and topological correspondences. The code and dataset can be accessed at https://github.com/WallelWan/OMA-MAT.
中文: 本文提出了首个面向混合导航的在线地图关联基准OMA,将全局标清地图与在线高清地图相结合,并引入基于路径和空间注意力机制的Map Association Transformer作为基线方法,以提升自动驾驶车辆的规划能力。
English: This paper introduces the Online Map Association (OMA) benchmark, the first to link global standard-definition maps with online high-definition maps for hybrid navigation in autonomous vehicles, and proposes the Map Association Transformer as a baseline method to enhance planning capabilities.
Authors:Hongzhi Zhang, Jia Fu, Jingyuan Zhang, Kai Fu, Qi Wang, Fuzheng Zhang, Guorui Zhou
Abstract:
Reinforcement learning (RL) for large language models is an energy-intensive endeavor: training can be unstable, and the policy may gradually drift away from its pretrained weights. We present \emph{RLEP}\, -- \,Reinforcement Learning with Experience rePlay\, -- \,a two-phase framework that first collects verified trajectories and then replays them during subsequent training. At every update step, the policy is optimized on mini-batches that blend newly generated rollouts with these replayed successes. By replaying high-quality examples, RLEP steers the model away from fruitless exploration, focuses learning on promising reasoning paths, and delivers both faster convergence and stronger final performance. On the Qwen2.5-Math-7B base model, RLEP reaches baseline peak accuracy with substantially fewer updates and ultimately surpasses it, improving accuracy on AIME-2024 from 38.2% to 39.9%, on AIME-2025 from 19.8% to 22.3%, and on AMC-2023 from 77.0% to 82.2%. Our code, datasets, and checkpoints are publicly available at https://github.com/Kwai-Klear/RLEP to facilitate reproducibility and further research.
中文: RLEP框架通过重放已验证的高质量轨迹,在大语言模型强化学习中提升训练稳定性与性能,使Qwen2.5-Math-7B模型在数学基准测试上实现更快收敛和显著精度提升。
English: RLEP is a two-phase reinforcement learning framework that enhances training stability and performance by replaying verified high-quality trajectories, achieving faster convergence and improved accuracy on mathematical benchmarks with the Qwen2.5-Math-7B model.
Authors:Korbinian Moller, Rafael Neher, Marvin Seegert, Johannes Betz
Abstract:
Ensuring the functional safety of motion planning modules in autonomous vehicles remains a critical challenge, especially when dealing with complex or learning-based software. Online verification has emerged as a promising approach to monitor such systems at runtime, yet its integration into embedded real-time environments remains limited. This work presents a safeguarding concept for motion planning that extends prior approaches by introducing a time safeguard. While existing methods focus on geometric and dynamic feasibility, our approach additionally monitors the temporal consistency of planning outputs to ensure timely system response. A prototypical implementation on a real-time operating system evaluates trajectory candidates using constraint-based feasibility checks and cost-based plausibility metrics. Preliminary results show that the safeguarding module operates within real-time bounds and effectively detects unsafe trajectories. However, the full integration of the time safeguard logic and fallback strategies is ongoing. This study contributes a modular and extensible framework for runtime trajectory verification and highlights key aspects for deployment on automotive-grade hardware. Future work includes completing the safeguarding logic and validating its effectiveness through hardware-in-the-loop simulations and vehicle-based testing. The code is available at: https://github.com/TUM-AVS/motion-planning-supervisor
本研究为自动驾驶车辆运动规划引入了时间保障机制,通过监控规划输出的时间一致性及几何可行性,在实时系统中有效识别危险轨迹,但完整集成仍在进行中。
This work introduces a time safeguard for autonomous vehicle motion planning that monitors temporal consistency alongside geometric feasibility, with real-time implementation showing effective detection of unsafe trajectories while integration remains ongoing.
Authors:Ling Zhou, Runtian Yuan, Yi Liu, Yuejie Zhang, Rui Feng, Shang Gao
Abstract:
Ultrasound imaging is a prevalent diagnostic tool known for its simplicity and non-invasiveness. However, its inherent characteristics often introduce substantial noise, posing considerable challenges for automated lesion or organ segmentation in ultrasound video sequences. To address these limitations, we propose the Dual Semantic-Aware Network (DSANet), a novel framework designed to enhance noise robustness in ultrasound video segmentation by fostering mutual semantic awareness between local and global features. Specifically, we introduce an Adjacent-Frame Semantic-Aware (AFSA) module, which constructs a channel-wise similarity matrix to guide feature fusion across adjacent frames, effectively mitigating the impact of random noise without relying on pixel-level relationships. Additionally, we propose a Local-and-Global Semantic-Aware (LGSA) module that reorganizes and fuses temporal unconditional local features, which capture spatial details independently at each frame, with conditional global features that incorporate temporal context from adjacent frames. This integration facilitates multi-level semantic representation, significantly improving the model's resilience to noise interference. Extensive evaluations on four benchmark datasets demonstrate that DSANet substantially outperforms state-of-the-art methods in segmentation accuracy. Moreover, since our model avoids pixel-level feature dependencies, it achieves significantly higher inference FPS than video-based methods, and even surpasses some image-based models. Code can be found in \href{https://github.com/ZhouL2001/DSANet}{DSANet}
中文: DSANet通过双语义感知模块增强超声视频分割的噪声鲁棒性,利用局部与全局特征融合在保持高精度的同时实现超越现有方法的推理速度。
English: DSANet introduces dual semantic-aware modules to enhance ultrasound video segmentation by improving noise robustness through local-global feature integration, achieving superior accuracy and faster inference than existing methods.
Authors:Nishit V. Pandya, Andrey Labunets, Sicun Gao, Earlence Fernandes
Abstract:
A popular class of defenses against prompt injection attacks on large language models (LLMs) relies on fine-tuning the model to separate instructions and data, so that the LLM does not follow instructions that might be present with data. There are several academic systems and production-level implementations of this idea. We evaluate the robustness of this class of prompt injection defenses in the whitebox setting by constructing strong optimization-based attacks and showing that the defenses do not provide the claimed security properties. Specifically, we construct a novel attention-based attack algorithm for text-based LLMs and apply it to two recent whitebox defenses SecAlign (CCS 2025) and StruQ (USENIX Security 2025), showing attacks with success rates of up to 70% with modest increase in attacker budget in terms of tokens. Our findings make fundamental progress towards understanding the robustness of prompt injection defenses in the whitebox setting. We release our code and attacks at https://github.com/nishitvp/better_opts_attacks
中文: 本研究通过开发基于优化的攻击方法评估大型语言模型中提示注入防御的鲁棒性,证明现有方法无法提供足够安全性,对近期防御措施的攻击成功率高达70%。
English: This study evaluates the robustness of prompt injection defenses in large language models by developing optimization-based attacks, demonstrating that existing methods fail to provide adequate security with success rates reaching 70% against recent defenses.
Authors:Yuntian Liu, Tao Zhu, Xiaoyang Liu, Yu Chen, Zhaoxuan Liu, Qingfeng Guo, Jiashuo Zhang, Kangjie Bao, Tao Luo
Abstract:
Statement autoformalization, the automated translation of statements from natural language into formal languages, has become a subject of extensive research, yet the development of robust automated evaluation metrics remains limited. Existing evaluation methods often lack semantic understanding, face challenges with high computational costs, and are constrained by the current progress of automated theorem proving. To address these issues, we propose GTED (Generalized Tree Edit Distance), a novel evaluation framework that first standardizes formal statements and converts them into operator trees, then determines the semantic similarity using the eponymous GTED metric. Across the miniF2F and ProofNet benchmarks, GTED consistently ranks as a top-performing metric, achieving the highest accuracy and Kappa on miniF2F and the joint-highest accuracy on ProofNet. This strong overall performance provides the community with a computationally lightweight and more faithful metric for automated evaluation. The code and experimental results are available at https://github.com/XiaoyangLiu-sjtu/GTED.
Chinese: 本文提出GTED这一新型评估框架,通过将形式化语句标准化为运算符树并测量语义相似性,解决了自动形式化评估中的现有局限,在基准测试中表现优异且计算效率高。
English: The paper introduces GTED, a novel evaluation framework that addresses limitations in autoformalization by standardizing formal statements into operator trees and measuring semantic similarity, achieving top performance on benchmarks while being computationally efficient.
Authors:Yongtang Bao, Chengjie Tang, Yuze Wang, Haojie Li
Abstract:
Reconstructing and segmenting scenes from unconstrained photo collections obtained from the Internet is a novel but challenging task. Unconstrained photo collections are easier to get than well-captured photo collections. These unconstrained images suffer from inconsistent lighting and transient occlusions, which makes segmentation challenging. Previous segmentation methods cannot address transient occlusions or accurately restore the scene's lighting conditions. Therefore, we propose Seg-Wild, an interactive segmentation method based on 3D Gaussian Splatting for unconstrained image collections, suitable for in-the-wild scenes. We integrate multi-dimensional feature embeddings for each 3D Gaussian and calculate the feature similarity between the feature embeddings and the segmentation target to achieve interactive segmentation in the 3D scene. Additionally, we introduce the Spiky 3D Gaussian Cutter (SGC) to smooth abnormal 3D Gaussians. We project the 3D Gaussians onto a 2D plane and calculate the ratio of 3D Gaussians that need to be cut using the SAM mask. We also designed a benchmark to evaluate segmentation quality in in-the-wild scenes. Experimental results demonstrate that compared to previous methods, Seg-Wild achieves better segmentation results and reconstruction quality. Our code will be available at https://github.com/Sugar0725/Seg-Wild.
中文:Seg-Wild是一种基于3D高斯抛雪的交互式分割方法,通过多维特征嵌入和尖峰3D高斯切割器处理无约束照片集中的光照不一致和瞬时遮挡问题,在野外场景中实现了更优的分割效果和重建质量。
English: Seg-Wild is an interactive segmentation method based on 3D Gaussian Splatting that addresses inconsistent lighting and transient occlusions in unconstrained photo collections, achieving superior segmentation and reconstruction quality through multi-dimensional feature embeddings and a Spiky 3D Gaussian Cutter.
Authors:Haotian Wang, Aoran Xiao, Xiaoqin Zhang, Meng Yang, Shijian Lu
Abstract:
Generalizable depth completion enables the acquisition of dense metric depth maps for unseen environments, offering robust perception capabilities for various downstream tasks. However, training such models typically requires large-scale datasets with metric depth labels, which are often labor-intensive to collect. This paper presents PacGDC, a label-efficient technique that enhances data diversity with minimal annotation effort for generalizable depth completion. PacGDC builds on novel insights into inherent ambiguities and consistencies in object shapes and positions during 2D-to-3D projection, allowing the synthesis of numerous pseudo geometries for the same visual scene. This process greatly broadens available geometries by manipulating scene scales of the corresponding depth maps. To leverage this property, we propose a new data synthesis pipeline that uses multiple depth foundation models as scale manipulators. These models robustly provide pseudo depth labels with varied scene scales, affecting both local objects and global layouts, while ensuring projection consistency that supports generalization. To further diversify geometries, we incorporate interpolation and relocation strategies, as well as unlabeled images, extending the data coverage beyond the individual use of foundation models. Extensive experiments show that PacGDC achieves remarkable generalizability across multiple benchmarks, excelling in diverse scene semantics/scales and depth sparsity/patterns under both zero-shot and few-shot settings. Code: https://github.com/Wang-xjtu/PacGDC.
中文: PacGDC是一种标签高效技术,通过场景尺度操纵合成伪几何并利用深度基础模型,以最少标注工作增强泛化深度补全的数据多样性,在多种基准测试中展现出卓越性能。
English: PacGDC is a label-efficient technique that enhances data diversity for generalizable depth completion by synthesizing pseudo geometries through scene scale manipulation and leveraging depth foundation models, achieving strong performance across diverse benchmarks with minimal annotation effort.
Authors:Toby Handfield, Kevin Zollman
Abstract:
Scientific authorship norms vary dramatically across disciplines, from contribution-sensitive systems where first author is the greatest contributor and subsequent author order reflects relative input, to contribution-insensitive conventions like alphabetical ordering or senior-author-last. We develop evolutionary game-theoretic models to examine both how these divergent norms emerge and their subsequent effects on collaborative behavior. Our first model reveals that contribution-insensitive norms evolve when researchers who sacrifice positional advantage face the strongest adaptive pressure -- for example senior authors managing larger collaboration portfolios or bearing heavier reputational stakes. This "Red King" dynamic potentially explains why fields in which senior researchers command large labs, major grants, and extensive collaboration portfolios may paradoxically evolve conventions that favour junior-author positioning. Our second model demonstrates that established norms influence researchers' willingness to collaborate, with contribution-sensitive norms consistently outperforming insensitive alternatives in fostering successful partnerships. Contribution-insensitive norms create systematic coordination failures through two mechanisms: "main contributor resentment" when exceptional work goes unrecognized, and "second contributor resentment" when comparable efforts receive unequal credit. These findings suggest that widely adopted practices like senior-last positioning and alphabetical ordering may function as institutional frictions that impede valuable scientific collaborations rather than neutral organizational conventions, potentially reducing overall scientific productivity across affected disciplines.
中文摘要:本研究运用演化博弈论揭示,当资深研究者面临高强度适应压力时会产生贡献不敏感的作者署名规范,但这些规范会因贡献分配不公阻碍合作,而贡献敏感体系则能更有效地促进科研合作。
English Summary: The study uses evolutionary game theory to show that contribution-insensitive authorship norms emerge when senior researchers face high adaptive pressure, but these norms hinder collaboration by causing credit allocation issues, whereas contribution-sensitive systems foster more productive partnerships.
Authors:Sherry X. Chen, Yi Wei, Luowei Zhou, Suren Kumar
Abstract:
Recent advances in instruction-guided image editing underscore the need for effective automated evaluation. While Vision-Language Models (VLMs) have been explored as judges, open-source models struggle with alignment, and proprietary models lack transparency and cost efficiency. Additionally, no public training datasets exist to fine-tune open-source VLMs, only small benchmarks with diverse evaluation schemes. To address this, we introduce ADIEE, an automated dataset creation approach which is then used to train a scoring model for instruction-guided image editing evaluation. We generate a large-scale dataset with over 100K samples and use it to fine-tune a LLaVA-NeXT-8B model modified to decode a numeric score from a custom token. The resulting scorer outperforms all open-source VLMs and Gemini-Pro 1.5 across all benchmarks, achieving a 0.0696 (+17.24%) gain in score correlation with human ratings on AURORA-Bench, and improving pair-wise comparison accuracy by 4.03% (+7.21%) on GenAI-Bench and 4.75% (+9.35%) on AURORA-Bench, respectively, compared to the state-of-the-art. The scorer can act as a reward model, enabling automated best edit selection and model fine-tuning. Notably, the proposed scorer can boost MagicBrush model's average evaluation score on ImagenHub from 5.90 to 6.43 (+8.98%). Our code and models are available at https://github.com/SherryXTChen/ADIEE.git.
Chinese Summary: 本文提出ADIEE方法,通过自动生成大规模数据集训练图像编辑评估模型,在多项基准测试中超越现有开源和专有模型,显著提升了与人类评分的相关性及比较准确率。
English Summary: This paper introduces ADIEE, an automated dataset creation method that trains a scoring model to evaluate instruction-guided image editing, achieving superior performance over existing models with significant improvements in human rating correlation and comparison accuracy.
Authors:Yichen Lu, Wei Dai, Jiaen Liu, Ching Wing Kwok, Zongheng Wu, Xudong Xiao, Ao Sun, Sheng Fu, Jianyuan Zhan, Yian Wang, Takatomo Saito, Sicheng Lai
Abstract:
LLM-based translation agents have achieved highly human-like translation results and are capable of handling longer and more complex contexts with greater efficiency. However, they are typically limited to text-only inputs. In this paper, we introduce ViDove, a translation agent system designed for multimodal input. Inspired by the workflow of human translators, ViDove leverages visual and contextual background information to enhance the translation process. Additionally, we integrate a multimodal memory system and long-short term memory modules enriched with domain-specific knowledge, enabling the agent to perform more accurately and adaptively in real-world scenarios. As a result, ViDove achieves significantly higher translation quality in both subtitle generation and general translation tasks, with a 28% improvement in BLEU scores and a 15% improvement in SubER compared to previous state-of-the-art baselines. Moreover, we introduce DoveBench, a new benchmark for long-form automatic video subtitling and translation, featuring 17 hours of high-quality, human-annotated data. Our code is available here: https://github.com/pigeonai-org/ViDove
中文: ViDove是一种多模态翻译系统,通过结合视觉与上下文信息显著提升翻译质量,在BLEU和SubER指标上分别比现有最优方法提高28%和15%,并推出了长视频翻译评估基准DoveBench。
English: ViDove is a multimodal translation agent that integrates visual and contextual information, significantly improving translation quality by 28% in BLEU scores and 15% in SubER over previous methods, while introducing DoveBench for long-form video translation evaluation.
Authors:Andrew Fan, Simon D. Levy
Abstract:
As the demand for compute power in traditional neural networks has increased significantly, spiking neural networks (SNNs) have emerged as a potential solution to increasingly power-hungry neural networks. By operating on 0/1 spikes emitted by neurons instead of arithmetic multiply-and-accumulate operations, SNNs propagate information temporally and spatially, allowing for more efficient compute power. To this end, many architectures for accelerating and simulating SNNs have been developed, including Loihi, TrueNorth, and SpiNNaker. However, these chips are largely inaccessible to the wider community. Field programmable gate arrays (FPGAs) have been explored to serve as a middle ground between neuromorphic and non-neuromorphic hardware, but many proposed architectures require expensive high-end FPGAs or target a single SNN topology. This paper presents a framework consisting of a robust SNN acceleration architecture and a Pytorch-based SNN model compiler. Targeting any-to-any and/or fully connected SNNs, the FPGA architecture features a synaptic array that tiles across the SNN to propagate spikes. The architecture targets low-end FPGAs and requires very little (6358 LUT, 40.5 BRAM) resources. The framework, tested on a low-end Xilinx Artix-7 FPGA at 100 MHz, achieves competitive speed in recognizing MNIST digits (0.52 ms/img). Further experiments also show accurate simulation of hand coded any-to-any spiking neural networks on toy problems. All code and setup instructions are available at https://github.com/im-afan/snn-fpga}{\texttt{https://github.com/im-afan/snn-fpga.
Chinese: 本文提出了一种结合高效脉冲神经网络加速架构和基于PyTorch编译器的框架,针对低端FPGA设计,在MNIST手写数字识别等任务中以极少资源实现优越性能。
English: This paper introduces a framework with an efficient SNN acceleration architecture and a PyTorch-based compiler, designed for low-end FPGAs to achieve competitive performance in tasks like MNIST digit recognition with minimal resource usage.
Authors:Licong Xu, Milind Sarkar, Anto I. Lonappan, Ãñigo Zubeldia, Pablo Villanueva-Domingo, Santiago Casas, Christian Fidler, Chetana Amancharla, Ujjwal Tiwari, Adrian Bayer, Chadi Ait Ekioui, Miles Cranmer, Adrian Dimitrov, James Fergusson, Kahaan Gandhi, Sven Krippendorf, Andrew Laverick, Julien Lesgourgues, Antony Lewis, Thomas Meier, Blake Sherwin, Kristen Surrao, Francisco Villaescusa-Navarro, Chi Wang, Xueqing Xu, Boris Bolliet
Abstract:
We present a multi-agent system for automation of scientific research tasks, cmbagent (https://github.com/CMBAgents/cmbagent). The system is formed by about 30 Large Language Model (LLM) agents and implements a Planning & Control strategy to orchestrate the agentic workflow, with no human-in-the-loop at any point. Each agent specializes in a different task (performing retrieval on scientific papers and codebases, writing code, interpreting results, critiquing the output of other agents) and the system is able to execute code locally. We successfully apply cmbagent to carry out a PhD level cosmology task (the measurement of cosmological parameters using supernova data) and evaluate its performance on two benchmark sets, finding superior performance over state-of-the-art LLMs. The source code is available on GitHub, demonstration videos are also available, and the system is deployed on HuggingFace and will be available on the cloud.
中文: 我们推出cmbagent,这是一个由约30个专业大语言模型代理组成的全自动多智能体系统,能够协作执行科研任务,成功完成博士级别的宇宙学研究且性能优于顶尖大语言模型。
English: We introduce cmbagent, a fully autonomous multi-agent system with approximately 30 specialized LLM agents that collaboratively execute scientific research tasks, successfully completing a PhD-level cosmology study with superior performance to leading LLMs.
Authors:Heet Nitinkumar Dalsania
Abstract:
Modern deep learning implementations for medical imaging usually rely on large labeled datasets. These datasets are often difficult to obtain due to privacy concerns, high costs, and even scarcity of cases. In this paper, a label-efficient strategy is proposed for chest X-ray diagnosis that seeks to reflect real-world hospital scenarios. The experiments use the NIH Chest X-ray14 dataset and a pre-trained CLIP ViT-B/32 model. The model is adapted via partial fine-tuning of its visual encoder and then evaluated using zero-shot and few-shot learning with 1-16 labeled examples per disease class. The tests demonstrate that CLIP's pre-trained vision-language features can be effectively adapted to few-shot medical imaging tasks, achieving over 20\% improvement in mean AUC score as compared to the zero-shot baseline. The key aspect of this work is to attempt to simulate internal hospital workflows, where image archives exist but annotations are sparse. This work evaluates a practical and scalable solution for both common and rare disease diagnosis. Additionally this research is intended for academic and experimental purposes only and has not been peer reviewed yet. All code is found at https://github.com/heet007-code/CLIP-disease-xray.
中文: 本文提出了一种标签高效策略,利用预训练的CLIP模型进行胸部X光诊断,通过少样本学习在医学影像数据稀缺的情况下,实现了平均AUC分数超过20%的提升。
English: This paper introduces a label-efficient strategy using a pre-trained CLIP model for chest X-ray diagnosis, achieving over 20% improvement in mean AUC score through few-shot learning to address data scarcity in medical imaging.
Authors:Maya Kruse, Majid Afshar, Saksham Khatwani, Anoop Mayampurath, Guanhua Chen, Yanjun Gao
Abstract:
Large language models (LLMs) often behave inconsistently across inputs, indicating uncertainty and motivating the need for its quantification in high-stakes settings. Prior work on calibration and uncertainty quantification often focuses on individual models, overlooking the potential of model diversity. We hypothesize that LLMs make complementary predictions due to differences in training and the Zipfian nature of language, and that aggregating their outputs leads to more reliable uncertainty estimates. To leverage this, we propose MUSE (Multi-LLM Uncertainty via Subset Ensembles), a simple information-theoretic method that uses Jensen-Shannon Divergence to identify and aggregate well-calibrated subsets of LLMs. Experiments on binary prediction tasks demonstrate improved calibration and predictive performance compared to single-model and naïve ensemble baselines. In addition, we explore using MUSE as guided signals with chain-of-thought distillation to fine-tune LLMs for calibration. MUSE is available at:https://github.com/LARK-NLP-Lab/MUSE.
Chinese Summary: 本研究提出MUSE方法,通过利用模型多样性来识别和整合校准良好的子集,从而改进大语言模型的不确定性量化,显著提升了校准效果和预测性能。
English Summary: The study introduces MUSE, a method that leverages model diversity to improve uncertainty quantification in large language models by identifying and aggregating well-calibrated subsets, resulting in enhanced calibration and predictive performance.
Authors:Priyank Pathak, Yogesh S. Rawat
Abstract:
Clothes-Changing Re-Identification (CC-ReID) aims to recognize individuals across different locations and times, irrespective of clothing. Existing methods often rely on additional models or annotations to learn robust, clothing-invariant features, making them resource-intensive. In contrast, we explore the use of color - specifically foreground and background colors - as a lightweight, annotation-free proxy for mitigating appearance bias in ReID models. We propose Colors See, Colors Ignore (CSCI), an RGB-only method that leverages color information directly from raw images or video frames. CSCI efficiently captures color-related appearance bias ('Color See') while disentangling it from identity-relevant ReID features ('Color Ignore'). To achieve this, we introduce S2A self-attention, a novel self-attention to prevent information leak between color and identity cues within the feature space. Our analysis shows a strong correspondence between learned color embeddings and clothing attributes, validating color as an effective proxy when explicit clothing labels are unavailable. We demonstrate the effectiveness of CSCI on both image and video ReID with extensive experiments on four CC-ReID datasets. We improve the baseline by Top-1 2.9% on LTCC and 5.0% on PRCC for image-based ReID, and 1.0% on CCVID and 2.5% on MeVID for video-based ReID without relying on additional supervision. Our results highlight the potential of color as a cost-effective solution for addressing appearance bias in CC-ReID. Github: https://github.com/ppriyank/ICCV-CSCI-Person-ReID.
中文: 提出的CSCI方法利用图像颜色信息有效缓解换衣重识别中的外观偏差,无需额外标注即可在多个数据集上实现显著性能提升。
English: The proposed CSCI method utilizes color information from images to effectively mitigate appearance bias in clothes-changing re-identification without requiring additional annotations, achieving significant performance improvements across multiple datasets.
Authors:Florian Redhardt, Yassir Akram, Simon Schug
Abstract:
Can neural networks systematically capture discrete, compositional task structure despite their continuous, distributed nature? The impressive capabilities of large-scale neural networks suggest that the answer to this question is yes. However, even for the most capable models, there are still frequent failure cases that raise doubts about their compositionality. Here, we seek to understand what it takes for a standard neural network to generalize over tasks that share compositional structure. We find that simply scaling data and model size leads to compositional generalization. We show that this holds across different task encodings as long as the training distribution sufficiently covers the task space. In line with this finding, we prove that standard multilayer perceptrons can approximate a general class of compositional task families to arbitrary precision using only a linear number of neurons with respect to the number of task modules. Finally, we uncover that if networks successfully compositionally generalize, the constituents of a task can be linearly decoded from their hidden activations. We show that this metric correlates with failures of text-to-image generation models to compose known concepts.
中文: 通过扩大数据和模型规模,神经网络能够实现组合泛化,不仅能近似复杂的任务结构,还能从隐藏激活中线性解码任务成分,这一点在文本到图像生成模型中得到了验证。
English: Scaling data and model size enables neural networks to achieve compositional generalization, allowing them to approximate complex task structures and decode task components from hidden activations, as demonstrated in text-to-image generation models.
Authors:Xueqing Xu, Boris Bolliet, Adrian Dimitrov, Andrew Laverick, Francisco Villaescusa-Navarro, Licong Xu, Ãñigo Zubeldia
Abstract:
We evaluate 9 Retrieval Augmented Generation (RAG) agent configurations on 105 Cosmology Question-Answer (QA) pairs that we built specifically for this purpose.The RAG configurations are manually evaluated by a human expert, that is, a total of 945 generated answers were assessed. We find that currently the best RAG agent configuration is with OpenAI embedding and generative model, yielding 91.4\% accuracy. Using our human evaluation results we calibrate LLM-as-a-Judge (LLMaaJ) system which can be used as a robust proxy for human evaluation. These results allow us to systematically select the best RAG agent configuration for multi-agent system for autonomous scientific discovery in astrophysics (e.g., cmbagent presented in a companion paper) and provide us with an LLMaaJ system that can be scaled to thousands of cosmology QA pairs. We make our QA dataset, human evaluation results, RAG pipelines, and LLMaaJ system publicly available for further use by the astrophysics community.
中文: 本研究评估了九种RAG智能体在定制宇宙学问答数据集上的表现,确定OpenAI模型以91.4%准确率最优,并开发了经过校准的大语言模型评审系统,为天体物理学研究提供可扩展的评估方案与最优智能体选择。
English: This study evaluates nine RAG agent configurations on a custom cosmology QA dataset, identifying OpenAI's model as the top performer with 91.4% accuracy and developing a calibrated LLM-as-a-Judge system to enable scalable evaluation and optimal agent selection for astrophysics research.
Authors:Hongyi Xie, Min Zhou, Qiao Yu, Jialiang Yu, Zhenli Sheng, Hong Xie, Defu Lian
Abstract:
As cloud services become increasingly integral to modern IT infrastructure, ensuring hardware reliability is essential to sustain high-quality service. Memory failures pose a significant threat to overall system stability, making accurate failure prediction through the analysis of memory error logs (i.e., Correctable Errors) imperative. Existing memory failure prediction approaches have notable limitations: rule-based expert models suffer from limited generalizability and low recall rates, while automated feature extraction methods exhibit suboptimal performance. To address these limitations, we propose M$^2$-MFP: a Multi-scale and hierarchical memory failure prediction framework designed to enhance the reliability and availability of cloud infrastructure. M$^2$-MFP converts Correctable Errors (CEs) into multi-level binary matrix representations and introduces a Binary Spatial Feature Extractor (BSFE) to automatically extract high-order features at both DIMM-level and bit-level. Building upon the BSFE outputs, we develop a dual-path temporal modeling architecture: 1) a time-patch module that aggregates multi-level features within observation windows, and 2) a time-point module that employs interpretable rule-generation trees trained on bit-level patterns. Experiments on both benchmark datasets and real-world deployment show the superiority of M$^2$-MFP as it outperforms existing state-of-the-art methods by significant margins. Code and data are available at this repository: https://github.com/hwcloud-RAS/M2-MFP.
Chinese: 提出的M²-MFP框架通过多尺度二进制矩阵表示和双路径时序架构来预测内存故障,从而提升云基础设施的可靠性,实验证明其性能显著优于现有方法。
English: The proposed M²-MFP framework enhances cloud infrastructure reliability by using multi-scale binary matrix representations and a dual-path temporal architecture to predict memory failures, significantly outperforming existing methods in experiments.
Authors:Renyang Liu, Guanlin Li, Tianwei Zhang, See-Kiong Ng
Abstract:
Recent advances in image generation models (IGMs), particularly diffusion-based architectures such as Stable Diffusion (SD), have markedly enhanced the quality and diversity of AI-generated visual content. However, their generative capability has also raised significant ethical, legal, and societal concerns, including the potential to produce harmful, misleading, or copyright-infringing content. To mitigate these concerns, machine unlearning (MU) emerges as a promising solution by selectively removing undesirable concepts from pretrained models. Nevertheless, the robustness and effectiveness of existing unlearning techniques remain largely unexplored, particularly in the presence of multi-modal adversarial inputs.
To bridge this gap, we propose Recall, a novel adversarial framework explicitly designed to compromise the robustness of unlearned IGMs. Unlike existing approaches that predominantly rely on adversarial text prompts, Recall exploits the intrinsic multi-modal conditioning capabilities of diffusion models by efficiently optimizing adversarial image prompts with guidance from a single semantically relevant reference image. Extensive experiments across ten state-of-the-art unlearning methods and diverse tasks show that Recall consistently outperforms existing baselines in terms of adversarial effectiveness, computational efficiency, and semantic fidelity with the original textual prompt. These findings reveal critical vulnerabilities in current unlearning mechanisms and underscore the need for more robust solutions to ensure the safety and reliability of generative models. Code and data are publicly available at \textcolor{blue}{https://github.com/ryliu68/RECALL}.
中文摘要:图像生成模型的进步引发了伦理担忧,机器学习遗忘成为解决方案,但提出的Recall框架通过对抗性图像提示揭示了现有遗忘方法的脆弱性,有效削弱了其鲁棒性。
English Summary: Recent advances in image generation models raise ethical concerns, prompting machine unlearning as a solution, but the proposed Recall framework exposes vulnerabilities in current unlearning methods by using adversarial image prompts to compromise their robustness effectively.
Authors:Xinglong Liang, Jiaju Huang, Luyi Han, Tianyu Zhang, Xin Wang, Yuan Gao, Chunyao Lu, Lishan Cai, Tao Tan, Ritse Mann
Abstract:
PET-CT lesion segmentation is challenging due to noise sensitivity, small and variable lesion morphology, and interference from physiological high-metabolic signals. Current mainstream approaches follow the practice of one network solving the segmentation of multiple cancer lesions by treating all cancers as a single task. However, this overlooks the unique characteristics of different cancer types. Considering the specificity and similarity of different cancers in terms of metastatic patterns, organ preferences, and FDG uptake intensity, we propose DpDNet, a Dual-Prompt-Driven network that incorporates specific prompts to capture cancer-specific features and common prompts to retain shared knowledge. Additionally, to mitigate information forgetting caused by the early introduction of prompts, prompt-aware heads are employed after the decoder to adaptively handle multiple segmentation tasks. Experiments on a PET-CT dataset with four cancer types show that DpDNet outperforms state-of-the-art models. Finally, based on the segmentation results, we calculated MTV, TLG, and SUVmax for breast cancer survival analysis. The results suggest that DpDNet has the potential to serve as a valuable tool for personalized risk stratification, supporting clinicians in optimizing treatment strategies and improving outcomes. Code is available at https://github.com/XinglongLiang08/DpDNet.
中文: DpDNet通过双提示驱动网络,结合特定提示捕捉癌症特征与通用提示保留共享知识,在PET-CT病灶分割中优于现有模型,并具备个性化风险分层的临床应用潜力。
English: DpDNet, a dual-prompt-driven network, effectively segments PET-CT lesions by capturing cancer-specific features and shared knowledge, outperforming existing models and demonstrating potential for personalized risk stratification in clinical applications.
Authors:Cristina Mata, Kanchana Ranasinghe, Michael S. Ryoo
Abstract:
Unsupervised domain adaptation (UDA) involves learning class semantics from labeled data within a source domain that generalize to an unseen target domain. UDA methods are particularly impactful for semantic segmentation, where annotations are more difficult to collect than in image classification. Despite recent advances in large-scale vision-language representation learning, UDA methods for segmentation have not taken advantage of the domain-agnostic properties of text. To address this, we present a novel Covariance-based Pixel-Text loss, CoPT, that uses domain-agnostic text embeddings to learn domain-invariant features in an image segmentation encoder. The text embeddings are generated through our LLM Domain Template process, where an LLM is used to generate source and target domain descriptions that are fed to a frozen CLIP model and combined. In experiments on four benchmarks we show that a model trained using CoPT achieves the new state of the art performance on UDA for segmentation. The code can be found at https://github.com/cfmata/CoPT.
Chinese: 本文提出CoPT,一种基于协方差的像素-文本损失方法,利用大语言模型生成的领域无关文本嵌入,在语义分割的无监督领域自适应任务中实现了最先进的性能。
English: This paper introduces CoPT, a novel covariance-based pixel-text loss that leverages domain-agnostic text embeddings from LLM-generated descriptions to achieve state-of-the-art performance in unsupervised domain adaptation for semantic segmentation.
Authors:Zhiwei Hu, VÃctor Gutiérrez-Basulto, Zhiliang Xiang, Ru Li, Jeff Z. Pan
Abstract:
Multimodal Entity Linking (MEL) aims to link ambiguous mentions within multimodal contexts to associated entities in a multimodal knowledge base. Existing approaches to MEL introduce multimodal interaction and fusion mechanisms to bridge the modality gap and enable multi-grained semantic matching. However, they do not address two important problems: (i) mention ambiguity, i.e., the lack of semantic content caused by the brevity and omission of key information in the mention's textual context; (ii) dynamic selection of modal content, i.e., to dynamically distinguish the importance of different parts of modal information. To mitigate these issues, we propose a Multi-level Mixture of Experts (MMoE) model for MEL. MMoE has four components: (i) the description-aware mention enhancement module leverages large language models to identify the WikiData descriptions that best match a mention, considering the mention's textual context; (ii) the multimodal feature extraction module adopts multimodal feature encoders to obtain textual and visual embeddings for both mentions and entities; (iii)-(iv) the intra-level mixture of experts and inter-level mixture of experts modules apply a switch mixture of experts mechanism to dynamically and adaptively select features from relevant regions of information. Extensive experiments demonstrate the outstanding performance of MMoE compared to the state-of-the-art. MMoE's code is available at: https://github.com/zhiweihu1103/MEL-MMoE.
中文摘要:本研究提出的多级专家混合模型通过利用大型语言模型和自适应特征选择机制,解决了多模态实体链接中的指称歧义和模态内容动态选择问题,实验证明其性能优于现有先进方法。
English Summary: The proposed Multi-level Mixture of Experts (MMoE) model addresses mention ambiguity and dynamic modality selection in Multimodal Entity Linking by leveraging large language models and adaptive feature selection mechanisms, demonstrating superior performance over existing methods.
Authors:Yimin Du
Abstract:
This paper presents a comprehensive machine learning framework for quantitative trading that achieves superior risk-adjusted returns through systematic factor engineering, real-time computation optimization, and cross-sectional portfolio construction. Our approach integrates multi-factor alpha discovery with bias correction techniques, leveraging PyTorch-accelerated factor computation and advanced portfolio optimization. The system processes 500-1000 factors derived from open-source alpha101 extensions and proprietary market microstructure signals. Key innovations include tensor-based factor computation acceleration, geometric Brownian motion data augmentation, and cross-sectional neutralization strategies. Empirical validation on Chinese A-share markets (2010-2024) demonstrates annualized returns of $20\%$ with Sharpe ratios exceeding 2.0, significantly outperforming traditional approaches. Our analysis reveals the critical importance of bias correction in factor construction and the substantial impact of cross-sectional portfolio optimization on strategy performance. Code and experimental implementations are available at: https://github.com/initial-d/ml-quant-trading
中文摘要:本文提出了一种量化交易的机器学习框架,通过系统性因子工程和投资组合优化,在中国A股市场实现了20%年化收益率和超过2.0的夏普比率,显著优于传统方法。
English Summary: This paper introduces a machine learning framework for quantitative trading that enhances risk-adjusted returns through advanced factor engineering and portfolio optimization, achieving a 20% annualized return with Sharpe ratios over 2.0 in Chinese A-share markets.
Authors:Arnas Uselis, Andrea Dittadi, Seong Joon Oh
Abstract:
Compositional understanding is crucial for human intelligence, yet it remains unclear whether contemporary vision models exhibit it. The dominant machine learning paradigm is built on the premise that scaling data and model sizes will improve out-of-distribution performance, including compositional generalization. We test this premise through controlled experiments that systematically vary data scale, concept diversity, and combination coverage. We find that compositional generalization is driven by data diversity, not mere data scale. Increased combinatorial coverage forces models to discover a linearly factored representational structure, where concepts decompose into additive components. We prove this structure is key to efficiency, enabling perfect generalization from few observed combinations. Evaluating pretrained models (DINO, CLIP), we find above-random yet imperfect performance, suggesting partial presence of this structure. Our work motivates stronger emphasis on constructing diverse datasets for compositional generalization, and considering the importance of representational structure that enables efficient compositional learning. Code available at https://github.com/oshapio/visual-compositional-generalization.
中文摘要:视觉模型的组合泛化能力主要取决于数据多样性而非数据规模,需要通过线性分解的表征结构将概念拆解为可加性成分,从而实现高效学习。
English Summary: Compositional generalization in vision models is driven by data diversity rather than data scale, requiring linearly factored representations that decompose concepts into additive components for efficient learning.
Authors:Martin Marek, Sanae Lotfi, Aditya Somasundaram, Andrew Gordon Wilson, Micah Goldblum
Abstract:
Conventional wisdom dictates that small batch sizes make language model pretraining and fine-tuning unstable, motivating gradient accumulation, which trades off the number of optimizer steps for a proportional increase in batch size. While it is common to decrease the learning rate for smaller batch sizes, other hyperparameters are often held fixed. In this work, we revisit small batch sizes all the way down to batch size one, and we propose a rule for scaling Adam hyperparameters to small batch sizes. In particular, rather than holding the decay rate of the second moment fixed across batch sizes, we propose to hold its half-life fixed in terms of tokens. We find that small batch sizes (1) train stably, (2) are consistently more robust to hyperparameter choices, (3) achieve equal or better per-FLOP performance than larger batch sizes, and (4) notably enable stable language model training with vanilla SGD, even without momentum, despite storing no optimizer state. Building on these results, we provide practical recommendations for selecting a batch size and setting optimizer hyperparameters. We further recommend against gradient accumulation unless training on multiple devices with multiple model replicas. Finally, we show that a small batch size combined with an optimizer with a small state size can provide the performance benefits of full fine-tuning while maintaining a similar memory footprint to LoRA.
中文: 本研究表明,通过调整Adam超参数以保持基于令牌的第二矩衰减半衰期固定,小批量(包括批量大小为一)能够实现稳定高效的语言模型训练,相比大批量具有更好的鲁棒性、性能和内存效率。
English: This study demonstrates that small batch sizes, including batch size one, can achieve stable and efficient language model training by adjusting Adam hyperparameters to maintain a fixed token-based half-life for the second moment decay, offering improved robustness, performance, and memory efficiency compared to larger batches.
Authors:Ke Fan, Shunlin Lu, Minyue Dai, Runyi Yu, Lixing Xiao, Zhiyang Dou, Junting Dong, Lizhuang Ma, Jingbo Wang
Abstract:
Generating diverse and natural human motion sequences based on textual descriptions constitutes a fundamental and challenging research area within the domains of computer vision, graphics, and robotics. Despite significant advancements in this field, current methodologies often face challenges regarding zero-shot generalization capabilities, largely attributable to the limited size of training datasets. Moreover, the lack of a comprehensive evaluation framework impedes the advancement of this task by failing to identify directions for improvement. In this work, we aim to push text-to-motion into a new era, that is, to achieve the generalization ability of zero-shot. To this end, firstly, we develop an efficient annotation pipeline and introduce MotionMillion-the largest human motion dataset to date, featuring over 2,000 hours and 2 million high-quality motion sequences. Additionally, we propose MotionMillion-Eval, the most comprehensive benchmark for evaluating zero-shot motion generation. Leveraging a scalable architecture, we scale our model to 7B parameters and validate its performance on MotionMillion-Eval. Our results demonstrate strong generalization to out-of-domain and complex compositional motions, marking a significant step toward zero-shot human motion generation. The code is available at https://github.com/VankouF/MotionMillion-Codes.
中文: 本研究推出了迄今最大的包含200万序列的MotionMillion人体运动数据集和综合评估基准,通过70亿参数模型实现了零样本文本生成运动的强泛化能力。
English: This research introduces MotionMillion, the largest human motion dataset with over 2 million sequences, and a comprehensive evaluation benchmark to advance zero-shot text-to-motion generation, achieving strong generalization with a 7B-parameter model.
Authors:Shanle Zheng, Keqin Bao, Jizhi Zhang, Yang Zhang, Fuli Feng, Xiangnan He
Abstract:
LLM-based recommender systems have made significant progress; however, the deployment cost associated with the large parameter volume of LLMs still hinders their real-world applications. This work explores parameter pruning to improve parameter efficiency while maintaining recommendation quality, thereby enabling easier deployment. Unlike existing approaches that focus primarily on inter-layer redundancy, we uncover intra-layer redundancy within components such as self-attention and MLP modules. Building on this analysis, we propose a more fine-grained pruning approach that integrates both intra-layer and layer-wise pruning. Specifically, we introduce a three-stage pruning strategy that progressively prunes parameters at different levels and parts of the model, moving from intra-layer to layer-wise pruning, or from width to depth. Each stage also includes a performance restoration step using distillation techniques, helping to strike a balance between performance and parameter efficiency. Empirical results demonstrate the effectiveness of our approach: across three datasets, our models achieve an average of 88% of the original model's performance while pruning more than 95% of the non-embedding parameters. This underscores the potential of our method to significantly reduce resource requirements without greatly compromising recommendation quality. Our code will be available at: https://github.com/zheng-sl/PruneRec
中文: 本研究提出一种细粒度三阶段剪枝策略,通过层内与层级剪枝结合蒸馏恢复技术,在保持88%性能的同时削减超95%非嵌入参数,显著提升基于大语言模型的推荐系统部署效率。
English: This study introduces a fine-grained three-stage pruning strategy that reduces over 95% of non-embedding parameters while maintaining 88% performance, enabling efficient deployment of LLM-based recommender systems through intra-layer and layer-wise pruning with distillation recovery.
Authors:Hui Li, Pengfei Yang, Juanyang Chen, Le Dong, Yanxin Chen, Quan Wang
Abstract:
Knowledge distillation as an efficient knowledge transfer technique, has achieved remarkable success in unimodal scenarios. However, in cross-modal settings, conventional distillation methods encounter significant challenges due to data and statistical heterogeneities, failing to leverage the complementary prior knowledge embedded in cross-modal teacher models. This paper empirically reveals two critical issues in existing approaches: distillation path selection and knowledge drift. To address these limitations, we propose MST-Distill, a novel cross-modal knowledge distillation framework featuring a mixture of specialized teachers. Our approach employs a diverse ensemble of teacher models across both cross-modal and multimodal configurations, integrated with an instance-level routing network that facilitates adaptive and dynamic distillation. This architecture effectively transcends the constraints of traditional methods that rely on monotonous and static teacher models. Additionally, we introduce a plug-in masking module, independently trained to suppress modality-specific discrepancies and reconstruct teacher representations, thereby mitigating knowledge drift and enhancing transfer effectiveness. Extensive experiments across five diverse multimodal datasets, spanning visual, audio, and text, demonstrate that our method significantly outperforms existing state-of-the-art knowledge distillation methods in cross-modal distillation tasks. The source code is available at https://github.com/Gray-OREO/MST-Distill.
Chinese: 本文提出MST-Distill跨模态知识蒸馏框架,通过混合专家教师模型和自适应路由网络解决蒸馏路径选择与知识漂移问题,在多种跨模态数据集上显著优于现有最优方法。
English: This paper introduces MST-Distill, a cross-modal knowledge distillation framework that utilizes a mixture of specialized teachers and an adaptive routing network to overcome limitations like distillation path selection and knowledge drift, significantly outperforming existing methods across diverse multimodal datasets.
Authors:Eunbyeol Cho, Jiyoun Kim, Minjae Lee, Sungjin Park, Edward Choi
Abstract:
Electronic Health Records (EHR) are time-series relational databases that record patient interactions and medical events over time, serving as a critical resource for healthcare research and applications. However, privacy concerns and regulatory restrictions limit the sharing and utilization of such sensitive data, necessitating the generation of synthetic EHR datasets. Unlike previous EHR synthesis methods, which typically generate medical records consisting of expert-chosen features (e.g. a few vital signs or structured codes only), we introduce RawMed, the first framework to synthesize multi-table, time-series EHR data that closely resembles raw EHRs. Using text-based representation and compression techniques, RawMed captures complex structures and temporal dynamics with minimal preprocessing. We also propose a new evaluation framework for multi-table time-series synthetic EHRs, assessing distributional similarity, inter-table relationships, temporal dynamics, and privacy. Validated on two open-source EHR datasets, RawMed outperforms baseline models in fidelity and utility. The code is available at https://github.com/eunbyeol-cho/RawMed.
Chinese: RawMed 是一种创新框架,通过基于文本的表示和压缩技术合成多表时间序列电子健康记录,模拟原始数据,在保真度和实用性上优于基线方法,同时解决隐私问题。
English: RawMed is a novel framework that synthesizes multi-table, time-series electronic health records resembling raw data using text-based representation and compression, outperforming baselines in fidelity and utility while addressing privacy concerns.
Authors:Fei Teng, Kai Luo, Sheng Wu, Siyu Li, Pujun Guo, Jiale Wei, Kunyu Peng, Jiaming Zhang, Kailun Yang
Abstract:
Panoramic perception holds significant potential for autonomous driving, enabling vehicles to acquire a comprehensive 360° surround view in a single shot. However, autonomous driving is a data-driven task. Complete panoramic data acquisition requires complex sampling systems and annotation pipelines, which are time-consuming and labor-intensive. Although existing street view generation models have demonstrated strong data regeneration capabilities, they can only learn from the fixed data distribution of existing datasets and cannot achieve high-quality, controllable panoramic generation. In this paper, we propose the first panoramic generation method Percep360 for autonomous driving. Percep360 enables coherent generation of panoramic data with control signals based on the stitched panoramic data. Percep360 focuses on two key aspects: coherence and controllability. Specifically, to overcome the inherent information loss caused by the pinhole sampling process, we propose the Local Scenes Diffusion Method (LSDM). LSDM reformulates the panorama generation as a spatially continuous diffusion process, bridging the gaps between different data distributions. Additionally, to achieve the controllable generation of panoramic images, we propose a Probabilistic Prompting Method (PPM). PPM dynamically selects the most relevant control cues, enabling controllable panoramic image generation. We evaluate the effectiveness of the generated images from three perspectives: image quality assessment (i.e., no-reference and with reference), controllability, and their utility in real-world Bird's Eye View (BEV) segmentation. Notably, the generated data consistently outperforms the original stitched images in no-reference quality metrics and enhances downstream perception models. The source code will be publicly available at https://github.com/Bryant-Teng/Percep360.
中文: 本文提出的Percep360是首个自动驾驶全景生成方法,通过局部场景扩散模型和概率提示机制实现连贯可控的全景数据合成,在图像质量和下游感知任务中均展现出显著优势。
English: This paper introduces Percep360, the first panoramic generation method for autonomous driving that ensures coherent and controllable data synthesis through a Local Scenes Diffusion Method and Probabilistic Prompting Method, significantly enhancing image quality and downstream perception tasks.
Authors:Yixin Zhao, Yuyi Zhang, Lianwen Jin
Abstract:
Research on the attribute information of calligraphy, such as styles, dynasties, and calligraphers, holds significant cultural and historical value. However, the styles of Chinese calligraphy characters have evolved dramatically through different dynasties and the unique touches of calligraphers, making it highly challenging to accurately recognize these different characters and their attributes. Furthermore, existing calligraphic datasets are extremely scarce, and most provide only character-level annotations without additional attribute information. This limitation has significantly hindered the in-depth study of Chinese calligraphy. To fill this gap, we present a novel Multi-Attribute Chinese Calligraphy Character Dataset (MCCD). The dataset encompasses 7,765 categories with a total of 329,715 isolated image samples of Chinese calligraphy characters, and three additional subsets were extracted based on the attribute labeling of the three types of script styles (10 types), dynasties (15 periods) and calligraphers (142 individuals). The rich multi-attribute annotations render MCCD well-suited diverse research tasks, including calligraphic character recognition, writer identification, and evolutionary studies of Chinese characters. We establish benchmark performance through single-task and multi-task recognition experiments across MCCD and all of its subsets. The experimental results demonstrate that the complexity of the stroke structure of the calligraphic characters, and the interplay between their different attributes, leading to a substantial increase in the difficulty of accurate recognition. MCCD not only fills a void in the availability of detailed calligraphy datasets but also provides valuable resources for advancing research in Chinese calligraphy and fostering advancements in multiple fields. The dataset is available at https://github.com/SCUT-DLVCLab/MCCD.
Chinese Summary: 本研究提出了多属性中国书法字符数据集(MCCD),通过提供包含字体、朝代和书法家详细标注的329,715个字符图像,填补了书法标注数据稀缺的空白,为书法识别与历史研究提供了重要资源。
English Summary: This research introduces the Multi-Attribute Chinese Calligraphy Character Dataset (MCCD), a comprehensive resource addressing the scarcity of annotated calligraphy data by providing 329,715 character images with detailed script, dynasty, and calligrapher attributes to advance recognition and historical studies.
Authors:Xiao Wang, Jiahuan Pei, Diancheng Shui, Zhiguang Han, Xin Sun, Dawei Zhu, Xiaoyu Shen
Abstract:
Legal judgment prediction offers a compelling method to aid legal practitioners and researchers. However, the research question remains relatively under-explored: Should multiple defendants and charges be treated separately in LJP? To address this, we introduce a new dataset namely multi-person multi-charge prediction (MPMCP), and seek the answer by evaluating the performance of several prevailing legal large language models (LLMs) on four practical legal judgment scenarios: (S1) single defendant with a single charge, (S2) single defendant with multiple charges, (S3) multiple defendants with a single charge, and (S4) multiple defendants with multiple charges. We evaluate the dataset across two LJP tasks, i.e., charge prediction and penalty term prediction. We have conducted extensive experiments and found that the scenario involving multiple defendants and multiple charges (S4) poses the greatest challenges, followed by S2, S3, and S1. The impact varies significantly depending on the model. For example, in S4 compared to S1, InternLM2 achieves approximately 4.5% lower F1-score and 2.8% higher LogD, while Lawformer demonstrates around 19.7% lower F1-score and 19.0% higher LogD. Our dataset and code are available at https://github.com/lololo-xiao/MultiJustice-MPMCP.
中文: 本研究提出了一个多人多罪名法律判决预测数据集,发现涉及多名被告和多项罪名的场景对法律大模型最具挑战性,且不同模型的性能差异显著。
English: This study introduces a new dataset for multi-person multi-charge legal judgment prediction, finding that scenarios with multiple defendants and charges pose the greatest challenges to legal LLMs, with performance impacts varying significantly across different models.
Authors:Ziyan Liu, Chunxiao Fan, Haoran Lou, Yuexin Wu, Kaiwei Deng
Abstract:
The rapid expansion of memes on social media has highlighted the urgent need for effective approaches to detect harmful content. However, traditional data-driven approaches struggle to detect new memes due to their evolving nature and the lack of up-to-date annotated data. To address this issue, we propose MIND, a multi-agent framework for zero-shot harmful meme detection that does not rely on annotated data. MIND implements three key strategies: 1) We retrieve similar memes from an unannotated reference set to provide contextual information. 2) We propose a bi-directional insight derivation mechanism to extract a comprehensive understanding of similar memes. 3) We then employ a multi-agent debate mechanism to ensure robust decision-making through reasoned arbitration. Extensive experiments on three meme datasets demonstrate that our proposed framework not only outperforms existing zero-shot approaches but also shows strong generalization across different model architectures and parameter scales, providing a scalable solution for harmful meme detection. The code is available at https://github.com/destroy-lonely/MIND.
中文:提出的MIND框架通过多智能体策略实现零样本有害表情包检测,结合上下文检索、双向洞察推导和辩论机制,无需标注数据即可实现卓越性能与泛化能力。
English: The proposed MIND framework enables zero-shot harmful meme detection by leveraging multi-agent strategies, including context retrieval, bi-directional insight derivation, and debate mechanisms, achieving superior performance and generalization without annotated data.
Authors:Huishi Luo, Yiqing Wu, Yiwen Chen, Fuzhen Zhuang, Deqing Wang
Abstract:
Multi-domain recommendation leverages domain-general knowledge to improve recommendations across several domains. However, as platforms expand to dozens or hundreds of scenarios, training all domains in a unified model leads to performance degradation due to significant inter-domain differences. Existing domain grouping methods, based on business logic or data similarities, often fail to capture the true transfer relationships required for optimal grouping. To effectively cluster domains, we propose Causal Domain Clustering (CDC). CDC models domain transfer patterns within a large number of domains using two distinct effects: the Isolated Domain Affinity Matrix for modeling non-interactive domain transfers, and the Hybrid Domain Affinity Matrix for considering dynamic domain synergy or interference under joint training. To integrate these two transfer effects, we introduce causal discovery to calculate a cohesion-based coefficient that adaptively balances their contributions. A Co-Optimized Dynamic Clustering algorithm iteratively optimizes target domain clustering and source domain selection for training. CDC significantly enhances performance across over 50 domains on public datasets and in industrial settings, achieving a 4.9% increase in online eCPM. Code is available at https://github.com/Chrissie-Law/Causal-Domain-Clustering-for-Multi-Domain-Recommendation
中文: 本文提出因果域聚类(CDC)方法,通过因果发现建模域间传递模式并动态优化聚类,有效解决了多领域推荐中的性能下降问题,在50多个领域显著提升效果,使在线eCPM增长4.9%。
English: This paper introduces Causal Domain Clustering (CDC), a novel method that effectively groups domains in multi-domain recommendation by modeling transfer patterns through causal discovery and dynamic clustering, significantly boosting performance across over 50 domains with a 4.9% increase in online eCPM.
Authors:Yizhuo Wu, Ang Li, Chang Gao
Abstract:
Neural network (NN)-based Digital Predistortion (DPD) stands out in improving signal quality in wideband radio frequency (RF) power amplifiers (PAs) employing complex modulation. However, NN DPDs usually rely on a large number of parameters for effective linearization and can significantly contribute to the energy consumption of the digital back-end in RF systems. This paper presents OpenDPDv2, a unified framework for PA modeling, DPD learning, and model optimization to reduce power consumption while maintaining high linearization performance. The optimization techniques feature a novel DPD algorithm, TRes-DeltaGRU, alongside two energy-efficient methods. The top-performing 32-bit floating-point (FP32) TRes-DeltaGRU-DPD model achieves an Adjacent Channel Power Ratio (ACPR) of -59.4 dBc and Error Vector Magnitude (EVM) of -42.1 dBc. By exploiting fixed-point quantization and dynamic temporal sparsity of input signals and hidden neurons, the inference energy of our model can be reduced by 4.5X while still maintaining -50.3 dBc ACPR and -35.2 dB EVM with 56% temporal sparsity. This was evaluated using a TM3.1a 200 MHz bandwidth 256-QAM OFDM signal applied to a 3.5 GHz GaN Doherty RF PA. OpenDPDv2 code, datasets, and documentation are publicly accessible at: https://github.com/lab-emi/OpenDPD.
中文: OpenDPDv2是一个统一框架,通过TRes-DeltaGRU算法和能效优化技术,在保持宽带射频功率放大器高性能线性化的同时,显著降低了神经网络数字预失真处理的功耗。
English: OpenDPDv2 is a unified framework that introduces the TRes-DeltaGRU algorithm and energy-efficient optimization techniques to significantly reduce power consumption while maintaining high linearization performance in neural network-based digital predistortion for wideband RF power amplifiers.
Authors:Dahyun Lee, Yongrae Jo, Haeju Park, Moontae Lee
Abstract:
Retrieval in Retrieval-Augmented Generation(RAG) must ensure that retrieved passages are not only individually relevant but also collectively form a comprehensive set. Existing approaches primarily rerank top-k passages based on their individual relevance, often failing to meet the information needs of complex queries in multi-hop question answering. In this work, we propose a set-wise passage selection approach and introduce SETR, which explicitly identifies the information requirements of a query through Chain-of-Thought reasoning and selects an optimal set of passages that collectively satisfy those requirements. Experiments on multi-hop RAG benchmarks show that SETR outperforms both proprietary LLM-based rerankers and open-source baselines in terms of answer correctness and retrieval quality, providing an effective and efficient alternative to traditional rerankers in RAG systems. The code is available at https://github.com/LGAI-Research/SetR
中文: 本文提出SETR方法,通过思维链推理识别查询需求并选择集体全面的段落,在多跳RAG基准测试中优于现有重排序器。
English: This paper introduces SETR, a set-wise passage selection method that uses Chain-of-Thought reasoning to identify query requirements and select collectively comprehensive passages, outperforming existing rerankers in multi-hop RAG benchmarks.
Authors:Matej Straka, Martin Schmid
Abstract:
We introduce a real-time strategy game environment based on Generals.io, a game with thousands of weekly active players. Our environment is fully compatible with Gymnasium and PettingZoo and is capable of running thousands of frames per second on commodity hardware. We also present a reference agent, trained with supervised pre-training and self-play, which reached the top 0.003% of the 1v1 human leaderboard after only 36 hours on a single H100 GPU. To accelerate learning, we incorporate potential-based reward shaping and memory features. Our contributions of a modular RTS benchmark and a competitive baseline agent provide an accessible yet challenging platform for advancing multi-agent reinforcement learning research. The documented code, together with examples and tutorials, is available at https://github.com/strakam/generals-bots.
中文: 本文基于Generals.io开发了一个实时策略游戏环境,其高性能智能体通过高效训练方法达到了顶尖人类玩家水平,为多智能体强化学习研究提供了模块化基准平台。
English: This paper presents a real-time strategy game environment based on Generals.io, featuring a high-performance reference agent that achieved top-tier human performance through efficient training methods, providing a modular benchmark for multi-agent reinforcement learning research.
Authors:Philipp Schlinge, Steffen Meinert, Martin Atzmueller
Abstract:
Prototype models are an important method for explainable artificial intelligence (XAI) and interpretable machine learning. In this paper, we perform an in-depth analysis of a set of prominent prototype models including ProtoPNet, ProtoPool and PIPNet. For their assessment, we apply a comprehensive set of metrics. In addition to applying standard metrics from literature, we propose several new metrics to further complement the analysis of model interpretability. In our experimentation, we apply the set of prototype models on a diverse set of datasets including fine-grained classification, Non-IID settings and multi-label classification to further contrast the performance. Furthermore, we also provide our code as an open-source library (https://github.com/uos-sis/quanproto), which facilitates simple application of the metrics itself, as well as extensibility -- providing the option for easily adding new metrics and models.
Chinese: 本文通过标准和新提出的指标,在多样化数据集上对可解释AI的原型模型进行全面评估,并发布了开源库以支持指标应用和扩展性。
English: This paper conducts a comprehensive evaluation of prototype-based models for explainable AI using both standard and newly proposed metrics across diverse datasets, while also releasing an open-source library for metric application and extensibility.
Authors:Cosimo Fiorini, Matteo Mosconi, Pietro Buzzega, Riccardo Salami, Simone Calderara
Abstract:
Federated Learning (FL) enables collaborative model training across distributed clients while preserving data privacy. While existing approaches for aggregating client-specific classification heads and adapted backbone parameters require architectural modifications or loss function changes, our method uniquely leverages intrinsic training signals already available during standard optimization. We present LIVAR (Layer Importance and VARiance-based merging), which introduces: i) a variance-weighted classifier aggregation scheme using naturally emergent feature statistics, and ii) an explainability-driven LoRA merging technique based on SHAP analysis of existing update parameter patterns. Without any architectural overhead, LIVAR achieves state-of-the-art performance on multiple benchmarks while maintaining seamless integration with existing FL methods. This work demonstrates that effective model merging can be achieved solely through existing training signals, establishing a new paradigm for efficient federated model aggregation. The code is available at https://github.com/aimagelab/fed-mammoth.
Chinese: LIVAR提出了一种基于方差加权的分类器聚合和可解释性驱动的LoRA融合技术,通过利用现有训练信号,无需架构修改即可实现最先进的联邦学习性能。
English: LIVAR introduces a variance-weighted classifier aggregation and an explainability-driven LoRA merging technique, achieving state-of-the-art federated learning performance without architectural changes by utilizing existing training signals.
Authors:Eya Cherif, Arthur Ouaknine, Luke A. Brown, Phuong D. Dao, Kyle R. Kovach, Bing Lu, Daniel Mederer, Hannes Feilhauer, Teja Kattenborn, David Rolnick
Abstract:
Plant traits such as leaf carbon content and leaf mass are essential variables in the study of biodiversity and climate change. However, conventional field sampling cannot feasibly cover trait variation at ecologically meaningful spatial scales. Machine learning represents a valuable solution for plant trait prediction across ecosystems, leveraging hyperspectral data from remote sensing. Nevertheless, trait prediction from hyperspectral data is challenged by label scarcity and substantial domain shifts (\eg across sensors, ecological distributions), requiring robust cross-domain methods. Here, we present GreenHyperSpectra, a pretraining dataset encompassing real-world cross-sensor and cross-ecosystem samples designed to benchmark trait prediction with semi- and self-supervised methods. We adopt an evaluation framework encompassing in-distribution and out-of-distribution scenarios. We successfully leverage GreenHyperSpectra to pretrain label-efficient multi-output regression models that outperform the state-of-the-art supervised baseline. Our empirical analyses demonstrate substantial improvements in learning spectral representations for trait prediction, establishing a comprehensive methodological framework to catalyze research at the intersection of representation learning and plant functional traits assessment. All code and data are available at: https://github.com/echerif18/HyspectraSSL.
中文: GreenHyperSpectra是一个跨传感器数据集,通过自监督学习实现了稳健的植物性状预测,在解决领域偏移和标签稀缺问题的同时超越了现有监督方法的性能。
English: GreenHyperSpectra is a cross-sensor dataset enabling robust plant trait prediction through self-supervised learning, outperforming supervised methods while addressing domain shifts and label scarcity.
Authors:Antonella Barisic Kulas, Andreja Jurasovic, Stjepan Bogdan
Abstract:
Thermal imaging from unmanned aerial vehicles (UAVs) holds significant potential for applications in search and rescue, wildlife monitoring, and emergency response, especially under low-light or obscured conditions. However, the scarcity of large-scale, diverse thermal aerial datasets limits the advancement of deep learning models in this domain, primarily due to the high cost and logistical challenges of collecting thermal data. In this work, we introduce a novel procedural pipeline for generating synthetic thermal images from an aerial perspective. Our method integrates arbitrary object classes into existing thermal backgrounds by providing control over the position, scale, and orientation of the new objects, while aligning them with the viewpoints of the background. We enhance existing thermal datasets by introducing new object categories, specifically adding a drone class in urban environments to the HIT-UAV dataset and an animal category to the MONET dataset. In evaluating these datasets for object detection task, we showcase strong performance across both new and existing classes, validating the successful expansion into new applications. Through comparative analysis, we show that thermal detectors outperform their visible-light-trained counterparts and highlight the importance of replicating aerial viewing angles. Project page: https://github.com/larics/thermal_aerial_synthetic.
Chinese: 本研究提出了一种生成合成热成像航拍图像的程序化流程,通过为数据集增加新物体类别,验证了热成像检测器在物体识别任务中优于可见光模型的性能。
English: This study introduces a procedural pipeline for generating synthetic thermal aerial images, enhancing datasets with new object classes and demonstrating superior object detection performance compared to visible-light models.
Authors:SeungYoon Han, Taeho Hwang, Sukmin Cho, Soyeong Jeong, Hoyun Song, Huije Lee, Jong C. Park
Abstract:
The rapid expansion of digital information and knowledge across structured and unstructured sources has heightened the importance of Information Retrieval (IR). While dense retrieval methods have substantially improved semantic matching for general queries, they consistently underperform on queries with explicit temporal constraints--often those containing numerical expressions and time specifiers such as ``in 2015.'' Existing approaches to Temporal Information Retrieval (TIR) improve temporal reasoning but often suffer from catastrophic forgetting, leading to reduced performance on non-temporal queries. To address this, we propose Time-Specifier Model Merging (TSM), a novel method that enhances temporal retrieval while preserving accuracy on non-temporal queries. TSM trains specialized retrievers for individual time specifiers and merges them in to a unified model, enabling precise handling of temporal constraints without compromising non-temporal retrieval. Extensive experiments on both temporal and non-temporal datasets demonstrate that TSM significantly improves performance on temporally constrained queries while maintaining strong results on non-temporal queries, consistently outperforming other baseline methods. Our code is available at https://github.com/seungyoonee/TSM .
中文摘要:本研究提出的时间指示符模型融合(TSM)方法通过将专门化检索器融合为统一模型,在保持非时序查询检索性能的同时,显著提升了时序约束查询的处理能力。
English Summary: The proposed Time-Specifier Model Merging (TSM) method effectively enhances temporal information retrieval while maintaining strong performance on non-temporal queries by merging specialized retrievers into a unified model.
Authors:Yan Hon Michael Chung, Donghyeok Choi
Abstract:
Manchu, a critically endangered language essential for understanding early modern Eastern Eurasian history, lacks effective OCR systems that can handle real-world historical documents. This study develops high-performing OCR systems by fine-tuning three open-source vision-language models (LLaMA-3.2-11B, Qwen2.5-VL-7B, Qwen2.5-VL-3B) on 60,000 synthetic Manchu word images using parameter-efficient training. LLaMA-3.2-11B achieved exceptional performance with 98.3\% word accuracy and 0.0024 character error rate on synthetic data, while crucially maintaining 93.1\% accuracy on real-world handwritten documents. Comparative evaluation reveals substantial advantages over traditional approaches: while a CRNN baseline achieved 99.8\% synthetic accuracy, it suffered severe degradation to 72.5\% on real documents. Our approach demonstrates effective synthetic-to-real domain transfer, providing a cost-effective solution deployable on accessible infrastructure. This work establishes a transferable framework for endangered language OCR that removes technical and financial barriers in digital humanities, enabling historians and linguists to process historical archives without specialized computing resources. Code and model weights are available at https://github.com/mic7ch1/ManchuAI-OCR.
中文摘要:本研究通过基于合成数据微调视觉语言模型,为濒危满语开发了高性能OCR系统,在合成和真实文档上均取得优异准确率,同时克服了传统方法在领域迁移中的局限性。
English Summary: This study develops high-performing OCR systems for the endangered Manchu language by fine-tuning vision-language models on synthetic data, achieving exceptional accuracy on both synthetic and real-world documents while overcoming traditional approaches' limitations in domain transfer.
Authors:Mahshid Shiri, Cigdem Beyan, Vittorio Murino
Abstract:
Medical anomaly detection (AD) is challenging due to diverse imaging modalities, anatomical variations, and limited labeled data. We propose a novel approach combining visual adapters and prompt learning with Partial Optimal Transport (POT) and contrastive learning (CL) to improve CLIP's adaptability to medical images, particularly for AD. Unlike standard prompt learning, which often yields a single representation, our method employs multiple prompts aligned with local features via POT to capture subtle abnormalities. CL further enforces intra-class cohesion and inter-class separation. Our method achieves state-of-the-art results in few-shot, zero-shot, and cross-dataset scenarios without synthetic data or memory banks. The code is available at https://github.com/mahshid1998/MADPOT.
Chinese: 本研究提出了一种新方法,结合视觉适配器和提示学习与部分最优传输及对比学习,以提升CLIP在医学异常检测中的适应性,无需合成数据即在多种场景下取得最优结果。
English: This study introduces a novel method that integrates visual adapters and prompt learning with Partial Optimal Transport and contrastive learning to enhance CLIP's performance in medical anomaly detection, achieving top results in various scenarios without synthetic data.
Authors:Guobin Zhu, Rui Zhou, Wenkang Ji, Hongyin Zhang, Donglin Wang, Shiyu Zhao
Abstract:
Multi-task multi-agent reinforcement learning (MT-MARL) has recently gained attention for its potential to enhance MARL's adaptability across multiple tasks. However, it is challenging for existing multi-task learning methods to handle complex problems, as they are unable to handle unrelated tasks and possess limited knowledge transfer capabilities. In this paper, we propose a hierarchical approach that efficiently addresses these challenges. The high-level module utilizes a skill graph, while the low-level module employs a standard MARL algorithm. Our approach offers two contributions. First, we consider the MT-MARL problem in the context of unrelated tasks, expanding the scope of MTRL. Second, the skill graph is used as the upper layer of the standard hierarchical approach, with training independent of the lower layer, effectively handling unrelated tasks and enhancing knowledge transfer capabilities. Extensive experiments are conducted to validate these advantages and demonstrate that the proposed method outperforms the latest hierarchical MAPPO algorithms. Videos and code are available at https://github.com/WindyLab/MT-MARL-SG
Chinese Summary: 本文提出了一种用于多任务多智能体强化学习的层次化方法,该方法在高层模块使用技能图,在低层模块采用标准MARL算法,有效处理不相关任务并提升知识迁移能力,性能优于现有分层方法。
English Summary: This paper introduces a hierarchical approach for multi-task multi-agent reinforcement learning that uses a skill graph in the high-level module and standard MARL algorithms in the low-level module, effectively handling unrelated tasks and improving knowledge transfer while outperforming existing hierarchical methods.
Authors:Miaojing Shi, Xiaowen Zhang, Zijie Yue, Yong Luo, Cairong Zhao, Li Li
Abstract:
Recent advances in large vision-language models (VLMs) have shown remarkable progress in solving the text-promptable object counting problem. Representative methods typically specify text prompts with object category information in images. This however is insufficient for training the model to accurately distinguish the number of objects in the counting task. To this end, we propose QUANet, which introduces novel quantity-oriented text prompts with a vision-text quantity alignment loss to enhance the model's quantity awareness. Moreover, we propose a dual-stream adaptive counting decoder consisting of a Transformer stream, a CNN stream, and a number of Transformer-to-CNN enhancement adapters (T2C-adapters) for density map prediction. The T2C-adapters facilitate the effective knowledge communication and aggregation between the Transformer and CNN streams. A cross-stream quantity ranking loss is proposed in the end to optimize the ranking orders of predictions from the two streams. Extensive experiments on standard benchmarks such as FSC-147, CARPK, PUCPR+, and ShanghaiTech demonstrate our model's strong generalizability for zero-shot class-agnostic counting. Code is available at https://github.com/viscom-tongji/QUANet
中文:QUANet通过引入面向数量的文本提示和双流解码器增强模型的数量感知能力,在多个基准测试中展现出优异的零样本计数泛化性能。
English: QUANet introduces quantity-oriented text prompts and a dual-stream decoder with cross-stream enhancement to improve object counting accuracy, demonstrating strong zero-shot performance across multiple benchmarks.
Authors:Boyuan Tian, Qizhe Gao, Siran Xianyu, Xiaotong Cui, Minjia Zhang
Abstract:
3D Gaussian splatting has become a prominent technique for representing and rendering complex 3D scenes, due to its high fidelity and speed advantages. However, the growing demand for large-scale models calls for effective compression to reduce memory and computation costs, especially on mobile and edge devices with limited resources. Existing compression methods effectively reduce 3D Gaussian parameters but often require extensive retraining or fine-tuning, lacking flexibility under varying compression constraints.
In this paper, we introduce FlexGaussian, a flexible and cost-effective method that combines mixed-precision quantization with attribute-discriminative pruning for training-free 3D Gaussian compression. FlexGaussian eliminates the need for retraining and adapts easily to diverse compression targets. Evaluation results show that FlexGaussian achieves up to 96.4% compression while maintaining high rendering quality (<1 dB drop in PSNR), and is deployable on mobile devices. FlexGaussian delivers high compression ratios within seconds, being 1.7-2.1x faster than state-of-the-art training-free methods and 10-100x faster than training-involved approaches. The code is being prepared and will be released soon at: https://github.com/Supercomputing-System-AI-Lab/FlexGaussian
中文: FlexGaussian提出了一种无需训练的3D高斯溅射压缩方法,通过混合精度量化和属性判别剪枝实现高达96.4%的压缩率,在保持渲染质量的同时可快速部署于移动设备。
English: FlexGaussian introduces a training-free compression method for 3D Gaussian splatting, combining mixed-precision quantization and pruning to achieve up to 96.4% compression with minimal quality loss and rapid deployment on mobile devices.
Authors:Yifan Yang, Peili Song, Enfan Lan, Dong Liu, Jingtai Liu
Abstract:
Category-level object pose estimation, which predicts the pose of objects within a known category without prior knowledge of individual instances, is essential in applications like warehouse automation and manufacturing. Existing methods relying on RGB images or point cloud data often struggle with object occlusion and generalization across different instances and categories. This paper proposes a multimodal-based keypoint learning framework (MK-Pose) that integrates RGB images, point clouds, and category-level textual descriptions. The model uses a self-supervised keypoint detection module enhanced with attention-based query generation, soft heatmap matching and graph-based relational modeling. Additionally, a graph-enhanced feature fusion module is designed to integrate local geometric information and global context. MK-Pose is evaluated on CAMERA25 and REAL275 dataset, and is further tested for cross-dataset capability on HouseCat6D dataset. The results demonstrate that MK-Pose outperforms existing state-of-the-art methods in both IoU and average precision without shape priors. Codes will be released at \href{https://github.com/yangyifanYYF/MK-Pose}{https://github.com/yangyifanYYF/MK-Pose}.
中文: 本文提出MK-Pose多模态关键点学习框架,融合RGB图像、点云和类别文本描述,在无需形状先验的情况下,显著提升了类别级物体姿态估计在基准数据集上的性能表现。
English: This paper introduces MK-Pose, a multimodal keypoint learning framework that integrates RGB images, point clouds, and textual descriptions to enhance category-level object pose estimation, achieving superior performance on benchmark datasets without relying on shape priors.
Authors:Hongjie Wu, Mingqin Zhang, Linchao He, Ji-Zhe Zhou, Jiancheng Lv
Abstract:
Diffusion models have shown remarkable promise for image restoration by leveraging powerful priors. Prominent methods typically frame the restoration problem within a Bayesian inference framework, which iteratively combines a denoising step with a likelihood guidance step. However, the interactions between these two components in the generation process remain underexplored. In this paper, we analyze the underlying gradient dynamics of these components and identify significant instabilities. Specifically, we demonstrate conflicts between the prior and likelihood gradient directions, alongside temporal fluctuations in the likelihood gradient itself. We show that these instabilities disrupt the generative process and compromise restoration performance. To address these issues, we propose Stabilized Progressive Gradient Diffusion (SPGD), a novel gradient management technique. SPGD integrates two synergistic components: (1) a progressive likelihood warm-up strategy to mitigate gradient conflicts; and (2) adaptive directional momentum (ADM) smoothing to reduce fluctuations in the likelihood gradient. Extensive experiments across diverse restoration tasks demonstrate that SPGD significantly enhances generation stability, leading to state-of-the-art performance in quantitative metrics and visually superior results. Code is available at https://github.com/74587887/SPGD.
中文摘要:本文提出SPGD这一新型梯度管理技术,通过解决梯度冲突和波动问题来稳定图像修复中的扩散模型,在多项任务中实现了最先进的性能。
English Summary: This paper introduces SPGD, a novel gradient management technique that stabilizes diffusion models for image restoration by addressing gradient conflicts and fluctuations, achieving state-of-the-art performance across various tasks.
Authors:Naoya Sogi, Takashi Shibata, Makoto Terao, Masanori Suganuma, Takayuki Okatani
Abstract:
Result diversification (RD) is a crucial technique in Text-to-Image Retrieval for enhancing the efficiency of a practical application. Conventional methods focus solely on increasing the diversity metric of image appearances. However, the diversity metric and its desired value vary depending on the application, which limits the applications of RD. This paper proposes a novel task called CDR-CA (Contextual Diversity Refinement of Composite Attributes). CDR-CA aims to refine the diversities of multiple attributes, according to the application's context. To address this task, we propose Multi-Source DPPs, a simple yet strong baseline that extends the Determinantal Point Process (DPP) to multi-sources. We model MS-DPP as a single DPP model with a unified similarity matrix based on a manifold representation. We also introduce Tangent Normalization to reflect contexts. Extensive experiments demonstrate the effectiveness of the proposed method. Our code is publicly available at https://github.com/NEC-N-SOGI/msdpp.
中文: 本文提出CDR-CA新任务,通过多源DPPs和切线归一化方法,实现了基于应用场景的复合属性多样性优化,实验证明该方法在文本-图像检索中具有显著效果。
English: This paper introduces CDR-CA, a novel task for refining attribute diversity in text-to-image retrieval based on application context, and proposes Multi-Source DPPs with Tangent Normalization as an effective solution, validated through extensive experiments.
Authors:Liliang Ren, Congcong Chen, Haoran Xu, Young Jin Kim, Adam Atkinson, Zheng Zhan, Jiankai Sun, Baolin Peng, Liyuan Liu, Shuohang Wang, Hao Cheng, Jianfeng Gao, Weizhu Chen, Yelong Shen
Abstract:
Recent advances in language modeling have demonstrated the effectiveness of State Space Models (SSMs) for efficient sequence modeling. While hybrid architectures such as Samba and the decoder-decoder architecture, YOCO, have shown promising performance gains over Transformers, prior works have not investigated the efficiency potential of representation sharing between SSM layers. In this paper, we introduce the Gated Memory Unit (GMU), a simple yet effective mechanism for efficient memory sharing across layers. We apply it to create SambaY, a decoder-hybrid-decoder architecture that incorporates GMUs in the cross-decoder to share memory readout states from a Samba-based self-decoder. SambaY significantly enhances decoding efficiency, preserves linear pre-filling time complexity, and boosts long-context performance, all while eliminating the need for explicit positional encoding. Through extensive scaling experiments, we demonstrate that our model exhibits a significantly lower irreducible loss compared to a strong YOCO baseline, indicating superior performance scalability under large-scale compute regimes. Our largest model enhanced with Differential Attention, Phi4-mini-Flash-Reasoning, achieves significantly better performance than Phi4-mini-Reasoning on reasoning tasks such as Math500, AIME24/25, and GPQA Diamond without any reinforcement learning, while delivering up to 10x higher decoding throughput on 2K-length prompts with 32K generation length under the vLLM inference framework. We release our training codebase on open-source data at https://github.com/microsoft/ArchScale.
中文摘要:本文提出门控记忆单元(GMU),构建了SambaY混合架构,在消除位置编码的同时显著提升了解码效率和长上下文性能,相比现有模型展现出更优的可扩展性和吞吐量。
English Summary: This paper introduces the Gated Memory Unit (GMU) to create SambaY, a hybrid architecture that enhances decoding efficiency and long-context performance while eliminating positional encoding, achieving superior scalability and throughput compared to existing models.
Authors:Qing Zhang, Guoquan Pei, Yan Wang
Abstract:
Medical Hyperspectral Imaging (MHSI) has emerged as a promising tool for enhanced disease diagnosis, particularly in computational pathology, offering rich spectral information that aids in identifying subtle biochemical properties of tissues. Despite these advantages, effectively fusing both spatial-dimensional and spectral-dimensional information from MHSIs remains challenging due to its high dimensionality and spectral redundancy inherent characteristics. To solve the above challenges, we propose a novel spatial-spectral omni-fusion network for hyperspectral image segmentation, named as Omni-Fuse. Here, we introduce abundant cross-dimensional feature fusion operations, including a cross-dimensional enhancement module that refines both spatial and spectral features through bidirectional attention mechanisms, a spectral-guided spatial query selection to select the most spectral-related spatial feature as the query, and a two-stage cross-dimensional decoder which dynamically guide the model to focus on the selected spatial query. Despite of numerous attention blocks, Omni-Fuse remains efficient in execution. Experiments on two microscopic hyperspectral image datasets show that our approach can significantly improve the segmentation performance compared with the state-of-the-art methods, with over 5.73 percent improvement in DSC. Code available at: https://github.com/DeepMed-Lab-ECNU/Omni-Fuse.
Chinese Summary: Omni-Fuse网络通过创新的跨维度融合技术,有效整合医学高光谱成像中的空间和光谱信息,在保持计算效率的同时显著提升分割精度,DSC指标提高超过5.73%。
English Summary: The Omni-Fuse network introduces innovative cross-dimensional fusion techniques to effectively integrate spatial and spectral information in medical hyperspectral imaging, significantly enhancing segmentation accuracy with over 5.73% DSC improvement while maintaining computational efficiency.
Authors:Xu Shaowu, Jia Xibin, Gao Junyu, Sun Qianmei, Chang Jing, Fan Chao
Abstract:
Long-term action recognition (LTAR) is challenging due to extended temporal spans with complex atomic action correlations and visual confounders. Although vision-language models (VLMs) have shown promise, they often rely on statistical correlations instead of causal mechanisms. Moreover, existing causality-based methods address modal-specific biases but lack cross-modal causal modeling, limiting their utility in VLM-based LTAR. This paper proposes \textbf{C}ross-\textbf{M}odal \textbf{D}ual-\textbf{C}ausal \textbf{L}earning (CMDCL), which introduces a structural causal model to uncover causal relationships between videos and label texts.
CMDCL addresses cross-modal biases in text embeddings via textual causal intervention and removes confounders inherent in the visual modality through visual causal intervention guided by the debiased text.
These dual-causal interventions enable robust action representations to address LTAR challenges. Experimental results on three benchmarks including Charades, Breakfast and COIN, demonstrate the effectiveness of the proposed model. Our code is available at https://github.com/xushaowu/CMDCL.
中文:本文提出跨模态双因果学习(CMDCL)方法,通过双因果干预解决长期动作识别中的跨模态偏差和视觉混杂因素,在多个基准测试中验证了其有效性。
English: This paper introduces Cross-Modal Dual-Causal Learning (CMDCL), a method that uses dual-causal interventions to address cross-modal biases and visual confounders in long-term action recognition, demonstrating effectiveness across multiple benchmarks.
Authors:Yang Chen, Yueqi Duan, Haowen Sun, Jiwen Lu, Yap-Peng Tan
Abstract:
This paper proposes an adaptive margin contrastive learning method for 3D semantic segmentation on point clouds. Most existing methods use equally penalized objectives, which ignore the per-point ambiguities and less discriminated features stemming from transition regions. However, as highly ambiguous points may be indistinguishable even for humans, their manually annotated labels are less reliable, and hard constraints over these points would lead to sub-optimal models. To address this, we first design AMContrast3D, a method comprising contrastive learning into an ambiguity estimation framework, tailored to adaptive objectives for individual points based on ambiguity levels. As a result, our method promotes model training, which ensures the correctness of low-ambiguity points while allowing mistakes for high-ambiguity points. As ambiguities are formulated based on position discrepancies across labels, optimization during inference is constrained by the assumption that all unlabeled points are uniformly unambiguous, lacking ambiguity awareness. Inspired by the insight of joint training, we further propose AMContrast3D++ integrating with two branches trained in parallel, where a novel ambiguity prediction module concurrently learns point ambiguities from generated embeddings. To this end, we design a masked refinement mechanism that leverages predicted ambiguities to enable the ambiguous embeddings to be more reliable, thereby boosting segmentation performance and enhancing robustness. Experimental results on 3D indoor scene datasets, S3DIS and ScanNet, demonstrate the effectiveness of the proposed method. Code is available at https://github.com/YangChenApril/AMContrast3D.
中文: 本文提出AMContrast3D自适应边界对比学习方法,通过基于点歧义程度调整惩罚来改进3D点云分割,其增强版AMContrast3D++通过并行训练和掩码细化机制进一步提升了分割性能。
English: This paper introduces AMContrast3D, an adaptive margin contrastive learning method that enhances 3D point cloud segmentation by adjusting penalties based on point ambiguity levels, and its improved version AMContrast3D++ further refines performance through parallel training and a masked refinement mechanism.
Authors:Qibiao Wu, Yagang Wang, Qian Zhang
Abstract:
Manual annotation of airway regions in computed tomography images is a time-consuming and expertise-dependent task. Automatic airway segmentation is therefore a prerequisite for enabling rapid bronchoscopic navigation and the clinical deployment of bronchoscopic robotic systems. Although convolutional neural network methods have gained considerable attention in airway segmentation, the unique tree-like structure of airways poses challenges for conventional and deformable convolutions, which often fail to focus on fine airway structures, leading to missed segments and discontinuities. To address this issue, this study proposes a novel tubular feature extraction network, named TfeNet. TfeNet introduces a novel direction-aware convolution operation that first applies spatial rotation transformations to adjust the sampling positions of linear convolution kernels. The deformed kernels are then represented as line segments or polylines in 3D space. Furthermore, a tubular feature fusion module (TFFM) is designed based on asymmetric convolution and residual connection strategies, enhancing the network's focus on subtle airway structures. Extensive experiments conducted on one public dataset and two datasets used in airway segmentation challenges demonstrate that the proposed TfeNet achieves more accuracy and continuous airway structure predictions compared with existing methods. In particular, TfeNet achieves the highest overall score of 94.95% on the current largest airway segmentation dataset, Airway Tree Modeling(ATM22), and demonstrates advanced performance on the lung fibrosis dataset(AIIB23). The code is available at https://github.com/QibiaoWu/TfeNet.
中文摘要:本研究提出TfeNet这一新型管状特征提取网络,通过方向感知卷积和管状特征融合模块增强CT图像中气道分割的精度与连续性,在多个数据集上验证其优于现有方法的性能。
English Summary: This study introduces TfeNet, a novel tubular feature extraction network that enhances airway segmentation in CT images through direction-aware convolution and a tubular feature fusion module, achieving superior accuracy and continuity compared to existing methods.
Authors:Jeanette Schofield, Shuyu Tian, Hoang Thanh Thanh Truong, Maximilian Heil
Abstract:
Social media users often make scientific claims without citing where these claims come from, generating a need to verify these claims. This paper details work done by the DS@GT team for CLEF 2025 CheckThat! Lab Task 4b Scientific Claim Source Retrieval which seeks to find relevant scientific papers based on implicit references in tweets. Our team explored 6 different data augmentation techniques, 7 different retrieval and reranking pipelines, and finetuned a bi-encoder. Achieving an MRR@5 of 0.58, our team ranked 16th out of 30 teams for the CLEF 2025 CheckThat! Lab Task 4b, and improvement of 0.15 over the BM25 baseline of 0.43. Our code is available on Github at https://github.com/dsgt-arc/checkthat-2025-swd/tree/main/subtask-4b.
Chinese: DS@GT团队针对推文中的科学声明开发了检索方法,通过数据增强和检索流程在CLEF 2025竞赛中获得0.58的MRR@5评分。
English: The DS@GT team developed methods for retrieving scientific sources from tweets, employing data augmentation and retrieval pipelines to achieve an MRR@5 of 0.58 in the CLEF 2025 competition.
Authors:Taekyung Kim, Dongyoon Han, Byeongho Heo, Jeongeun Park, Sangdoo Yun
Abstract:
Deriving compact and temporally aware visual representations from dynamic scenes is essential for successful execution of sequential scene understanding tasks such as visual tracking and robotic manipulation. In this paper, we introduce Token Bottleneck (ToBo), a simple yet intuitive self-supervised learning pipeline that squeezes a scene into a bottleneck token and predicts the subsequent scene using minimal patches as hints. The ToBo pipeline facilitates the learning of sequential scene representations by conservatively encoding the reference scene into a compact bottleneck token during the squeeze step. In the expansion step, we guide the model to capture temporal dynamics by predicting the target scene using the bottleneck token along with few target patches as hints. This design encourages the vision backbone to embed temporal dependencies, thereby enabling understanding of dynamic transitions across scenes. Extensive experiments in diverse sequential tasks, including video label propagation and robot manipulation in simulated environments demonstrate the superiority of ToBo over baselines. Moreover, deploying our pre-trained model on physical robots confirms its robustness and effectiveness in real-world environments. We further validate the scalability of ToBo across different model scales.
中文: 本文提出Token Bottleneck (ToBo)方法,通过将场景压缩为瓶颈标记并用最少补丁预测后续场景,实现了跨多种任务和现实环境的有效序列场景理解。
English: This paper introduces Token Bottleneck (ToBo), a self-supervised learning method that compresses scenes into bottleneck tokens and predicts subsequent scenes using minimal patches, enabling effective sequential scene understanding across various tasks and real-world applications.
Authors:Shan Shen, Shenglu Hua, Jiajun Zou, Jiawei Liu, Jianwang Zhai, Chuan Shi, Wenjian Yu
Abstract:
Graph representation learning on Analog-Mixed Signal (AMS) circuits is crucial for various downstream tasks, e.g., parasitic estimation. However, the scarcity of design data, the unbalanced distribution of labels, and the inherent diversity of circuit implementations pose significant challenges to learning robust and transferable circuit representations. To address these limitations, we propose CircuitGCL, a novel graph contrastive learning framework that integrates representation scattering and label rebalancing to enhance transferability across heterogeneous circuit graphs. CircuitGCL employs a self-supervised strategy to learn topology-invariant node embeddings through hyperspherical representation scattering, eliminating dependency on large-scale data. Simultaneously, balanced mean squared error (BMSE) and balanced softmax cross-entropy (BSCE) losses are introduced to mitigate label distribution disparities between circuits, enabling robust and transferable parasitic estimation. Evaluated on parasitic capacitance estimation (edge-level task) and ground capacitance classification (node-level task) across TSMC 28nm AMS designs, CircuitGCL outperforms all state-of-the-art (SOTA) methods, with the $R^2$ improvement of $33.64\% \sim 44.20\%$ for edge regression and F1-score gain of $0.9\times \sim 2.1\times$ for node classification. Our code is available at https://github.com/ShenShan123/CircuitGCL.
Chinese: CircuitGCL是一种新颖的图对比学习框架,通过结合超球面表示散射和标签重平衡技术,增强了模拟混合信号电路表示学习的可迁移性,在寄生参数估计任务中显著优于现有最优方法。
English: CircuitGCL is a novel graph contrastive learning framework that enhances transferability in AMS circuit representation learning by integrating hyperspherical representation scattering and label rebalancing techniques, achieving superior performance in parasitic estimation tasks compared to existing methods.
Authors:Themistoklis Vargiemezis, Catherine Gorlé
Abstract:
Accurate prediction of wind flow fields in urban canopies is crucial for ensuring pedestrian comfort, safety, and sustainable urban design. Traditional methods using wind tunnels and Computational Fluid Dynamics, such as Large-Eddy Simulations (LES), are limited by high costs, computational demands, and time requirements. This study presents a deep neural network (DNN) approach for fast and accurate predictions of urban wind flow fields, reducing computation time from an order of 10 hours on 32 CPUs for one LES evaluation to an order of 1 second on a single GPU using the DNN model. We employ a U-Net architecture trained on LES data including 252 synthetic urban configurations at seven wind directions ($0^{o}$ to $90^{o}$ in $15^{o}$ increments). The model predicts two key quantities of interest: mean velocity magnitude and streamwise turbulence intensity, at multiple heights within the urban canopy. The U-net uses 2D building representations augmented with signed distance functions and their gradients as inputs, forming a $256\times256\times9$ tensor. In addition, a Spatial Attention Module is used for feature transfer through skip connections. The loss function combines the root-mean-square error of predictions, their gradient magnitudes, and L2 regularization. Model evaluation on 50 test cases demonstrates high accuracy with an overall mean relative error of 9.3% for velocity magnitude and 5.2% for turbulence intensity. This research shows the potential of deep learning approaches to provide fast, accurate urban wind assessments essential for creating comfortable and safe urban environments. Code is available at https://github.com/tvarg/Urban-FlowUnet.git
中文: 本研究提出了一种基于U-Net架构的深度神经网络方法,能快速精准预测城市冠层风场,将计算时间从数小时缩短至秒级,为营造舒适安全的城市环境提供了高效解决方案。
English: This study introduces a deep neural network using U-Net architecture to rapidly and accurately predict urban wind flow fields, reducing computation time from hours to seconds while maintaining high accuracy compared to traditional methods.
Authors:Mingjin Zeng, Nan Ouyang, Wenkang Wan, Lei Ao, Qing Cai, Kai Sheng
Abstract:
Trajectory prediction for multi-agent interaction scenarios is a crucial challenge. Most advanced methods model agent interactions by efficiently factorized attention based on the temporal and agent axes. However, this static and foward modeling lacks explicit interactive spatio-temporal coordination, capturing only obvious and immediate behavioral intentions. Alternatively, the modern trajectory prediction framework refines the successive predictions by a fixed-anchor selection strategy, which is difficult to adapt in different future environments. It is acknowledged that human drivers dynamically adjust initial driving decisions based on further assumptions about the intentions of surrounding vehicles. Motivated by human driving behaviors, this paper proposes ILNet, a multi-agent trajectory prediction method with Inverse Learning (IL) attention and Dynamic Anchor Selection (DAS) module. IL Attention employs an inverse learning paradigm to model interactions at neighboring moments, introducing proposed intentions to dynamically encode the spatio-temporal coordination of interactions, thereby enhancing the model's ability to capture complex interaction patterns. Then, the learnable DAS module is proposed to extract multiple trajectory change keypoints as anchors in parallel with almost no increase in parameters. Experimental results show that the ILNet achieves state-of-the-art performance on the INTERACTION and Argoverse motion forecasting datasets. Particularly, in challenged interaction scenarios, ILNet achieves higher accuracy and more multimodal distributions of trajectories over fewer parameters. Our codes are available at https://github.com/mjZeng11/ILNet.
中文摘要:本文提出ILNet多智能体轨迹预测方法,通过逆向学习注意力机制和动态锚点选择模块,有效建模交互时空协调并适应不同环境,以更少参数实现了最优性能。
English Summary: The paper introduces ILNet, a multi-agent trajectory prediction method that uses Inverse Learning attention and a Dynamic Anchor Selection module to better model spatio-temporal interactions and adapt to different environments, achieving state-of-the-art performance with fewer parameters.
Authors:Huisheng Wang, Zhuoshi Pan, Hangjing Zhang, Mingxiao Liu, Hanqing Gao, H. Vicky Zhao
Abstract:
Aligning Large Language Models (LLMs) with investor decision-making processes under herd behavior is a critical challenge in behavioral finance, which grapples with a fundamental limitation: the scarcity of real-user data needed for Supervised Fine-Tuning (SFT). While SFT can bridge the gap between LLM outputs and human behavioral patterns, its reliance on massive authentic data imposes substantial collection costs and privacy risks. We propose InvestAlign, a novel framework that constructs high-quality SFT datasets by leveraging theoretical solutions to similar and simple optimal investment problems rather than complex scenarios. Our theoretical analysis demonstrates that training LLMs with InvestAlign-generated data achieves faster parameter convergence than using real-user data, suggesting superior learning efficiency. Furthermore, we develop InvestAgent, an LLM agent fine-tuned with InvestAlign, which demonstrates significantly closer alignment to real-user data than pre-SFT models in both simple and complex investment problems. This highlights our proposed InvestAlign as a promising approach with the potential to address complex optimal investment problems and align LLMs with investor decision-making processes under herd behavior. Our code is publicly available at https://github.com/thu-social-network-research-group/InvestAlign.
中文摘要:InvestAlign框架通过利用简单投资问题的理论解生成高质量监督微调数据集,解决了在羊群效应下将大语言模型与投资者决策对齐的难题,相比传统方法实现了更快的参数收敛和更接近真实用户数据的对齐效果。
English Summary: The InvestAlign framework addresses the challenge of aligning LLMs with investor herd behavior by generating high-quality SFT datasets from theoretical solutions to simple investment problems, achieving faster convergence and closer alignment to real-user data than traditional methods.
Authors:Yunrui Zhang, Gustavo Batista, Salil S. Kanhere
Abstract:
Deep neural networks often produce miscalibrated probability estimates, leading to overconfident predictions. A common approach for calibration is fitting a post-hoc calibration map on unseen validation data that transforms predicted probabilities. A key desirable property of the calibration map is instance-wise monotonicity (i.e., preserving the ranking of probability outputs). However, most existing post-hoc calibration methods do not guarantee monotonicity. Previous monotonic approaches either use an under-parameterized calibration map with limited expressive ability or rely on black-box neural networks, which lack interpretability and robustness. In this paper, we propose a family of novel monotonic post-hoc calibration methods, which employs a constrained calibration map parameterized linearly with respect to the number of classes. Our proposed approach ensures expressiveness, robustness, and interpretability while preserving the relative ordering of the probability output by formulating the proposed calibration map as a constrained optimization problem. Our proposed methods achieve state-of-the-art performance across datasets with different deep neural network models, outperforming existing calibration methods while being data and computation-efficient. Our code is available at https://github.com/YunruiZhang/Calibration-by-Constrained-Transformation
中文摘要:本文提出了一种新颖的单调后处理校准方法,通过将校准映射构建为约束优化问题,在保证表达能力、鲁棒性和可解释性的同时,实现了跨数据集和模型的最优校准性能。
English Summary: This paper introduces a novel family of monotonic post-hoc calibration methods that ensure expressiveness, robustness, and interpretability by formulating calibration maps as constrained optimization problems, achieving state-of-the-art performance across various datasets and models.
Authors:Fuhuan Li, Zhihui Du, David A. Bader
Abstract:
The rise of graph data in various fields calls for efficient and scalable community detection algorithms. In this paper, we present parallel implementations of two widely used algorithms: Label Propagation and Louvain, specifically designed to leverage the capabilities of Arachne, which is a Python-accessible open-source framework for large-scale graph analysis. Our implementations achieve substantial speedups over existing Python-based tools like NetworkX and igraph, which lack efficient parallelization, and are competitive with parallel frameworks such as NetworKit. Experimental results show that Arachne-based methods outperform these baselines, achieving speedups of up to 710x over NetworkX, 75x over igraph, and 12x over NetworKit. Additionally, we analyze the scalability of our implementation under varying thread counts, demonstrating how different phases contribute to overall performance gains of the parallel Louvain algorithm. Arachne, including our community detection implementation, is open-source and available at https://github.com/Bears-R-Us/arkouda-njit .
中文: 本文基于Arachne框架实现了标签传播与Louvain社区发现算法的并行版本,相比NetworkX提速高达710倍,性能优于现有工具,并通过线程扩展性分析展示了并行Louvain算法的性能提升机制。
English: This paper introduces parallel implementations of Label Propagation and Louvain community detection algorithms using the Arachne framework, achieving speedups up to 710x over NetworkX and demonstrating competitive performance with other tools while analyzing scalability across thread counts.
Authors:Atieh Barati Nia, Mohammad Dindoost, David A. Bader
Abstract:
Large Language Models (LLMs) are increasingly used to automate software development, yet most prior evaluations focus on functional correctness or high-level languages such as Python. As one of the first systematic explorations of LLM-assisted software performance engineering, we present a comprehensive study of LLMs' ability to generate efficient C implementations of graph-analysis routines -- code that must satisfy stringent runtime and memory constraints. This emerging field of LLM-assisted algorithm engineering holds significant promise, as these models may possess the capability to design novel approaches that improve existing algorithms and their implementations. Eight state-of-the-art models (OpenAI ChatGPT o3 and o4-mini-high, Anthropic Claude 4 Sonnet and Sonnet Extended, Google Gemini 2.5 Flash and Pro, xAI Grok 3-Think, and DeepSeek DeepThink R1) are benchmarked using two distinct approaches. The first approach evaluates the ability of LLMs to generate algorithms that outperform existing benchmarks. The second approach assesses their capability to generate graph algorithms for integration into performance-critical systems. The results show that Claude Sonnet 4 Extended achieves superior performance in ready-to-use code generation and efficiency, outperforming human-written baselines in triangle counting. Although our findings demonstrate that contemporary LLMs excel in optimizing and integrating established algorithms, the potential for these models to eventually invent transformative algorithmic techniques represents a compelling frontier for future research. We provide prompts, generated code, and measurement scripts to promote reproducible research in this rapidly evolving domain. All of the source code is available on GitHub at https://github.com/Bader-Research/LLM-triangle-counting/.
中文摘要:本研究评估了八种先进大语言模型生成高效C语言图算法的能力,发现Claude Sonnet 4 Extended在生成可直接使用的性能优化代码方面表现最佳,在三角形计数任务中超越了人工编写的基准代码。
English Summary: This study evaluates eight advanced Large Language Models for generating efficient C implementations of graph algorithms, finding Claude Sonnet 4 Extended most effective at producing performance-optimized code that surpasses human baselines in triangle counting tasks.
Authors:Rafiu Adekoya Badekale, Adewale Akinfaderin
Abstract:
Understanding how policy language evolves over time is critical for assessing global responses to complex challenges such as climate change. Temporal analysis helps stakeholders, including policymakers and researchers, to evaluate past priorities, identify emerging themes, design governance strategies, and develop mitigation measures. Traditional approaches, such as manual thematic coding, are time-consuming and limited in capturing the complex, interconnected nature of global policy discourse. With the increasing relevance of unsupervised machine learning, these limitations can be addressed, particularly under high-volume, complex, and high-dimensional data conditions. In this work, we explore a novel approach that applies the dynamic embedded topic model (DETM) to analyze the evolution of global climate policy discourse. A probabilistic model designed to capture the temporal dynamics of topics over time. We collected a corpus of United Nations Framework Convention on Climate Change (UNFCCC) policy decisions from 1995 to 2023, excluding 2020 due to the postponement of COP26 as a result of the COVID-19 pandemic. The model reveals shifts from early emphases on greenhouse gases and international conventions to recent focuses on implementation, technical collaboration, capacity building, finance, and global agreements. Section 3 presents the modeling pipeline, including preprocessing, model training, and visualization of temporal word distributions. Our results show that DETM is a scalable and effective tool for analyzing the evolution of global policy discourse. Section 4 discusses the implications of these findings and we concluded with future directions and refinements to extend this approach to other policy domains.
中文: 本研究采用动态嵌入主题模型分析1995-2023年联合国气候变化框架公约政策文件,揭示了全球气候政策从温室气体讨论向实施合作主题的演变趋势。
English: This study introduces a dynamic embedded topic model (DETM) to effectively analyze the evolution of global climate policy discourse in UNFCCC documents from 1995-2023, revealing thematic shifts from greenhouse gas discussions to implementation-focused topics.
Authors:Niloy Sikder, Paul Zerr, Mahdad Jafarzadeh Esfahani, Martin Dresler, Matthias Krauledat
Abstract:
Electroencephalography (EEG) allows monitoring of brain activity, providing insights into the functional dynamics of various brain regions and their roles in cognitive processes. EEG is a cornerstone in sleep research, serving as the primary modality of polysomnography, the gold standard in the field. However, EEG signals are prone to artifacts caused by both internal (device-specific) factors and external (environmental) interferences. As sleep studies are becoming larger, most rely on automatic sleep staging, a process highly susceptible to artifacts, leading to erroneous sleep scores. This paper addresses this challenge by introducing eegFloss, an open-source Python package to utilize eegUsability, a novel machine learning (ML) model designed to detect segments with artifacts in sleep EEG recordings. eegUsability has been trained and evaluated on manually artifact-labeled EEG data collected from 15 participants over 127 nights using the Zmax headband. It demonstrates solid overall classification performance (F1-score is approximately 0.85, Cohens kappa is 0.78), achieving a high recall rate of approximately 94% in identifying channel-wise usable EEG data, and extends beyond Zmax. Additionally, eegFloss offers features such as automatic time-in-bed detection using another ML model named eegMobility, filtering out certain artifacts, and generating hypnograms and sleep statistics. By addressing a fundamental challenge faced by most sleep studies, eegFloss can enhance the precision and rigor of their analysis as well as the accuracy and reliability of their outcomes.
中文: 脑电图在睡眠研究中至关重要但易受伪迹干扰,影响自动分期准确性,因此本文推出开源工具eegFloss,通过新型机器学习模型检测伪迹,提升睡眠分析的精确度和可靠性。
English: EEG is vital for sleep research but prone to artifacts that disrupt automated staging, so this paper introduces eegFloss, an open-source Python package using a novel ML model to detect artifacts and improve analysis accuracy and reliability.
Authors:Yize Chen, Baosen Zhang
Abstract:
Recent boom in foundation models and AI computing have raised growing concerns on the power and energy trajectories of large-scale data centers. This paper focuses on the voltage issues caused by volatile and intensity of data center power demand, which also aligns with recent observations of more frequent voltage disturbances in power grids. To address these data center integration challenges, we propose a dynamic voltage control scheme by harnessing data center's load regulation capabilities. By taking local voltage measurements and adjusting power injections at each data center buses through the dynamic voltage and frequency scaling (DVFS) scheme, we are able to maintain safe voltage magnitude in a distributed fashion with higher data center computing load. Simulations using real large language model (LLM) inference load validate the effectiveness of our proposed mechanism. Both the LLM power data and proposed control scheme are open sourced.
中文: 本文针对数据中心能耗波动引发的电网电压问题,提出一种分布式动态电压控制方案,通过本地测量和动态电压频率调整技术维持安全电压,并利用真实大语言模型推理负载验证了其有效性。
English: This paper addresses voltage instability in power grids caused by the fluctuating energy demands of data centers by proposing a distributed dynamic voltage control scheme that uses local measurements and DVFS to maintain safe voltage levels, validated through simulations with real LLM inference loads.
Authors:Emerson P. Grabke, Babak Taati, Masoom A. Haider
Abstract:
Objective: Latent diffusion models (LDMs) could mitigate data scarcity challenges affecting machine learning development for medical image interpretation. The recent CCELLA LDM improved prostate cancer detection performance using synthetic MRI for classifier training but was limited to the axial T2-weighted (AxT2) sequence, did not investigate inter-institutional domain shift, and prioritized radiology over histopathology outcomes. We propose CCELLA++ to address these limitations and improve clinical utility. Methods: CCELLA++ expands CCELLA for simultaneous biparametric prostate MRI (bpMRI) generation, including the AxT2, high b-value diffusion series (HighB) and apparent diffusion coefficient map (ADC). Domain adaptation was investigated by pretraining classifiers on real or LDM-generated synthetic data from an internal institution, followed with fine-tuning on progressively smaller fractions of an out-of-distribution, external dataset. Results: CCELLA++ improved 3D FID for HighB and ADC but not AxT2 (0.013, 0.012, 0.063 respectively) sequences compared to CCELLA (0.060). Classifier pretraining with CCELLA++ bpMRI outperformed real bpMRI in AP and AUC for all domain adaptation scenarios. CCELLA++ pretraining achieved highest classifier performance below 50% (n=665) external dataset volume. Conclusion: Synthetic bpMRI generated by our method can improve downstream classifier generalization and performance beyond real bpMRI or CCELLA-generated AxT2-only images. Future work should seek to quantify medical image sample quality, balance multi-sequence LDM training, and condition the LDM with additional information. Significance: The proposed CCELLA++ LDM can generate synthetic bpMRI that outperforms real data for domain adaptation with a limited target institution dataset. Our code is available at https://github.com/grabkeem/CCELLA-plus-plus
中文:CCELLA++通过生成合成双参数MRI序列提升前列腺癌检测能力,在有限外部数据集微调场景下,其分类器性能与领域适应效果均优于真实数据。
English: CCELLA++ enhances prostate cancer detection by generating synthetic biparametric MRI sequences that improve classifier performance and domain adaptation, outperforming real data when fine-tuning with limited external datasets.
Authors:Kenneth Odoh
Abstract:
We present a privacy-preserving telemetry aggregation scheme. Our underlying frequency estimation routine works within the framework of differential privacy. The design philosophy follows a client-server architecture. Furthermore, the system uses a local differential privacy scheme where data gets randomized on the client before submitting the request to the resource server. This scheme allows for data analysis on de-identified data by carefully adding noise to prevent re-identification attacks, thereby facilitating public data release without compromising the identifiability of the individual record. This work further enhances privacy guarantees by leveraging Oblivious HTTP (OHTTP) to achieve increased privacy protection for data in transit that addresses pre-existing privacy vulnerabilities in raw HTTP. We provide an implementation that focuses on frequency estimation with a histogram of a known dictionary. Our resulting formulation based on OHTTP has provided stricter privacy safeguards when compared to trusting an organization to manually delete identifying information from the client's request in the ingestor as deployed in reference work~\cite{apple2017}. Code available at https://github.com/kenluck2001/miscellaneous/tree/master/src/Privacy-Preserving-Telemetry.
中文: 本文提出了一种隐私保护的遥测数据聚合方案,采用本地差分隐私和Oblivious HTTP技术,在数据传输过程中确保匿名性并防止重识别攻击,同时实现频率估计并提升隐私保障。
English: This paper introduces a privacy-preserving telemetry aggregation scheme that utilizes local differential privacy and Oblivious HTTP to ensure data anonymity during transmission and prevent re-identification, while enabling frequency estimation with enhanced security.
Authors:Zhang Li, Biao Yang, Qiang Liu, Shuo Zhang, Zhiyin Ma, Liang Yin, Linger Deng, Yabo Sun, Yuliang Liu, Xiang Bai
Abstract:
While large multi-modal models (LMMs) demonstrate promising capabilities in segmentation and comprehension, they still struggle with two limitations: inaccurate segmentation and hallucinated comprehension. These challenges stem primarily from constraints in weak visual comprehension and a lack of fine-grained perception. To alleviate these limitations, we propose LIRA, a framework that capitalizes on the complementary relationship between visual comprehension and segmentation via two key components: (1) Semantic-Enhanced Feature Extractor (SEFE) improves object attribute inference by fusing semantic and pixel-level features, leading to more accurate segmentation; (2) Interleaved Local Visual Coupling (ILVC) autoregressively generates local descriptions after extracting local features based on segmentation masks, offering fine-grained supervision to mitigate hallucinations. Furthermore, we find that the precision of object segmentation is positively correlated with the latent related semantics of the token. To quantify this relationship and the model's potential semantic inferring ability, we introduce the Attributes Evaluation (AttrEval) dataset. Our experiments show that LIRA achieves state-of-the-art performance in both segmentation and comprehension tasks. Code will be available at https://github.com/echo840/LIRA.
Chinese: LIRA框架通过融合语义增强特征提取和交错局部视觉耦合,解决了大型多模态模型在分割不准确和理解幻觉方面的局限,在两项任务中均实现了最先进的性能。
English: The LIRA framework addresses the limitations of inaccurate segmentation and hallucinated comprehension in large multi-modal models by integrating semantic-enhanced feature extraction and interleaved local visual coupling, achieving state-of-the-art performance in both tasks.
Authors:Ali Nasiri-Sarvi, Hassan Rivaz, Mahdi S. Hosseini
Abstract:
Understanding how different AI models encode the same high-level concepts, such as objects or attributes, remains challenging because each model typically produces its own isolated representation. Existing interpretability methods like Sparse Autoencoders (SAEs) produce latent concepts individually for each model, resulting in incompatible concept spaces and limiting cross-model interpretability. To address this, we introduce SPARC (Sparse Autoencoders for Aligned Representation of Concepts), a new framework that learns a single, unified latent space shared across diverse architectures and modalities (e.g., vision models like DINO, and multimodal models like CLIP). SPARC's alignment is enforced through two key innovations: (1) a Global TopK sparsity mechanism, ensuring all input streams activate identical latent dimensions for a given concept; and (2) a Cross-Reconstruction Loss, which explicitly encourages semantic consistency between models. On Open Images, SPARC dramatically improves concept alignment, achieving a Jaccard similarity of 0.80, more than tripling the alignment compared to previous methods. SPARC creates a shared sparse latent space where individual dimensions often correspond to similar high-level concepts across models and modalities, enabling direct comparison of how different architectures represent identical concepts without requiring manual alignment or model-specific analysis. As a consequence of this aligned representation, SPARC also enables practical applications such as text-guided spatial localization in vision-only models and cross-model/cross-modal retrieval. Code and models are available at https://github.com/AtlasAnalyticsLab/SPARC.
中文: SPARC框架通过全局稀疏性和跨模型重构损失,为不同架构的AI模型创建了统一的稀疏潜在空间,显著提升了概念对齐度,实现了跨模型的直接概念比较和实际应用。
English: SPARC introduces a unified framework that creates a shared sparse latent space across diverse AI models, enabling direct comparison of concept representations and significantly improving alignment through global sparsity and cross-reconstruction mechanisms.
Authors:Jiangzhong Cao, Zekai Zeng, Xu Zhang, Huan Zhang, Chunling Fan, Gangyi Jiang, Weisi Lin
Abstract:
High-quality underwater images are essential for both machine vision tasks and viewers with their aesthetic appeal.However, the quality of underwater images is severely affected by light absorption and scattering. Deep learning-based methods for Underwater Image Enhancement (UIE) have achieved good performance. However, these methods often overlook considering human perception and lack sufficient constraints within the solution space. Consequently, the enhanced images often suffer from diminished perceptual quality or poor content restoration.To address these issues, we propose a UIE method with a Contrastive Language-Image Pre-Training (CLIP) perception loss module and curriculum contrastive regularization. Above all, to develop a perception model for underwater images that more aligns with human visual perception, the visual semantic feature extraction capability of the CLIP model is leveraged to learn an appropriate prompt pair to map and evaluate the quality of underwater images. This CLIP perception model is then incorporated as a perception loss module into the enhancement network to improve the perceptual quality of enhanced images. Furthermore, the CLIP perception model is integrated with the curriculum contrastive regularization to enhance the constraints imposed on the enhanced images within the CLIP perceptual space, mitigating the risk of both under-enhancement and over-enhancement. Specifically, the CLIP perception model is employed to assess and categorize the learning difficulty level of negatives in the regularization process, ensuring comprehensive and nuanced utilization of distorted images and negatives with varied quality levels. Extensive experiments demonstrate that our method outperforms state-of-the-art methods in terms of visual quality and generalization ability.
Chinese: 本研究提出了一种结合CLIP感知损失模块和课程对比正则化的水下图像增强方法,通过更贴合人类视觉感知的方式提升图像质量,有效避免增强不足或过度的问题。
English: This study introduces an underwater image enhancement method that integrates a CLIP-based perception loss module and curriculum contrastive regularization to better align with human visual perception and improve image quality by addressing under- and over-enhancement issues.
Authors:Keyan Chen, Chenyang Liu, Bowen Chen, Jiafan Zhang, Zhengxia Zou, Zhenwei Shi
Abstract:
Referring Remote Sensing Image Segmentation provides a flexible and fine-grained framework for remote sensing scene analysis via vision-language collaborative interpretation. Current approaches predominantly utilize a three-stage pipeline encompassing dual-modal encoding, cross-modal interaction, and pixel decoding. These methods demonstrate significant limitations in managing complex semantic relationships and achieving precise cross-modal alignment, largely due to their coupled processing mechanism that conflates target localization with boundary delineation. This architectural coupling amplifies error propagation under semantic ambiguity while restricting model generalizability and interpretability. To address these issues, we propose RSRefSeg 2, a decoupling paradigm that reformulates the conventional workflow into a collaborative dual-stage framework: coarse localization followed by fine segmentation. RSRefSeg 2 integrates CLIP's cross-modal alignment strength with SAM's segmentation generalizability through strategic foundation model collaboration. Specifically, CLIP is employed as the dual-modal encoder to activate target features within its pre-aligned semantic space and generate localization prompts. To mitigate CLIP's misactivation challenges in multi-entity scenarios described by referring texts, a cascaded second-order prompter is devised, which enhances precision through implicit reasoning via decomposition of text embeddings into complementary semantic subspaces. These optimized semantic prompts subsequently direct the SAM to generate pixel-level refined masks, thereby completing the semantic transmission pipeline. Extensive experiments (RefSegRS, RRSIS-D, and RISBench) demonstrate that RSRefSeg 2 surpasses contemporary methods in segmentation accuracy (+~3% gIoU) and complex semantic interpretation. Code is available at: https://github.com/KyanChen/RSRefSeg2.
中文摘要:RSRefSeg 2通过解耦的双阶段框架,将CLIP的跨模态对齐能力与SAM的分割泛化性相结合,显著提升了遥感图像分割精度和复杂语义理解能力。
English Summary: RSRefSeg 2 introduces a decoupled dual-stage framework that combines CLIP's cross-modal alignment with SAM's segmentation capabilities to significantly improve remote sensing image segmentation accuracy and semantic interpretation.
Authors:Aleksandar JevtiÄ, Christoph Reich, Felix Wimbauer, Oliver Hahn, Christian Rupprecht, Stefan Roth, Daniel Cremers
Abstract:
Semantic scene completion (SSC) aims to infer both the 3D geometry and semantics of a scene from single images. In contrast to prior work on SSC that heavily relies on expensive ground-truth annotations, we approach SSC in an unsupervised setting. Our novel method, SceneDINO, adapts techniques from self-supervised representation learning and 2D unsupervised scene understanding to SSC. Our training exclusively utilizes multi-view consistency self-supervision without any form of semantic or geometric ground truth. Given a single input image, SceneDINO infers the 3D geometry and expressive 3D DINO features in a feed-forward manner. Through a novel 3D feature distillation approach, we obtain unsupervised 3D semantics. In both 3D and 2D unsupervised scene understanding, SceneDINO reaches state-of-the-art segmentation accuracy. Linear probing our 3D features matches the segmentation accuracy of a current supervised SSC approach. Additionally, we showcase the domain generalization and multi-view consistency of SceneDINO, taking the first steps towards a strong foundation for single image 3D scene understanding.
中文: SceneDINO是一种新颖的无监督语义场景补全方法,利用自监督学习和多视角一致性从单张图像推断三维几何和语义,无需真实标注,实现了最先进的分割精度。
English: SceneDINO is a novel unsupervised method for semantic scene completion that leverages self-supervised learning and multi-view consistency to infer 3D geometry and semantics from single images without ground-truth annotations, achieving state-of-the-art segmentation accuracy.
Authors:Zhiyuan Peng, Ting-ruen Wei, Tingyu Song, Yilun Zhao
Abstract:
Large Language Models (LLMs) have recently been applied to reranking tasks in information retrieval, achieving strong performance. However, their high computational demands often hinder practical deployment. Existing studies evaluate the efficiency of LLM-based rerankers using proxy metrics such as latency, the number of forward passes, input tokens, and output tokens. However, these metrics depend on hardware and running-time choices (\eg parallel or not, batch size, etc), and often fail to account for model size, making it difficult to interpret and obscuring the evaluation of the efficiency-effectiveness tradeoff. To address this issue, we propose \ours\footnote{https://github.com/zhiyuanpeng/EER-FLOPs.} for LLM-based rerankers: RPP (ranking metrics per PetaFLOP), measuring how much ranking quality (e.g., NDCG or MRR) a method achieves per PetaFLOP, and QPP (queries per PetaFLOP), measuring how many queries can be processed per PetaFLOP. Accompanied by the new metrics, an interpretable FLOPs estimator is developed to estimate the FLOPs of an LLM-based reranker even without running any experiments. Based on the proposed metrics, we conduct comprehensive experiments to evaluate a wide range of LLM-based rerankers with different architectures, studying the efficiency-effectiveness trade-off and bringing this issue to the attention of the research community.
中文: 本文针对现有评估指标在衡量基于大语言模型的排序器效率时的不足,提出了RPP和QPP两项新指标,通过每PetaFLOP的排序质量和查询处理量来更准确地评估效率与效果的平衡,并开发了可解释的FLOPs估算器。
English: This paper introduces two new efficiency metrics, RPP and QPP, to better evaluate the trade-off between performance and computational cost in LLM-based rerankers, addressing the limitations of existing proxy metrics and providing an interpretable FLOPs estimator for practical assessment.
Authors:Modi Shi, Li Chen, Jin Chen, Yuxiang Lu, Chiming Liu, Guanghui Ren, Ping Luo, Di Huang, Maoqing Yao, Hongyang Li
Abstract:
Data scaling has driven remarkable success in foundation models for Natural Language Processing (NLP) and Computer Vision (CV), yet the principles of effective data scaling in robotic manipulation remain insufficiently understood. In this work, we investigate the nuanced role of data diversity in robot learning by examining three critical dimensions-task (what to do), embodiment (which robot to use), and expert (who demonstrates)-challenging the conventional intuition of "more diverse is better". Throughout extensive experiments on various robot platforms, we reveal that (1) task diversity proves more critical than per-task demonstration quantity, benefiting transfer from diverse pre-training tasks to novel downstream scenarios; (2) multi-embodiment pre-training data is optional for cross-embodiment transfer-models trained on high-quality single-embodiment data can efficiently transfer to different platforms, showing more desirable scaling property during fine-tuning than multi-embodiment pre-trained models; and (3) expert diversity, arising from individual operational preferences and stochastic variations in human demonstrations, can be confounding to policy learning, with velocity multimodality emerging as a key contributing factor. Based on this insight, we propose a distribution debiasing method to mitigate velocity ambiguity, the yielding GO-1-Pro achieves substantial performance gains of 15%, equivalent to using 2.5 times pre-training data. Collectively, these findings provide new perspectives and offer practical guidance on how to scale robotic manipulation datasets effectively.
中文: 本研究挑战机器人操作中"越多样越好"的传统认知,揭示任务多样性对迁移学习最为关键,单平台数据即可实现高效跨平台适应,而专家多样性会因操作速度差异干扰策略学习,据此提出的去偏方法使性能提升15%。
English: This study challenges the "more diverse is better" assumption in robotic manipulation by revealing that task diversity is most critical for transfer learning, single-embodiment data enables efficient cross-platform adaptation, and expert diversity can hinder performance due to velocity variations, leading to a debiasing method that boosts performance by 15%.
Authors:Ayush Parikh, Hoang Thanh Thanh Truong, Jeanette Schofield, Maximilian Heil
Abstract:
In this paper, we, as the DS@GT team for CLEF 2025 CheckThat! Task 4a Scientific Web Discourse Detection, present the methods we explored for this task. For this multiclass classification task, we determined if a tweet contained a scientific claim, a reference to a scientific study or publication, and/or mentions of scientific entities, such as a university or a scientist. We present 3 modeling approaches for this task: transformer finetuning, few-shot prompting of LLMs, and a combined ensemble model whose design was informed by earlier experiments. Our team placed 7th in the competition, achieving a macro-averaged F1 score of 0.8611, an improvement over the DeBERTaV3 baseline of 0.8375. Our code is available on Github at https://github.com/dsgt-arc/checkthat-2025-swd/tree/main/subtask-4a.
中文: DS@GT团队针对推文中的科学话语检测任务,探索了三种建模方法——Transformer微调、大语言模型的少样本提示以及集成模型,在CLEF 2025竞赛中获得第七名,其宏观平均F1分数达到0.8611。
English: The DS@GT team developed three modeling approaches—transformer fine-tuning, few-shot prompting of LLMs, and an ensemble model—for detecting scientific discourse in tweets, achieving a 7th-place finish with a macro-averaged F1 score of 0.8611 in the CLEF 2025 competition.
Authors:Rui-Jie Zhu, Tianhao Peng, Tianhao Cheng, Xingwei Qu, Jinfa Huang, Dawei Zhu, Hao Wang, Kaiwen Xue, Xuanliang Zhang, Yong Shan, Tianle Cai, Taylor Kergan, Assel Kembay, Andrew Smith, Chenghua Lin, Binh Nguyen, Yuqi Pan, Yuhong Chou, Zefan Cai, Zhenhe Wu, Yongchi Zhao, Tianyu Liu, Jian Yang, Wangchunshu Zhou, Chujie Zheng, Chongxuan Li, Yuyin Zhou, Zhoujun Li, Zhaoxiang Zhang, Jiaheng Liu, Ge Zhang, Wenhao Huang, Jason Eshraghian
Abstract:
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, especially when guided by explicit chain-of-thought (CoT) reasoning that verbalizes intermediate steps. While CoT improves both interpretability and accuracy, its dependence on natural language reasoning limits the model's expressive bandwidth. Latent reasoning tackles this bottleneck by performing multi-step inference entirely in the model's continuous hidden state, eliminating token-level supervision. To advance latent reasoning research, this survey provides a comprehensive overview of the emerging field of latent reasoning. We begin by examining the foundational role of neural network layers as the computational substrate for reasoning, highlighting how hierarchical representations support complex transformations. Next, we explore diverse latent reasoning methodologies, including activation-based recurrence, hidden state propagation, and fine-tuning strategies that compress or internalize explicit reasoning traces. Finally, we discuss advanced paradigms such as infinite-depth latent reasoning via masked diffusion models, which enable globally consistent and reversible reasoning processes. By unifying these perspectives, we aim to clarify the conceptual landscape of latent reasoning and chart future directions for research at the frontier of LLM cognition. An associated GitHub repository collecting the latest papers and repos is available at: https://github.com/multimodal-art-projection/LatentCoT-Horizon/.
大型语言模型正通过开发潜在推理方法超越显式思维链推理,在连续隐藏状态中完全执行多步推理,从而实现更高效复杂的认知过程。
Large Language Models are advancing beyond explicit chain-of-thought reasoning by developing latent reasoning methods that perform multi-step inference entirely within continuous hidden states, enabling more efficient and complex cognitive processes.
Authors:Dylan Bouchard, Mohit Singh Chauhan, David Skarbrevik, Ho-Kyeong Ra, Viren Bajaj, Zeya Ahmad
Abstract:
Hallucinations, defined as instances where Large Language Models (LLMs) generate false or misleading content, pose a significant challenge that impacts the safety and trust of downstream applications. We introduce UQLM, a Python package for LLM hallucination detection using state-of-the-art uncertainty quantification (UQ) techniques. This toolkit offers a suite of UQ-based scorers that compute response-level confidence scores ranging from 0 to 1. This library provides an off-the-shelf solution for UQ-based hallucination detection that can be easily integrated to enhance the reliability of LLM outputs.
中文:UQLM是一个基于不确定性量化技术的Python工具包,通过计算置信度分数来检测大语言模型的幻觉问题,从而提升模型输出的可靠性。
English: UQLM is a Python toolkit that employs uncertainty quantification techniques to detect hallucinations in Large Language Models by providing confidence scores, thereby improving output reliability.
Authors:Maximilian Heil, Aleksandar Pramov
Abstract:
Numerical claims, statements involving quantities, comparisons, and temporal references, pose unique challenges for automated fact-checking systems. In this study, we evaluate modeling strategies for veracity prediction of such claims using the QuanTemp dataset and building our own evidence retrieval pipeline. We investigate three key factors: (1) the impact of more evidences with longer input context windows using ModernBERT, (2) the effect of right-to-left (R2L) tokenization, and (3) their combined influence on classification performance. Contrary to prior findings in arithmetic reasoning tasks, R2L tokenization does not boost natural language inference (NLI) of numerical tasks. A longer context window does also not enhance veracity performance either, highlighting evidence quality as the dominant bottleneck. Our best-performing system achieves competitive macro-average F1 score of 0.57 and places us among the Top-4 submissions in Task 3 of CheckThat! 2025. Our code is available at https://github.com/dsgt-arc/checkthat-2025-numerical.
中文: 本研究使用QuanTemp数据集评估数值声明验证的建模策略,发现右向左分词和更长上下文窗口均未提升性能,证据质量是主要瓶颈,同时取得了0.57的竞争性F1分数。
English: This study evaluates modeling strategies for numerical claim verification using the QuanTemp dataset, finding that neither right-to-left tokenization nor longer context windows improve performance, with evidence quality being the key bottleneck, while achieving a competitive F1 score of 0.57.
Authors:Maximilian Heil, Dionne Bang
Abstract:
This paper presents our submission to Task 1, Subjectivity Detection, of the CheckThat! Lab at CLEF 2025. We investigate the effectiveness of transfer-learning and stylistic data augmentation to improve classification of subjective and objective sentences in English news text. Our approach contrasts fine-tuning of pre-trained encoders and transfer-learning of fine-tuned transformer on related tasks. We also introduce a controlled augmentation pipeline using GPT-4o to generate paraphrases in predefined subjectivity styles. To ensure label and style consistency, we employ the same model to correct and refine the generated samples. Results show that transfer-learning of specified encoders outperforms fine-tuning general-purpose ones, and that carefully curated augmentation significantly enhances model robustness, especially in detecting subjective content. Our official submission placed us $16^{th}$ of 24 participants. Overall, our findings underscore the value of combining encoder specialization with label-consistent augmentation for improved subjectivity detection. Our code is available at https://github.com/dsgt-arc/checkthat-2025-subject.
中文总结: 本研究通过迁移学习和GPT-4o风格数据增强改进英文新闻主观性检测,发现专用编码器与精选数据增强相结合能显著提升模型性能。
English Summary: This study explores transfer learning and stylistic data augmentation using GPT-4o to enhance subjectivity detection in English news, finding that specialized encoders with curated data augmentation significantly improve model performance.
Authors:Haoyu Wang, Lei Zhang, Wei Wei, Chen Ding, Yanning Zhang
Abstract:
Diffusion models has underpinned much recent advances of dataset augmentation in various computer vision tasks. However, when involving generating multi-object images as real scenarios, most existing methods either rely entirely on text condition, resulting in a deviation between the generated objects and the original data, or rely too much on the original images, resulting in a lack of diversity in the generated images, which is of limited help to downstream tasks. To mitigate both problems with one stone, we propose a prompt-free conditional diffusion framework for multi-object image augmentation. Specifically, we introduce a local-global semantic fusion strategy to extract semantics from images to replace text, and inject knowledge into the diffusion model through LoRA to alleviate the category deviation between the original model and the target dataset. In addition, we design a reward model based counting loss to assist the traditional reconstruction loss for model training. By constraining the object counts of each category instead of pixel-by-pixel constraints, bridging the quantity deviation between the generated data and the original data while improving the diversity of the generated data. Experimental results demonstrate the superiority of the proposed method over several representative state-of-the-art baselines and showcase strong downstream task gain and out-of-domain generalization capabilities. Code is available at \href{https://github.com/00why00/PFCD}{here}.
中文: 本文提出了一种无需提示的条件扩散框架,通过局部-全局语义融合策略替代文本提示,并结合计数损失来增强多目标图像生成的多样性和数据对齐性,在下游任务中展现出卓越性能。
English: This paper introduces a prompt-free conditional diffusion framework that enhances multi-object image augmentation by replacing text prompts with a local-global semantic fusion strategy and incorporating a counting loss to improve diversity and alignment with original data, demonstrating superior performance in downstream tasks.
Authors:Zhihao Chen, Tao Chen, Chenhui Wang, Qi Gao, Huidong Xie, Chuang Niu, Ge Wang, Hongming Shan
Abstract:
Low-dose computed tomography (LDCT) reduces radiation exposure but often degrades image quality, potentially compromising diagnostic accuracy. Existing deep learning-based denoising methods focus primarily on pixel-level mappings, overlooking the potential benefits of high-level semantic guidance. Recent advances in vision-language models (VLMs) suggest that language can serve as a powerful tool for capturing structured semantic information, offering new opportunities to improve LDCT reconstruction. In this paper, we introduce LangMamba, a Language-driven Mamba framework for LDCT denoising that leverages VLM-derived representations to enhance supervision from normal-dose CT (NDCT). LangMamba follows a two-stage learning strategy. First, we pre-train a Language-guided AutoEncoder (LangAE) that leverages frozen VLMs to map NDCT images into a semantic space enriched with anatomical information. Second, we synergize LangAE with two key components to guide LDCT denoising: Semantic-Enhanced Efficient Denoiser (SEED), which enhances NDCT-relevant local semantic while capturing global features with efficient Mamba mechanism, and Language-engaged Dual-space Alignment (LangDA) Loss, which ensures that denoised images align with NDCT in both perceptual and semantic spaces. Extensive experiments on two public datasets demonstrate that LangMamba outperforms conventional state-of-the-art methods, significantly improving detail preservation and visual fidelity. Remarkably, LangAE exhibits strong generalizability to unseen datasets, thereby reducing training costs. Furthermore, LangDA loss improves explainability by integrating language-guided insights into image reconstruction and offers a plug-and-play fashion. Our findings shed new light on the potential of language as a supervisory signal to advance LDCT denoising. The code is publicly available on https://github.com/hao1635/LangMamba.
中文: LangMamba提出了一种新颖的语言驱动框架,利用视觉语言模型增强语义指导,在低剂量CT去噪中显著提升细节保留和泛化能力,同时降低训练成本。
English: LangMamba introduces a novel language-driven framework for LDCT denoising that leverages vision-language models to enhance semantic guidance, outperforming existing methods in detail preservation and generalizability while reducing training costs.
Authors:Yimeng Bai, Yang Zhang, Sihao Ding, Shaohui Ruan, Han Yao, Danhui Guan, Fuli Feng, Tat-Seng Chua
Abstract:
Diffusion models, known for their generative ability to simulate data creation through noise-adding and denoising processes, have emerged as a promising approach for building generative recommenders. To incorporate user history for personalization, existing methods typically adopt a conditional diffusion framework, where the reverse denoising process of reconstructing items from noise is modified to be conditioned on the user history. However, this design may fail to fully utilize historical information, as it gets distracted by the need to model the "item $\leftrightarrow$ noise" translation. This motivates us to reformulate the diffusion process for sequential recommendation in an unconditional manner, treating user history (instead of noise) as the endpoint of the forward diffusion process (i.e., the starting point of the reverse process), rather than as a conditional input. This formulation allows for exclusive focus on modeling the "item $\leftrightarrow$ history" translation. To this end, we introduce Brownian Bridge Diffusion Recommendation (BBDRec). By leveraging a Brownian bridge process, BBDRec enforces a structured noise addition and denoising mechanism, ensuring that the trajectories are constrained towards a specific endpoint -- user history, rather than noise. Extensive experiments demonstrate BBDRec's effectiveness in enhancing sequential recommendation performance. The source code is available at https://github.com/baiyimeng/BBDRec.
中文: 作者提出BBDRec,一种将用户历史作为扩散过程终点的无条件扩散模型,专注于项目与历史的转换,从而提升序列推荐性能。
English: The authors propose BBDRec, an unconditional diffusion model that treats user history as the endpoint of the diffusion process to focus on item-history translation, enhancing sequential recommendation performance.
Authors:Murilo Gustineli, Anthony Miyaguchi, Adrian Cheung, Divyansh Khattak
Abstract:
We describe DS@GT's second-place solution to the PlantCLEF 2025 challenge on multi-species plant identification in vegetation quadrat images. Our pipeline combines (i) a fine-tuned Vision Transformer ViTD2PC24All for patch-level inference, (ii) a 4x4 tiling strategy that aligns patch size with the network's 518x518 receptive field, and (iii) domain-prior adaptation through PaCMAP + K-Means visual clustering and geolocation filtering. Tile predictions are aggregated by majority vote and re-weighted with cluster-specific Bayesian priors, yielding a macro-averaged F1 of 0.348 (private leaderboard) while requiring no additional training. All code, configuration files, and reproducibility scripts are publicly available at https://github.com/dsgt-arc/plantclef-2025.
中文: DS@GT在PlantCLEF 2025竞赛中获得第二名的解决方案,通过微调视觉变换器结合分块策略和领域知识优化,无需额外训练即实现0.348的宏平均F1分数,所有代码均已开源。
English: DS@GT's second-place solution for PlantCLEF 2025 combines a fine-tuned Vision Transformer with tiling, clustering, and geolocation filtering, achieving a 0.348 F1 score without extra training, with all code publicly available.
Authors:Tongtong Cheng, Rongzhen Li, Yixin Xiong, Tao Zhang, Jing Wang, Kai Liu
Abstract:
Accurate driving behavior recognition and reasoning are critical for autonomous driving video understanding. However, existing methods often tend to dig out the shallow causal, fail to address spurious correlations across modalities, and ignore the ego-vehicle level causality modeling. To overcome these limitations, we propose a novel Multimodal Causal Analysis Model (MCAM) that constructs latent causal structures between visual and language modalities. Firstly, we design a multi-level feature extractor to capture long-range dependencies. Secondly, we design a causal analysis module that dynamically models driving scenarios using a directed acyclic graph (DAG) of driving states. Thirdly, we utilize a vision-language transformer to align critical visual features with their corresponding linguistic expressions. Extensive experiments on the BDD-X, and CoVLA datasets demonstrate that MCAM achieves SOTA performance in visual-language causal relationship learning. Furthermore, the model exhibits superior capability in capturing causal characteristics within video sequences, showcasing its effectiveness for autonomous driving applications. The code is available at https://github.com/SixCorePeach/MCAM.
Chinese: 提出的多模态因果分析模型(MCAM)通过构建视觉与语言模态间的潜在因果结构,结合多层次特征提取和动态场景建模,在自动驾驶视频理解中实现了最先进的性能表现。
English: The proposed Multimodal Causal Analysis Model (MCAM) overcomes limitations in existing methods by constructing latent causal structures between visual and language modalities, achieving state-of-the-art performance in autonomous driving video understanding through multi-level feature extraction and dynamic scenario modeling.
Authors:Chang Liu, Ye Pan, Chenyang Ding, Susanto Rahardja, Xiaokang Yang
Abstract:
Audio-driven emotional 3D facial animation aims to generate synchronized lip movements and vivid facial expressions. However, most existing approaches focus on static and predefined emotion labels, limiting their diversity and naturalness. To address these challenges, we propose MEDTalk, a novel framework for fine-grained and dynamic emotional talking head generation. Our approach first disentangles content and emotion embedding spaces from motion sequences using a carefully designed cross-reconstruction process, enabling independent control over lip movements and facial expressions. Beyond conventional audio-driven lip synchronization, we integrate audio and speech text, predicting frame-wise intensity variations and dynamically adjusting static emotion features to generate realistic emotional expressions. Furthermore, to enhance control and personalization, we incorporate multimodal inputs-including text descriptions and reference expression images-to guide the generation of user-specified facial expressions. With MetaHuman as the priority, our generated results can be conveniently integrated into the industrial production pipeline. The code is available at: https://github.com/SJTU-Lucy/MEDTalk.
中文: MEDTalk提出了一种新颖的动态情感三维面部动画框架,通过解耦运动序列中的内容与情感,融合音频和文本输入生成精细表情,并支持多模态控制实现个性化效果,兼容MetaHuman生产流程。
English: MEDTalk introduces a novel framework for dynamic emotional 3D facial animation by disentangling content and emotion from motion sequences, integrating audio and text inputs to generate fine-grained expressions while enabling multimodal control for personalized results compatible with MetaHuman pipelines.
Authors:Xiaohu Li, Yunfeng Ning, Zepeng Bao, Mayi Xu, Jianhao Chen, Tieyun Qian
Abstract:
Security alignment enables the Large Language Model (LLM) to gain the protection against malicious queries, but various jailbreak attack methods reveal the vulnerability of this security mechanism. Previous studies have isolated LLM jailbreak attacks and defenses. We analyze the security protection mechanism of the LLM, and propose a framework that combines attack and defense. Our method is based on the linearly separable property of LLM intermediate layer embedding, as well as the essence of jailbreak attack, which aims to embed harmful problems and transfer them to the safe area. We utilize generative adversarial network (GAN) to learn the security judgment boundary inside the LLM to achieve efficient jailbreak attack and defense. The experimental results indicate that our method achieves an average jailbreak success rate of 88.85\% across three popular LLMs, while the defense success rate on the state-of-the-art jailbreak dataset reaches an average of 84.17\%. This not only validates the effectiveness of our approach but also sheds light on the internal security mechanisms of LLMs, offering new insights for enhancing model security The code and data are available at https://github.com/NLPGM/CAVGAN.
中文摘要:本研究提出了一种基于GAN的框架,利用大语言模型中间层的线性可分嵌入特性,在提升越狱攻击成功率的同时强化防御效果,为理解模型内部安全机制提供了新视角。
English Summary: This study introduces a GAN-based framework that leverages the linearly separable embeddings in LLM intermediate layers to simultaneously enhance jailbreak attack and defense, achieving high success rates and providing insights into LLM security mechanisms.
Authors:George Barrowclough, Marian Andrecki, James Shinner, Daniele Donghi
Abstract:
In production recommender systems, feature preprocessing must be faithfully replicated across training and inference environments. This often requires duplicating logic between offline and online environments, increasing engineering effort and introducing risks of dataset shift. We present Kamae, an open-source Python library that bridges this gap by translating PySpark preprocessing pipelines into equivalent Keras models. Kamae provides a suite of configurable Spark transformers and estimators, each mapped to a corresponding Keras layer, enabling consistent, end-to-end preprocessing across the ML lifecycle. Framework's utility is illustrated on real-world use cases, including MovieLens dataset and Expedia's Learning-to-Rank pipelines. The code is available at https://github.com/ExpediaGroup/kamae.
中文: Kamae是一个开源Python库,能够将PySpark预处理流程转换为等效的Keras模型,从而在机器学习的全生命周期中实现训练与推理环境间的特征处理一致性,有效降低工程重复和数据集偏移风险。
English: Kamae is an open-source Python library that converts PySpark preprocessing pipelines into equivalent Keras models, ensuring consistent feature processing across training and inference environments to reduce engineering redundancy and dataset shift risks.
Authors:Bo Zhou, Kaijie Xu, Yinghui Quan, Mengdao Xing
Abstract:
Two-dimensional (2D) Multiple Signal Classification algorithm is a powerful technique for high-resolution direction-of-arrival (DOA) estimation in array signal processing. However, the exhaustive search over the 2D an-gular domain leads to high computa-tional cost, limiting its applicability in real-time scenarios. In this work, we reformulate the peak-finding process as a multimodal optimization prob-lem, and propose a Differential Evolu-tion algorithm with Neighborhood Mutation (DE-NM) to efficiently lo-cate multiple spectral peaks without requiring dense grid sampling. Simu-lation results demonstrate that the proposed method achieves comparable estimation accuracy to the traditional grid search, while significantly reduc-ing computation time. This strategy presents a promising solution for real-time, high-resolution DOA estimation in practical applications. The imple-mentation code is available at https://github.com/zzb-nice/DOA_multimodel_optimize.
中文: 本研究提出一种带邻域变异的差分进化算法,用于高效定位二维DOA估计中的谱峰,在保持与传统网格搜索相当精度的同时显著降低计算耗时,适用于实时应用场景。
English: This work introduces a Differential Evolution algorithm with Neighborhood Mutation to efficiently locate spectral peaks for 2D DOA estimation, achieving comparable accuracy to traditional grid search while significantly reducing computational time for real-time applications.
Authors:Lucas Fonseca Lage, Simon Ostermann
Abstract:
We introduce OpenFActScore, an open-source implementation of the FActScore framework for evaluating the factuality of text generated by large language models (LLMs). FActScore evaluates the factual accuracy of long-form text by using Atomic Fact Generation (AFG) to extract individual factual claims and Atomic Fact Validation (AFV) to verify each claim against a trusted knowledge source. While the original FActScore relies on closed-source and commercial models such as InstructGPT and ChatGPT, OpenFActScore enables the use of any Hugging Face-compatible model for both AFG and AFV. We provide a detailed technical overview of our implementation, highlighting design choices and modifications made to support open models. We evaluate multiple open-source LLMs on both AFG and AFV using the original FActScore benchmark, reporting BERTScore-F1 for AFG and Error Rate relative to human annotations for AFV. Our results show that open models can approximate the performance of closed-source systems, with Gemma achieving the best overall performance, and our final setup obtains a 0.99 Pearson correlation with the original FActScore experiments. OpenFActScore promotes transparency, reproducibility, and cost-effective evaluation, and is available at: https://github.com/lflage/OpenFActScore.
Chinese: OpenFActScore 是 FActScore 框架的开源实现,支持使用任何兼容 Hugging Face 的模型来评估大语言模型生成文本的事实准确性,其性能接近闭源系统,并促进了透明且成本效益高的事实性评估。
English: OpenFActScore is an open-source implementation of the FActScore framework that enables the use of any Hugging Face-compatible model for evaluating the factuality of text generated by large language models, achieving performance comparable to closed-source systems and promoting transparency and cost-effective evaluation.
Authors:Vera Soboleva, Aibek Alanov, Andrey Kuznetsov, Konstantin Sobolev
Abstract:
While diffusion model fine-tuning offers a powerful approach for customizing pre-trained models to generate specific objects, it frequently suffers from overfitting when training samples are limited, compromising both generalization capability and output diversity. This paper tackles the challenging yet most impactful task of adapting a diffusion model using just a single concept image, as single-image customization holds the greatest practical potential. We introduce T-LoRA, a Timestep-Dependent Low-Rank Adaptation framework specifically designed for diffusion model personalization. In our work we show that higher diffusion timesteps are more prone to overfitting than lower ones, necessitating a timestep-sensitive fine-tuning strategy. T-LoRA incorporates two key innovations: (1) a dynamic fine-tuning strategy that adjusts rank-constrained updates based on diffusion timesteps, and (2) a weight parametrization technique that ensures independence between adapter components through orthogonal initialization. Extensive experiments show that T-LoRA and its individual components outperform standard LoRA and other diffusion model personalization techniques. They achieve a superior balance between concept fidelity and text alignment, highlighting the potential of T-LoRA in data-limited and resource-constrained scenarios. Code is available at https://github.com/ControlGenAI/T-LoRA.
中文: 本文提出T-LoRA框架,通过基于时间步的动态秩调整和正交权重参数化技术,有效解决了单图像扩散模型定制中的过拟合问题,在概念保真度与文本对齐之间实现了更优平衡。
English: This paper introduces T-LoRA, a timestep-dependent low-rank adaptation framework that addresses overfitting in single-image diffusion model customization by implementing dynamic rank adjustment and orthogonal weight initialization, achieving superior balance between concept fidelity and text alignment.
Authors:Bing Wang, Ximing Li, Mengzhe Ye, Changchun Li, Bo Fu, Jianfeng Qu, Lin Yuanbo Wu
Abstract:
Nowadays, misinformation articles, especially multimodal ones, are widely spread on social media platforms and cause serious negative effects. To control their propagation, Multimodal Misinformation Detection (MMD) becomes an active topic in the community to automatically identify misinformation. Previous MMD methods focus on supervising detectors by collecting offline data. However, in real-world scenarios, new events always continually emerge, making MMD models trained on offline data consistently outdated and ineffective. To address this issue, training MMD models under online data streams is an alternative, inducing an emerging task named continual MMD. Unfortunately, it is hindered by two major challenges. First, training on new data consistently decreases the detection performance on past data, named past knowledge forgetting. Second, the social environment constantly evolves over time, affecting the generalization on future data. To alleviate these challenges, we propose to remember past knowledge by isolating interference between event-specific parameters with a Dirichlet process-based mixture-of-expert structure, and anticipate future environmental distributions by learning a continuous-time dynamics model. Accordingly, we induce a new continual MMD method DAEDCMD. Extensive experiments demonstrate that DAEDCMD can consistently and significantly outperform the compared methods, including six MMD baselines and three continual learning methods.
中文: 针对持续多模态虚假信息检测中存在的过去知识遗忘和社会环境演变挑战,提出的DAEDCMD方法通过隔离事件特定参数和学习连续时间动态模型,显著超越了现有基线方法的性能表现。
English: To address the challenges of forgetting past knowledge and adapting to evolving social environments in continual multimodal misinformation detection, the proposed DAEDCMD method isolates event-specific parameters and learns continuous-time dynamics, achieving superior performance over existing baselines.
Authors:Xinyu Huang, Yuhao Dong, Weiwei Tian, Bo Li, Rui Feng, Ziwei Liu
Abstract:
State-of-the-art large multi-modal models (LMMs) face challenges when processing high-resolution images, as these inputs are converted into enormous visual tokens, many of which are irrelevant to the downstream task. In this paper, we propose Multi-turn Grounding-based Policy Optimization (MGPO), an end-to-end reinforcement learning (RL) framework that enables LMMs to iteratively focus on key visual regions by automatically cropping sub-images, based on model-predicted grounding coordinates within a multi-turn conversation framework. Compared to supervised fine-tuning (SFT), which requires costly additional grounding annotations, our approach highlights that LMMs can emerge robust grounding abilities during the RL training process, leveraging only a binary reward function derived from the correctness of the final answer. Additionally, we observe that LMMs struggle to autonomously trigger visual grounding during the rollout process. To address this cold start problem, we design a multi-turn conversational template and restrict policy loss computation to model outputs generated across multiple dialogue rounds, thereby promoting stable optimization. Extensive experiments demonstrate that, when trained on standard visual-question-short answering data without grounding annotations, MGPO effectively elicits stronger grounding capabilities compared to GRPO, leading to 5.4\% improvement on in-distribution MME-Realworld and 5.2\% improvement on the challenging out-of-distribution (OOD) V* Bench. Notably, MGPO post-training on Qwen2.5-VL-7B with 21K samples surpasses OpenAI's o1 and GPT-4o models on the OOD V* Bench. Codes are available at https://github.com/EvolvingLMMs-Lab/MGPO.
中文: 本文提出MGPO强化学习框架,通过多轮对话中的自动图像裁剪使大模型能迭代聚焦关键视觉区域,在无需标注数据的情况下显著提升了视觉任务性能。
English: The paper introduces MGPO, a reinforcement learning framework that enables large multi-modal models to iteratively focus on key image regions through automatic cropping, improving performance on visual tasks without requiring costly grounding annotations.
Authors:Yuedong Tan, Jiawei Shao, Eduard Zamfir, Ruanjun Li, Zhaochong An, Chao Ma, Danda Paudel, Luc Van Gool, Radu Timofte, Zongwei Wu
Abstract:
Multimodal data is known to be helpful for visual tracking by improving robustness to appearance variations. However, sensor synchronization challenges often compromise data availability, particularly in video settings where shortages can be temporal. Despite its importance, this area remains underexplored. In this paper, we present the first comprehensive study on tracker performance with temporally incomplete multimodal data. Unsurprisingly, under such a circumstance, existing trackers exhibit significant performance degradation, as their rigid architectures lack the adaptability needed to effectively handle missing modalities. To address these limitations, we propose a flexible framework for robust multimodal tracking. We venture that a tracker should dynamically activate computational units based on missing data rates. This is achieved through a novel Heterogeneous Mixture-of-Experts fusion mechanism with adaptive complexity, coupled with a video-level masking strategy that ensures both temporal consistency and spatial completeness which is critical for effective video tracking. Surprisingly, our model not only adapts to varying missing rates but also adjusts to scene complexity. Extensive experiments show that our model achieves SOTA performance across 9 benchmarks, excelling in both conventional complete and missing modality settings. The code and benchmark will be publicly available at https://github.com/supertyd/FlexTrack/tree/main.
中文: 本文提出了一种灵活的多模态跟踪框架,通过新型融合机制和掩码策略动态适应时序不完整数据,在多个基准测试中实现了最先进的性能。
English: This paper introduces a flexible multimodal tracking framework that dynamically adapts to temporally incomplete data through a novel fusion mechanism and masking strategy, achieving state-of-the-art performance across multiple benchmarks.
Authors:Robert Leppich, Michael Stenger, André Bauer, Samuel Kounev
Abstract:
With the advent of Transformers, time series forecasting has seen significant advances, yet it remains challenging due to the need for effective sequence representation, memory construction, and accurate target projection. Time series forecasting remains a challenging task, demanding effective sequence representation, meaningful information extraction, and precise future projection. Each dataset and forecasting configuration constitutes a distinct task, each posing unique challenges the model must overcome to produce accurate predictions. To systematically address these task-specific difficulties, this work decomposes the time series forecasting pipeline into three core stages: input sequence representation, information extraction and memory construction, and final target projection. Within each stage, we investigate a range of architectural configurations to assess the effectiveness of various modules, such as convolutional layers for feature extraction and self-attention mechanisms for information extraction, across diverse forecasting tasks, including evaluations on seven benchmark datasets. Our models achieve state-of-the-art forecasting accuracy while greatly enhancing computational efficiency, with reduced training and inference times and a lower parameter count. The source code is available at https://github.com/RobertLeppich/REP-Net.
Chinese: 本研究将时间序列预测系统分解为序列表示、记忆构建和目标投影三个阶段,在七个基准数据集上实现了最先进的预测精度,并显著提升了计算效率。
English: This study systematically breaks down time series forecasting into three stages—sequence representation, memory construction, and target projection—achieving state-of-the-art accuracy and improved computational efficiency across seven benchmark datasets.
Authors:Jian Kai, Tianwei Zhang, Zihan Ling, Yang Cao, Can Shen
Abstract:
Accurate bandwidth estimation (BWE) is critical for real-time communication (RTC) systems. Traditional heuristic approaches offer limited adaptability under dynamic networks, while online reinforcement learning (RL) suffers from high exploration costs and potential service disruptions. Offline RL, which leverages high-quality data collected from real-world environments, offers a promising alternative. However, challenges such as out-of-distribution (OOD) actions, policy extraction from behaviorally diverse datasets, and reliable deployment in production systems remain unsolved. We propose RBWE, a robust bandwidth estimation framework based on offline RL that integrates Q-ensemble (an ensemble of Q-functions) with a Gaussian mixture policy to mitigate OOD risks and enhance policy learning. A fallback mechanism ensures deployment stability by switching to heuristic methods under high uncertainty. Experimental results show that RBWE reduces overestimation errors by 18% and improves the 10th percentile Quality of Experience (QoE) by 18.6%, demonstrating its practical effectiveness in real-world RTC applications. The implementation is publicly available at https://github.com/jiu2021/RBWE_offline.
中文: RBWE是一种基于离线强化学习的鲁棒带宽估计框架,通过集成Q函数和高斯混合策略来应对分布外风险,将高估误差降低18%,并将体验质量提升18.6%,确保实时通信系统的稳定部署。
English: RBWE is an offline reinforcement learning framework that enhances bandwidth estimation for real-time communication by using Q-ensemble and a Gaussian mixture policy to address out-of-distribution risks, reducing overestimation errors by 18% and improving QoE by 18.6%.
Authors:Tristan Kirscher, Sylvain Faisan, Xavier Coubez, Loris Barrier, Philippe Meyer
Abstract:
Pediatric medical imaging presents unique challenges due to significant anatomical and developmental differences compared to adults. Direct application of segmentation models trained on adult data often yields suboptimal performance, particularly for small or rapidly evolving structures. To address these challenges, several strategies leveraging the nnU-Net framework have been proposed, differing along four key axes: (i) the fingerprint dataset (adult, pediatric, or a combination thereof) from which the Training Plan -including the network architecture-is derived; (ii) the Learning Set (adult, pediatric, or mixed), (iii) Data Augmentation parameters, and (iv) the Transfer learning method (finetuning versus continual learning). In this work, we introduce PSAT (Pediatric Segmentation Approaches via Adult Augmentations and Transfer learning), a systematic study that investigates the impact of these axes on segmentation performance. We benchmark the derived strategies on two pediatric CT datasets and compare them with state-of-theart methods, including a commercial radiotherapy solution. PSAT highlights key pitfalls and provides actionable insights for improving pediatric segmentation. Our experiments reveal that a training plan based on an adult fingerprint dataset is misaligned with pediatric anatomy-resulting in significant performance degradation, especially when segmenting fine structures-and that continual learning strategies mitigate institutional shifts, thus enhancing generalization across diverse pediatric datasets. The code is available at https://github.com/ICANS-Strasbourg/PSAT.
中文摘要:PSAT研究系统评估了基于nnU-Net框架的儿科医学影像分割策略,发现成人数据训练的模型在儿科数据上表现欠佳,而持续学习方法能有效提升跨机构泛化能力。
English summary: The PSAT study systematically evaluates pediatric medical image segmentation strategies using the nnU-Net framework, revealing that adult-trained models underperform on pediatric data while continual learning methods improve cross-institutional generalization.
Authors:Kechen Liu
Abstract:
Self-Attentive Sequential Recommendation (SASRec) effectively captures long-term user preferences by applying attention mechanisms to historical interactions. Concurrently, the rise of Large Language Models (LLMs) has motivated research into LLM-based recommendation, which leverages their powerful generalization and language understanding capabilities. However, LLMs often lack the domain-specific knowledge and collaborative signals essential for high-quality recommendations when relying solely on textual prompts. To address this limitation, this study proposes SASRecLLM, a novel framework that integrates SASRec as a collaborative encoder with an LLM fine-tuned using Low-Rank Adaptation (LoRA). The components are connected via a mapping layer to align their dimensional spaces, and three targeted training strategies are designed to optimize the hybrid architecture. Extensive experiments on multiple datasets demonstrate that SASRecLLM achieves robust and consistent improvements over strong baselines in both cold-start and warm-start scenarios. This work advances the field of LLM-based recommendation by presenting a modular and effective paradigm for fusing structured collaborative filtering with the semantic power of fine-tuned LLMs. The implementation is available on GitHub: https://github.com/kechenkristin/RecLLM
中文摘要:SASRecLLM创新性地通过维度对齐和专项训练策略,将SASRec的协同过滤优势与微调大语言模型的语义能力相结合,在冷启动和热启动推荐场景中均实现了优越性能。
English Summary: SASRecLLM is a novel framework that integrates the collaborative filtering strengths of SASRec with the semantic capabilities of fine-tuned LLMs through dimensional alignment and specialized training strategies, achieving superior performance in both cold-start and warm-start recommendation scenarios.
Authors:Ruofei Wang, Peiqi Duan, Boxin Shi, Renjie Wan
Abstract:
With more event datasets being released online, safeguarding the event dataset against unauthorized usage has become a serious concern for data owners. Unlearnable Examples are proposed to prevent the unauthorized exploitation of image datasets. However, it's unclear how to create unlearnable asynchronous event streams to prevent event misuse. In this work, we propose the first unlearnable event stream generation method to prevent unauthorized training from event datasets. A new form of asynchronous event error-minimizing noise is proposed to perturb event streams, tricking the unauthorized model into learning embedded noise instead of realistic features. To be compatible with the sparse event, a projection strategy is presented to sparsify the noise to render our unlearnable event streams (UEvs). Extensive experiments demonstrate that our method effectively protects event data from unauthorized exploitation, while preserving their utility for legitimate use. We hope our UEvs contribute to the advancement of secure and trustworthy event dataset sharing. Code is available at: https://github.com/rfww/uevs.
中文: 本文首次提出不可学习事件流生成方法,通过添加稀疏的误差最小化噪声来干扰事件流,使未经授权的模型学习嵌入噪声而非真实特征,从而在保护数据合法使用的同时防止事件数据集被非法利用。
English: This paper introduces the first method for generating unlearnable event streams to protect event datasets from unauthorized training by adding sparse, error-minimizing noise that misleads models while preserving legitimate data utility.
Authors:Guohao Li, Li Jing, Jia Wu, Xuefei Li, Kai Zhu, Yue He
Abstract:
Most existing multimodal collaborative filtering recommendation (MCFRec) methods rely heavily on ID features and multimodal content to enhance recommendation performance. However, this paper reveals that ID features are effective but have limited benefits in multimodal collaborative filtering recommendation. Therefore, this paper systematically deconstruct the pros and cons of ID features: (i) they provide initial embedding but lack semantic richness, (ii) they provide a unique identifier for each user and item but hinder generalization to untrained data, and (iii) they assist in aligning and fusing multimodal features but may lead to representation shift. Based on these insights, this paper proposes IDFREE, an ID-free multimodal collaborative Filtering REcommEndation baseline. IDFREE replaces ID features with multimodal features and positional encodings to generate semantically meaningful ID-free embeddings. For ID-free multimodal collaborative filtering, it further proposes an adaptive similarity graph module to construct dynamic user-user and item-item graphs based on multimodal features. Then, an augmented user-item graph encoder is proposed to construct more effective user and item encoding. Finally, IDFREE achieves inter-multimodal alignment based on the contrastive learning and uses Softmax loss as recommendation loss. Basic experiments on three public datasets demonstrate that IDFREE outperforms existing ID-based MCFRec methods, achieving an average performance gain of 72.24% across standard metrics (Recall@5, 10, 20, 50 and NDCG@5, 10, 20, 50). Exploratory and extended experiments further validate our findings on the limitations of ID features in MCFRec. The code is released at https://github.com/G-H-Li/IDFREE.
Chinese: 本文提出IDFREE,一种无需ID的多模态协同过滤推荐基准方法,通过用多模态特征和位置编码替代传统ID特征,在关键指标上比现有基于ID的方法性能平均提升72.24%。
English: This paper introduces IDFREE, an ID-free multimodal collaborative filtering recommendation baseline that replaces traditional ID features with multimodal features and positional encodings, achieving superior performance over existing ID-based methods with a 72.24% average gain across key metrics.
Authors:Weihua Du, Pranjal Aggarwal, Sean Welleck, Yiming Yang
Abstract:
Current long chain-of-thought (long-CoT) models excel at mathematical reasoning but rely on slow and error-prone natural language traces. Tool-augmented agents address arithmetic via code execution, but often falter on complex logical tasks. We introduce a fine-tuning framework, DualDistill, that distills complementary reasoning strategies from multiple teachers into a unified student model. Using this approach, we train Agentic-R1, which dynamically selects the optimal strategy for each query, invoking tools for arithmetic and algorithmic problems, and using text-based reasoning for abstract ones. Our method improves accuracy across a range of tasks, including both computation-intensive and standard benchmarks, demonstrating the effectiveness of multi-strategy distillation in achieving robust and efficient reasoning. Our project is available at https://github.com/StigLidu/DualDistill
中文: DualDistill框架通过融合多位教师的互补推理策略,训练出能动态选择文本推理或工具调用的统一学生模型,在各类任务中显著提升了推理准确性。
English: The DualDistill framework fine-tunes a unified student model by distilling complementary reasoning strategies from multiple teachers, enabling dynamic selection of text-based reasoning or tool invocation for enhanced accuracy across diverse tasks.
Authors:Shangzhan Li, Zefan Wang, Ye He, Yuxuan Li, Qi Shi, Jianling Li, Yonggang Hu, Wanxiang Che, Xu Han, Zhiyuan Liu, Maosong Sun
Abstract:
Kernel development in deep learning requires optimizing computational units across hardware while balancing memory management, parallelism, and hardware-specific optimizations through extensive empirical tuning. Although domain-specific languages like Triton simplify GPU programming by abstracting low-level details, developers must still manually tune critical parameters such as tile sizes and memory access patterns through iterative experimentation, creating substantial barriers to optimal performance and wider adoption. In this work, we introduce AutoTriton, the first model dedicated to Triton programming powered by reinforcement learning (RL). AutoTriton performs supervised fine-tuning (SFT) to be equipped with essential Triton programming expertise using a high-quality data gathering pipeline, and conducts RL with Group Relative Policy Optimization (GRPO) algorithm, combining a rule-based reward and an execution-based reward to further improve Triton programming ability, sequentially. Experiments across five evaluation channels of TritonBench and KernelBench illustrate that our 8B model AutoTriton achieves performance comparable to mainstream large models, including Claude-4-Sonnet and DeepSeek-R1-0528. Further experimental analysis demonstrates the crucial role of each module within AutoTriton, including the SFT stage, the RL stage, and the reward design strategy. These findings underscore the promise of RL for automatically generating high-performance kernels, and since high-performance kernels are core components of AI systems, this breakthrough establishes an important foundation for building more efficient AI systems. The model and code will be available at https://github.com/AI9Stars/AutoTriton.
中文: AutoTriton是一种基于强化学习的模型,通过监督微调结合规则与执行奖励来自动化Triton编程,在性能上媲美主流大模型,展现了自动生成高性能内核以构建更高效AI系统的潜力。
English: AutoTriton is a reinforcement learning-based model that automates Triton programming by combining supervised fine-tuning with rule-based and execution-based rewards, achieving performance comparable to leading large models and demonstrating the potential for automatically generating high-performance kernels to build more efficient AI systems.
Authors:Rongsheng Wang, Junying Chen, Ke Ji, Zhenyang Cai, Shunian Chen, Yunjin Yang, Benyou Wang
Abstract:
Recent advances in video generation have shown remarkable progress in open-domain settings, yet medical video generation remains largely underexplored. Medical videos are critical for applications such as clinical training, education, and simulation, requiring not only high visual fidelity but also strict medical accuracy. However, current models often produce unrealistic or erroneous content when applied to medical prompts, largely due to the lack of large-scale, high-quality datasets tailored to the medical domain. To address this gap, we introduce MedVideoCap-55K, the first large-scale, diverse, and caption-rich dataset for medical video generation. It comprises over 55,000 curated clips spanning real-world medical scenarios, providing a strong foundation for training generalist medical video generation models. Built upon this dataset, we develop MedGen, which achieves leading performance among open-source models and rivals commercial systems across multiple benchmarks in both visual quality and medical accuracy. We hope our dataset and model can serve as a valuable resource and help catalyze further research in medical video generation. Our code and data is available at https://github.com/FreedomIntelligence/MedGen
中文摘要:针对医学视频生成领域缺乏专业数据集的问题,我们开发了MedVideoCap-55K大规模数据集,并基于此构建的MedGen模型在视觉质量和医学准确性方面均达到了领先水平。
English Summary: Medical video generation has been limited by the absence of specialized datasets, leading to the creation of MedVideoCap-55K, a comprehensive dataset that enables MedGen to achieve top-tier performance in both visual quality and medical precision.
Authors:Alexandre Friou
Abstract:
This article introduces MCNP-GO (https://github.com/afriou/mcnpgo), a Python package designed to manipulate and assemble MCNP input files, allowing users to assemble a set of independent objects, each described by a valid MCNP file, into a single cohesive file. This tool is particularly useful for applications where precise modeling and positioning of equipment are crucial. The package addresses the challenges of managing large databases of MCNP input files, ensuring reliability and traceability through configuration management systems. MCNP-GO provides functionalities such as renumbering, extracting subsets of files, transforming files, and assembling files while managing collisions and materials. It also keeps track of the operations performed on files, enhancing traceability and ease of modification. The article demonstrates the package's capabilities through a practical example of assembling an MCNP input file for a tomographic experiment, highlighting its efficiency and user-friendliness. MCNP-GO is designed for users with minimal Python knowledge.
中文: MCNP-GO是一款Python工具包,可简化MCNP输入文件的组装与管理,使用户能够将独立对象合并为单一文件,同时确保可追溯性和处理冲突,且无需深厚的Python知识。
English: MCNP-GO is a Python package that simplifies the assembly and management of MCNP input files, enabling users to combine individual objects into a single file while ensuring traceability and handling collisions, with minimal Python expertise required.
Authors:Kaixiang Zhao, Joseph Yousry Attalla, Qian Lou, Yushun Dong
Abstract:
Graph Neural Networks (GNNs) have achieved state-of-the-art performance in various graph-based learning tasks. However, enabling privacy-preserving GNNs in encrypted domains, such as under Fully Homomorphic Encryption (FHE), typically incurs substantial computational overhead, rendering real-time and privacy-preserving inference impractical. In this work, we propose DESIGN (EncrypteD GNN Inference via sErver-Side Input Graph pruNing), a novel framework for efficient encrypted GNN inference. DESIGN tackles the critical efficiency limitations of existing FHE GNN approaches, which often overlook input data redundancy and apply uniform computational strategies. Our framework achieves significant performance gains through a hierarchical optimization strategy executed entirely on the server: first, FHE-compatible node importance scores (based on encrypted degree statistics) are computed from the encrypted graph. These scores then guide a homomorphic partitioning process, generating multi-level importance masks directly under FHE. This dynamically generated mask facilitates both input graph pruning (by logically removing unimportant elements) and a novel adaptive polynomial activation scheme, where activation complexity is tailored to node importance levels. Empirical evaluations demonstrate that DESIGN substantially accelerates FHE GNN inference compared to state-of-the-art methods while maintaining competitive model accuracy, presenting a robust solution for secure graph analytics. Our implementation is publicly available at https://github.com/LabRAI/DESIGN.
中文摘要:DESIGN框架通过服务器端分层优化,在完全同态加密环境下动态修剪输入图数据并根据节点重要性自适应调整激活复杂度,实现了高效的加密图神经网络推理。
English Summary: The DESIGN framework enables efficient encrypted Graph Neural Network inference by implementing server-side hierarchical optimization that dynamically prunes input graphs and adapts activation complexity based on node importance under Fully Homomorphic Encryption.
Authors:Ammar Daskin
Abstract:
In this paper, we describe a parameterized quantum circuit that can be considered as convolutional and pooling layers for graph neural networks. The circuit incorporates the parameterized quantum Fourier circuit where the qubit connections for the controlled gates derived from the Laplacian operator. Specifically, we show that the eigenspace of the Laplacian operator of a graph can be approximated by using QFT based circuit whose connections are determined from the adjacency matrix. For an $N\times N$ Laplacian, this approach yields an approximate polynomial-depth circuit requiring only $n=log(N)$ qubits. These types of circuits can eliminate the expensive classical computations for approximating the learnable functions of the Laplacian through Chebyshev polynomial or Taylor expansions.
Using this circuit as a convolutional layer provides an $n-$ dimensional probability vector that can be considered as the filtered and compressed graph signal. Therefore, the circuit along with the measurement can be considered a very efficient convolution plus pooling layer that transforms an $N$-dimensional signal input into $n-$dimensional signal with an exponential compression. We then apply a classical neural network prediction head to the output of the circuit to construct a complete graph neural network. Since the circuit incorporates geometric structure through its graph connection-based approach, we present graph classification results for the benchmark datasets listed in TUDataset library. Using only [1-100] learnable parameters for the quantum circuit and minimal classical layers (1000-5000 parameters) in a generic setting, the obtained results are comparable to and in some cases better than many of the baseline results, particularly for the cases when geometric structure plays a significant role.
中文: 本文提出了一种参数化量子电路,可作为图神经网络的卷积和池化层,以极少的参数实现图信号的指数级压缩,并在保持几何结构的关键数据集中取得优于或可比拟基准结果的性能。
English: This paper introduces a parameterized quantum circuit that functions as convolutional and pooling layers for graph neural networks, achieving exponential compression of graph signals while maintaining competitive performance with minimal parameters.
Authors:Shuo Shao, Yiming Li, Mengren Zheng, Zhiyang Hu, Yukun Chen, Boheng Li, Yu He, Junfeng Guo, Dacheng Tao, Zhan Qin
Abstract:
The widespread application of Deep Learning across diverse domains hinges critically on the quality and composition of training datasets. However, the common lack of disclosure regarding their usage raises significant privacy and copyright concerns. Dataset auditing techniques, which aim to determine if a specific dataset was used to train a given suspicious model, provide promising solutions to addressing these transparency gaps. While prior work has developed various auditing methods, their resilience against dedicated adversarial attacks remains largely unexplored. To bridge the gap, this paper initiates a comprehensive study evaluating dataset auditing from an adversarial perspective. We start with introducing a novel taxonomy, classifying existing methods based on their reliance on internal features (IF) (inherent to the data) versus external features (EF) (artificially introduced for auditing). Subsequently, we formulate two primary attack types: evasion attacks, designed to conceal the use of a dataset, and forgery attacks, intending to falsely implicate an unused dataset. Building on the understanding of existing methods and attack objectives, we further propose systematic attack strategies: decoupling, removal, and detection for evasion; adversarial example-based methods for forgery. These formulations and strategies lead to our new benchmark, DATABench, comprising 17 evasion attacks, 5 forgery attacks, and 9 representative auditing methods. Extensive evaluations using DATABench reveal that none of the evaluated auditing methods are sufficiently robust or distinctive under adversarial settings. These findings underscore the urgent need for developing a more secure and reliable dataset auditing method capable of withstanding sophisticated adversarial manipulation. Code is available at https://github.com/shaoshuo-ss/DATABench.
中文: 本文从对抗性角度全面评估数据集审计方法,提出分类法和系统性攻击策略,揭示现有方法易受操纵的脆弱性,并建立DATABench基准,证明亟需开发更鲁棒的审计技术。
English: This paper introduces a comprehensive adversarial evaluation of dataset auditing methods, proposing a taxonomy and systematic attack strategies that reveal their vulnerability to manipulation, and establishes the DATABench benchmark to demonstrate the urgent need for more robust auditing techniques.
Authors:Shuai Li, Shihan Chen, Wanru Geng, Zhaohua Xu, Xiaolu Liu, Can Dong, Zhen Tian, Changlin Chen
Abstract:
In the realm of industrial quality inspection, defect detection stands as a critical component, particularly in high-precision, safety-critical sectors such as automotive components aerospace, and medical devices. Traditional methods, reliant on manual inspection or early image processing algorithms, suffer from inefficiencies, high costs, and limited robustness. This paper introduces a semi-supervised defect detection framework based on conditional diffusion (DSYM), leveraging a two-stage collaborative training mechanism and a staged joint optimization strategy. The framework utilizes labeled data for initial training and subsequently incorporates unlabeled data through the generation of pseudo-labels. A conditional diffusion model synthesizes multi-scale pseudo-defect samples, while a CLIP cross-modal feature-based noise filtering mechanism mitigates label contamination. Experimental results on the NEU-DET dataset demonstrate a 78.4% mAP@0.5 with the same amount of labeled data as traditional supervised methods, and 75.1% mAP@0.5 with only 40% of the labeled data required by the original supervised model, showcasing significant advantages in data efficiency. This research provides a high-precision, low-labeling-dependent solution for defect detection in industrial quality inspection scenarios. The work of this article has been open-sourced at https://github.com/cLin-c/Semisupervised-DSYM.
中文: 本文提出了一种基于条件扩散的半监督缺陷检测框架DSYM,通过两阶段协同训练和伪标签生成,在减少标注数据需求的同时实现了高精度检测,在NEU-DET数据集上展现了显著的数据效率优势。
English: This paper presents a semi-supervised defect detection framework called DSYM that uses conditional diffusion models and a two-stage training approach to achieve high precision with reduced labeled data, demonstrating superior data efficiency on the NEU-DET dataset.
Authors:Pedro R. A. S. Bassi, Wenxuan Li, Jieneng Chen, Zheren Zhu, Tianyu Lin, Sergio Decherchi, Andrea Cavalli, Kang Wang, Yang Yang, Alan L. Yuille, Zongwei Zhou
Abstract:
Tumor segmentation in CT scans is key for diagnosis, surgery, and prognosis, yet segmentation masks are scarce because their creation requires time and expertise. Public abdominal CT datasets have from dozens to a couple thousand tumor masks, but hospitals have hundreds of thousands of tumor CTs with radiology reports. Thus, leveraging reports to improve segmentation is key for scaling. In this paper, we propose a report-supervision loss (R-Super) that converts radiology reports into voxel-wise supervision for tumor segmentation AI. We created a dataset with 6,718 CT-Report pairs (from the UCSF Hospital), and merged it with public CT-Mask datasets (from AbdomenAtlas 2.0). We used our R-Super to train with these masks and reports, and strongly improved tumor segmentation in internal and external validation--F1 Score increased by up to 16% with respect to training with masks only. By leveraging readily available radiology reports to supplement scarce segmentation masks, R-Super strongly improves AI performance both when very few training masks are available (e.g., 50), and when many masks were available (e.g., 1.7K).
Project: https://github.com/MrGiovanni/R-Super
中文: 本文提出R-Super方法,通过将放射学报告转化为体素级监督来增强CT扫描中的肿瘤分割,在仅有少量分割掩码的情况下仍能显著提升AI性能,F1分数最高提升16%。
English: This paper introduces R-Super, a novel report-supervision loss that converts radiology reports into voxel-wise supervision to enhance tumor segmentation in CT scans, significantly improving AI performance by up to 16% in F1 scores even with limited segmentation masks.
Authors:Jie Huang, Daiheng Zhang
Abstract:
Structure-based drug design (SBDD) seeks to generate molecules that bind effectively to protein targets by leveraging their 3D structural information. While diffusion-based generative models have become the predominant approach for SBDD, alternative non-autoregressive frameworks remain relatively underexplored. In this work, we introduce MolFORM, a novel generative framework that jointly models discrete (atom types) and continuous (3D coordinates) molecular modalities using multi-flow matching. To further enhance generation quality, we incorporate a preference-guided fine-tuning stage based on Direct Preference Optimization (DPO), using Vina score as a reward signal. We propose a multi-modal flow DPO co-modeling strategy that simultaneously aligns discrete and continuous modalities, leading to consistent improvements across multiple evaluation metrics. The code is provided at: https://github.com/huang3170/MolForm.
中文: 本文提出MolFORM框架,通过多流匹配联合建模分子离散和连续模态,并采用基于直接偏好优化的引导微调策略,有效提升了基于结构的药物设计生成质量。
English: This paper introduces MolFORM, a multi-flow matching framework that jointly models discrete and continuous molecular modalities for structure-based drug design, enhanced by preference-guided fine-tuning using Direct Preference Optimization to improve generation quality.
Authors:Arthur Deng, Karsten Householder, Fang Wu, Sebastian Thrun, K. Christopher Garcia, Brian Trippe
Abstract:
Accurate estimation of mutational effects on protein-protein binding energies is an open problem with applications in structural biology and therapeutic design. Several deep learning predictors for this task have been proposed, but, presumably due to the scarcity of binding data, these methods underperform computationally expensive estimates based on empirical force fields. In response, we propose a transfer-learning approach that leverages advances in protein sequence modeling and folding stability prediction for this task. The key idea is to parameterize the binding energy as the difference between the folding energy of the protein complex and the sum of the folding energies of its binding partners. We show that using a pre-trained inverse-folding model as a proxy for folding energy provides strong zero-shot performance, and can be fine-tuned with (1) copious folding energy measurements and (2) more limited binding energy measurements. The resulting predictor, StaB-ddG, is the first deep learning predictor to match the accuracy of the state-of-the-art empirical force-field method FoldX, while offering an over 1,000x speed-up.
Chinese: 本研究提出StaB-ddG方法,通过将蛋白质结合能建模为折叠能差异,利用迁移学习精准估算突变对蛋白质结合能的影响,在保持顶尖精度的同时实现了计算速度的千倍提升。
English: This study introduces StaB-ddG, a transfer-learning approach that accurately estimates mutational effects on protein-protein binding energies by modeling them as differences in folding energies, achieving state-of-the-art accuracy with a significant speed improvement over existing methods.
Authors:Andrew Randono
Abstract:
Diffusion models for image generation function by progressively adding noise to an image set and training a model to separate out the signal from the noise. The noise profile used by these models is white noise -- that is, noise based on independent normal distributions at each point whose mean and variance is independent of the scale. By contrast, most natural image sets exhibit a type of scale invariance in their low-order statistical properties characterized by a power-law scaling. Consequently, natural images are closer (in a quantifiable sense) to a different probability distribution that emphasizes large scale correlations and de-emphasizes small scale correlations. These scale invariant noise profiles can be incorporated into diffusion models in place of white noise to form what we will call a ``Cloud Diffusion Model". We argue that these models can lead to faster inference, improved high-frequency details, and greater controllability. In a follow-up paper, we will build and train a Cloud Diffusion Model that uses scale invariance at a fundamental level and compare it to classic, white noise diffusion models.
Chinese: 云扩散模型用与自然图像统计特性更匹配的尺度不变噪声谱替代传统扩散模型中的白噪声,有望实现更快的推理速度、更优的高频细节和更强的可控性。
English: Cloud Diffusion Models replace the white noise in traditional diffusion models with scale-invariant noise profiles that better match natural images' statistical properties, promising faster inference, enhanced high-frequency details, and improved controllability.
Authors:Ashima Suvarna, Christina Chance, Karolina Naranjo, Hamid Palangi, Sophie Hao, Thomas Hartvigsen, Saadia Gabriel
Abstract:
Automatic toxic language detection is critical for creating safe, inclusive online spaces. However, it is a highly subjective task, with perceptions of toxic language shaped by community norms and lived experience. Existing toxicity detection models are typically trained on annotations that collapse diverse annotator perspectives into a single ground truth, erasing important context-specific notions of toxicity such as reclaimed language. To address this, we introduce MODELCITIZENS, a dataset of 6.8K social media posts and 40K toxicity annotations across diverse identity groups. To capture the role of conversational context on toxicity, typical of social media posts, we augment MODELCITIZENS posts with LLM-generated conversational scenarios. State-of-the-art toxicity detection tools (e.g. OpenAI Moderation API, GPT-o4-mini) underperform on MODELCITIZENS, with further degradation on context-augmented posts. Finally, we release LLAMACITIZEN-8B and GEMMACITIZEN-12B, LLaMA- and Gemma-based models finetuned on MODELCITIZENS, which outperform GPT-o4-mini by 5.5% on in-distribution evaluations. Our findings highlight the importance of community-informed annotation and modeling for inclusive content moderation. The data, models and code are available at https://github.com/asuvarna31/modelcitizens.
中文: 本研究推出了MODELCITIZENS数据集,包含多样化的毒性标注,证明现有检测工具在其上表现不佳,而基于社区认知训练的模型在包容性内容审核中效果更优。
English: This study introduces MODELCITIZENS, a dataset with diverse toxicity annotations, and shows that current detection tools underperform on it, while their community-informed models achieve better results for inclusive moderation.
Authors:Sajjad Ghiasvand, Mahnoosh Alizadeh, Ramtin Pedarsani
Abstract:
Vision-Language Models (VLMs) like CLIP have demonstrated remarkable generalization in zero- and few-shot settings, but adapting them efficiently to decentralized, heterogeneous data remains a challenge. While prompt tuning has emerged as a popular parameter-efficient approach in personalized federated learning, existing methods often sacrifice generalization in favor of personalization, struggling particularly on unseen classes or domains. In this work, we propose pFedMMA, the first personalized federated learning framework that leverages multi-modal adapters for vision-language tasks. Each adapter contains modality-specific up- and down-projection layers alongside a globally shared projection that aligns cross-modal features. Our asymmetric optimization strategy allows clients to locally adapt to personalized data distributions while collaboratively training the shared projection to improve global generalization. This design is also communication-efficient, as only the shared component is exchanged during rounds. Through extensive experiments across eleven datasets, including domain- and label-shift scenarios, we show that pFedMMA achieves state-of-the-art trade-offs between personalization and generalization, outperforming recent federated prompt tuning methods. The code is available at https://github.com/sajjad-ucsb/pFedMMA.
中文:pFedMMA框架通过引入多模态适配器和共享全局投影,在联邦学习中兼顾个性化与泛化能力,在多种数据集上实现最优性能,并保持高效的通信效率。
English: The pFedMMA framework introduces multi-modal adapters with a shared global projection to enhance both personalization and generalization in federated learning, achieving superior performance across diverse datasets while maintaining communication efficiency.
Authors:Chi-Chang Lee, Zhang-Wei Hong, Pulkit Agrawal
Abstract:
In many reinforcement learning (RL) applications, augmenting the task rewards with heuristic rewards that encode human priors about how a task should be solved is crucial for achieving desirable performance. However, because such heuristics are usually not optimal, much human effort and computational resources are wasted in carefully balancing tasks and heuristic rewards. Theoretically rigorous ways of incorporating heuristics rely on the idea of \textit{policy invariance}, which guarantees that the performance of a policy obtained by maximizing heuristic rewards is the same as the optimal policy with respect to the task reward. However, in practice, policy invariance doesn't result in policy improvement, and such methods are known to empirically perform poorly. We propose a new paradigm to mitigate reward hacking and effectively use heuristics based on the practical goal of maximizing policy improvement instead of policy improvement. Our framework, Heuristic Enhanced Policy Optimization (HEPO), effectively leverages heuristics while avoiding the pitfall of prior methods for mitigating reward hacking. HEPO achieves superior performance on standard benchmarks with well-engineered reward functions. More surprisingly, HEPO allows policy optimization to achieve good performance even when heuristics are not well-engineered and designed by non-expert humans, showcasing HEPO's ability to reduce human effort in reward design. % HEPO is a plug-and-play optimization method for leveraging heuristics in reinforcement learning. Code is available at https://github.com/Improbable-AI/hepo.
在强化学习中,HEPO通过专注于策略改进而非策略不变性,提供了一种有效利用启发式奖励的新方法,实现了更优性能并减少了对专家设计启发式的依赖。
In reinforcement learning, HEPO offers a novel approach to effectively utilize heuristic rewards by focusing on policy improvement rather than policy invariance, achieving superior performance and reducing the need for expert-designed heuristics.
Authors:Cheng Yuan, Xinkai Rui, Yongqi Fan, Yawei Fan, Boyang Zhong, Jiacheng Wang, Weiyan Zhang, Tong Ruan
Abstract:
Despite the remarkable performance of Large Language Models (LLMs) in automated discharge summary generation, they still suffer from hallucination issues, such as generating inaccurate content or fabricating information without valid sources. In addition, electronic medical records (EMRs) typically consist of long-form data, making it challenging for LLMs to attribute the generated content to the sources. To address these challenges, we propose LCDS, a Logic-Controlled Discharge Summary generation system. LCDS constructs a source mapping table by calculating textual similarity between EMRs and discharge summaries to constrain the scope of summarized content. Moreover, LCDS incorporates a comprehensive set of logical rules, enabling it to generate more reliable silver discharge summaries tailored to different clinical fields. Furthermore, LCDS supports source attribution for generated content, allowing experts to efficiently review, provide feedback, and rectify errors. The resulting golden discharge summaries are subsequently recorded for incremental fine-tuning of LLMs. Our project and demo video are in the GitHub repository https://github.com/ycycyc02/LCDS.
中文: LCDS系统通过构建源映射表并整合逻辑规则,解决了大型语言模型在生成出院小结时的幻觉问题,确保内容可溯源且适应不同临床需求。
English: The proposed LCDS system addresses hallucination and source attribution challenges in LLM-generated discharge summaries by using a source mapping table and logical rules to produce reliable summaries with traceable content origins.
Authors:Yue Wang, Miao Zhou, Guijing Huang, Rui Zhuo, Chao Yi, Zhenliang Ma
Abstract:
Pre-timed traffic signal control, commonly used for operating signalized intersections and coordinated arterials, requires tedious manual work for signaling plan creating and updating. When the time-of-day or day-of-week plans are utilized, one intersection is often associated with multiple plans, leading to further repetitive manual plan parameter inputting. To enable a user-friendly traffic signal control plan management process, this study proposes Chat2SPaT, a method to convert users' semi-structured and ambiguous descriptions on the signal control plan to exact signal phase and timing (SPaT) results, which could further be transformed into structured stage-based or ring-based plans to interact with intelligent transportation system (ITS) software and traffic signal controllers. With curated prompts, Chat2SPaT first leverages large language models' (LLMs) capability of understanding users' plan descriptions and reformulate the plan as a combination of phase sequence and phase attribute results in the json format. Based on LLM outputs, python scripts are designed to locate phases in a cycle, address nuances of traffic signal control, and finally assemble the complete traffic signal control plan. Within a chat, the pipeline can be utilized iteratively to conduct further plan editing. Experiments show that Chat2SPaT can generate plans with an accuracy of over 94% for both English and Chinese cases, using a test dataset with over 300 plan descriptions. As the first benchmark for evaluating LLMs' capability of understanding traffic signal control plan descriptions, Chat2SPaT provides an easy-to-use plan management pipeline for traffic practitioners and researchers, serving as a potential new building block for a more accurate and versatile application of LLMs in the field of ITS. The source codes, prompts and test dataset are openly accessible at https://github.com/yuewangits/Chat2SPaT.
中文摘要:本研究提出Chat2SPaT方法,利用大语言模型将用户描述转化为精确的交通信号配时方案,准确率超过94%,为交通领域提供了便捷的智能管理解决方案。
English Summary: This study introduces Chat2SPaT, a method using large language models to convert user descriptions into precise traffic signal plans, achieving over 94% accuracy and simplifying management for transportation systems.
Authors:Lingyue Fu, Hao Guan, Bolun Zhang, Haowei Yuan, Yaoming Zhu, Jun Xu, Zongyu Wang, Lin Qiu, Xunliang Cai, Xuezhi Cao, Weiwen Liu, Weinan Zhang, Yong Yu
Abstract:
As Large Language Models (LLMs) demonstrate increasingly sophisticated code processing capabilities, evaluating their performance on engineering-level code remains challenging. Existing repository-level benchmarks primarily focus on single scenarios, such as code generation or bug fixing, without adequately capturing the diversity and complexity of real-world software or project engineering workflows. Furthermore, these benchmarks suffer from limited controllability in question positioning and reliability issues in their generated test cases. To address these limitations, we present CorePipe, a fully automated pipeline that converts repositories into comprehensive test cases, and introduce CoreCodeBench, a configurable multi-scenario repository-level benchmark. To simulate real engineering scenarios, CorePipe generates three types of atomic questions (Development, BugFix, and Test-Driven Development) specifically targeting core code segments. These atomic questions are further combined into three types of composite questions, with difficulty levels flexibly adjusted through hyperparameter tuning. CoreCodeBench provides a comprehensive and extensive repository-level benchmark to investigate the applicability of LLMs in real-world engineering projects. Experiments with 16 LLMs across diverse scenarios reveal varying capabilities and offer multi-dimensional insights into LLM performance in engineering contexts. The code for CorePipe is available at https://github.com/AGI-Eval-Official/CoreCodeBench, and the data for CoreCodeBench can be accessed at https://huggingface.co/collections/tubehhh/corecodebench-68256d2faabf4b1610a08caa.
中文摘要:CorePipe通过自动化流水线构建了可配置的多场景仓库级基准CoreCodeBench,采用原子问题和复合问题评估大语言模型在工程代码中的实际应用能力,解决了现有基准在真实工程场景覆盖度和测试可靠性方面的不足。
English Summary: CorePipe introduces an automated pipeline to create CoreCodeBench, a configurable multi-scenario repository-level benchmark that evaluates LLMs' engineering code capabilities through atomic and composite questions, addressing limitations of existing benchmarks by simulating real-world software complexity.
Authors:Weibing Zheng, Laurah Turner, Jess Kropczynski, Murat Ozer, Seth Overla, Shane Halse
Abstract:
Assisting medical students with clinical reasoning (CR) during clinical scenario training remains a persistent challenge in medical education. This paper presents the design and architecture of the Fuzzy Supervisor Agent (FSA), a novel component for the Multi-Agent Educational Clinical Scenario Simulation (MAECSS) platform. The FSA leverages a Fuzzy Inference System (FIS) to continuously interpret student interactions with specialized clinical agents (e.g., patient, physical exam, diagnostic, intervention) using pre-defined fuzzy rule bases for professionalism, medical relevance, ethical behavior, and contextual distraction. By analyzing student decision-making processes in real-time, the FSA is designed to deliver adaptive, context-aware feedback and provides assistance precisely when students encounter difficulties. This work focuses on the technical framework and rationale of the FSA, highlighting its potential to provide scalable, flexible, and human-like supervision in simulation-based medical education. Future work will include empirical evaluation and integration into broader educational settings. More detailed design and implementation is~\href{https://github.com/2sigmaEdTech/MAS/}{open sourced here}.
Chinese: 模糊监督代理(FSA)作为临床模拟平台的新组件,通过模糊逻辑监控学生互动,在临床推理训练中提供自适应的情境感知反馈。
English: The Fuzzy Supervisor Agent (FSA) is a novel component for clinical simulation platforms that uses fuzzy logic to monitor student interactions and provide adaptive, context-aware feedback during clinical reasoning training.
Authors:Hongyang Li, Sanjoy Dey, Bum Chul Kwon, Michael Danziger, Michal Rosen-Tzvi, Jianying Hu, James Kozloski, Ching-Huei Tsou, Bharath Dandala, Pablo Meyer
Abstract:
Large language models (LLMs) trained on text demonstrated remarkable results on natural language processing (NLP) tasks. These models have been adapted to decipher the language of DNA, where sequences of nucleotides act as "words" that encode genomic functions. However, the genome differs fundamentally from natural language, as it lacks clearly defined words or a consistent grammar. Although DNA language models (DNALMs) such as DNABERT, GENA-LM have achieved high level of performance on genome-related biological tasks, these models do not encode biological functions in the presence of sequence variations. To address this problem, we pre-train foundation models that effectively integrate sequence variations, in particular Single Nucleotide Polymorphisms (SNPs), as they underlie important biological functions. Specifically, we use ModernBERT to pre-train two different Biomedical Foundation Models (BMFM), namely, BMFM-DNA-REF in which the model is trained with sequences of varying lengths along with their reverse complements derived from the reference genome and BMFM-DNA-SNP in which the model is trained with sequences created using a novel representation scheme that encodes sequence variations. Our findings indicate that integrating sequence variations into DNALMs helps capture the biological functions as seen in improvements on all fine-tuning tasks. To explore the model's practical utility, we experimented with various strategies for SNP imputation on promoter detection task introduced in DNABERT-2. However, we acknowledge that the current benchmarks are limited in their ability to fully evaluate these models. To enable more comprehensive assessment in the future and encourage community contributions, we release our models through HuggingFace and the code to reproduce the results at https://github.com/BiomedSciAI/biomed-multi-omic
大型语言模型已被应用于解读DNA序列,而整合了单核苷酸多态性等序列变异的新型基础模型,在基因组任务中能更好地捕捉生物学功能。
Large language models have been adapted to decode DNA sequences, and new foundation models integrating sequence variations like SNPs show improved performance in capturing biological functions across genomic tasks.
Authors:Xiang Xu, Lingdong Kong, Song Wang, Chuanwei Zhou, Qingshan Liu
Abstract:
LiDAR representation learning aims to extract rich structural and semantic information from large-scale, readily available datasets, reducing reliance on costly human annotations. However, existing LiDAR representation strategies often overlook the inherent spatiotemporal cues in LiDAR sequences, limiting their effectiveness. In this work, we propose LiMA, a novel long-term image-to-LiDAR Memory Aggregation framework that explicitly captures longer range temporal correlations to enhance LiDAR representation learning. LiMA comprises three key components: 1) a Cross-View Aggregation module that aligns and fuses overlapping regions across neighboring camera views, constructing a more unified and redundancy-free memory bank; 2) a Long-Term Feature Propagation mechanism that efficiently aligns and integrates multi-frame image features, reinforcing temporal coherence during LiDAR representation learning; and 3) a Cross-Sequence Memory Alignment strategy that enforces consistency across driving sequences, improving generalization to unseen environments. LiMA maintains high pretraining efficiency and incurs no additional computational overhead during downstream tasks. Extensive experiments on mainstream LiDAR-based perception benchmarks demonstrate that LiMA significantly improves both LiDAR semantic segmentation and 3D object detection. We hope this work inspires more effective pretraining paradigms for autonomous driving. The code has be made publicly accessible for future research.
中文: LiMA是一种新颖的长时记忆聚合框架,通过跨视图聚合、长时特征传播和跨序列对齐机制捕捉长程时序关联,显著提升了激光雷达语义分割与3D目标检测性能,且在下游任务中不产生额外计算开销。
English: LiMA is a novel long-term memory aggregation framework that enhances LiDAR representation learning by capturing extended temporal correlations through cross-view aggregation, long-term feature propagation, and cross-sequence alignment, significantly improving performance in semantic segmentation and 3D object detection without added computational costs during downstream tasks.
Authors:Ziqi Miao, Lijun Li, Yuan Xiong, Zhenhua Liu, Pengyu Zhu, Jing Shao
Abstract:
Contextual priming, where earlier stimuli covertly bias later judgments, offers an unexplored attack surface for large language models (LLMs). We uncover a contextual priming vulnerability in which the previous response in the dialogue can steer its subsequent behavior toward policy-violating content. Building on this insight, we propose Response Attack, which uses an auxiliary LLM to generate a mildly harmful response to a paraphrased version of the original malicious query. They are then formatted into the dialogue and followed by a succinct trigger prompt, thereby priming the target model to generate harmful content. Across eight open-source and proprietary LLMs, RA consistently outperforms seven state-of-the-art jailbreak techniques, achieving higher attack success rates. To mitigate this threat, we construct and release a context-aware safety fine-tuning dataset, which significantly reduces the attack success rate while preserving model capabilities. The code and data are available at https://github.com/Dtc7w3PQ/Response-Attack.
中文摘要:本文提出"响应攻击"方法,通过辅助大语言模型生成轻微有害响应来利用上下文启动漏洞,引导目标模型产生违规内容,该方法在八大模型中均优于现有越狱技术,并构建了上下文感知安全微调数据集以降低攻击成功率。
English Summary: This paper introduces Response Attack, a method exploiting contextual priming vulnerabilities in large language models by using an auxiliary LLM to generate mildly harmful responses that steer target models toward policy-violating content, demonstrating superior effectiveness over existing jailbreak techniques while proposing a safety fine-tuning dataset for mitigation.
Authors:Zongyan Han, Mohamed El Amine Boudjoghra, Jiahua Dong, Jinhong Wang, Rao Muhammad Anwer
Abstract:
Unified segmentation of 3D point clouds is crucial for scene understanding, but is hindered by its sparse structure, limited annotations, and the challenge of distinguishing fine-grained object classes in complex environments. Existing methods often struggle to capture rich semantic and contextual information due to limited supervision and a lack of diverse multimodal cues, leading to suboptimal differentiation of classes and instances. To address these challenges, we propose VDG-Uni3DSeg, a novel framework that integrates pre-trained vision-language models (e.g., CLIP) and large language models (LLMs) to enhance 3D segmentation. By leveraging LLM-generated textual descriptions and reference images from the internet, our method incorporates rich multimodal cues, facilitating fine-grained class and instance separation. We further design a Semantic-Visual Contrastive Loss to align point features with multimodal queries and a Spatial Enhanced Module to model scene-wide relationships efficiently. Operating within a closed-set paradigm that utilizes multimodal knowledge generated offline, VDG-Uni3DSeg achieves state-of-the-art results in semantic, instance, and panoptic segmentation, offering a scalable and practical solution for 3D understanding. Our code is available at https://github.com/Hanzy1996/VDG-Uni3DSeg.
中文: VDG-Uni3DSeg提出了一种创新框架,通过融合视觉语言和大语言模型,利用多模态线索和专门设计的模块来增强三维点云分割,在语义、实例和全景分割任务中均取得了领先性能。
English: VDG-Uni3DSeg introduces a novel framework that integrates vision-language and large language models to enhance 3D point cloud segmentation by leveraging multimodal cues and specialized modules, achieving state-of-the-art performance in semantic, instance, and panoptic segmentation.
Authors:Yijia Hong, Jiangning Zhang, Ran Yi, Yuji Wang, Weijian Cao, Xiaobin Hu, Zhucun Xue, Yabiao Wang, Chengjie Wang, Lizhuang Ma
Abstract:
Generating intermediate video content of varying lengths based on given first and last frames, along with text prompt information, offers significant research and application potential. However, traditional frame interpolation tasks primarily focus on scenarios with a small number of frames, no text control, and minimal differences between the first and last frames. Recent community developers have utilized large video models represented by Wan to endow frame-to-frame capabilities. However, these models can only generate a fixed number of frames and often fail to produce satisfactory results for certain frame lengths, while this setting lacks a clear official definition and a well-established benchmark. In this paper, we first propose a new practical Semantic Frame Interpolation (SFI) task from the perspective of academic definition, which covers the above two settings and supports inference at multiple frame rates. To achieve this goal, we propose a novel SemFi model building upon Wan2.1, which incorporates a Mixture-of-LoRA module to ensure the generation of high-consistency content that aligns with control conditions across various frame length limitations. Furthermore, we propose SFI-300K, the first general-purpose dataset and benchmark specifically designed for SFI. To support this, we collect and process data from the perspective of SFI, carefully designing evaluation metrics and methods to assess the model's performance across multiple dimensions, encompassing image and video, and various aspects, including consistency and diversity. Through extensive experiments on SFI-300K, we demonstrate that our method is particularly well-suited to meet the requirements of the SFI task.
中文摘要:本文提出了语义帧插值(SFI)新任务,开发了基于Wan2.1的SemFi模型,通过混合LoRA模块实现在不同帧长限制下生成符合控制条件的高一致性内容,并创建了首个专用数据集SFI-300K进行多维评估验证。
English Summary: This paper introduces the Semantic Frame Interpolation (SFI) task and proposes the SemFi model with a Mixture-of-LoRA module to generate variable-length intermediate videos between given frames while maintaining content consistency and control alignment, supported by the newly created SFI-300K benchmark dataset.
Authors:Benjamin R. Toaz, Shaunak D. Bopardikar
Abstract:
We formulate a vector cost alternative to the scalarization method for weighting and combining multi-objective costs. The algorithm produces solutions to bimatrix games that are simultaneously pure, unique Nash equilibria and Pareto optimal with guarantees for avoiding worst case outcomes. We achieve this by enforcing exact potential game constraints to guide cost adjustments towards equilibrium, while minimizing the deviation from the original cost structure. The magnitude of this adjustment serves as a metric for differentiating between Pareto optimal solutions. We implement this approach in a racing competition between agents with heterogeneous cost structures, resulting in fewer collision incidents with a minimal decrease in performance. Code is available at https://github.com/toazbenj/race_simulation.
中文: 本研究提出了一种向量成本方法替代标量化处理多目标优化,通过势博弈约束在双矩阵博弈中实现纯纳什均衡和帕累托最优,并在智能体赛车模拟中有效减少碰撞且性能损失最小。
English: This study introduces a vector cost method as an alternative to scalarization for multi-objective optimization, achieving pure Nash equilibria and Pareto optimality in bimatrix games through potential game constraints, with demonstrated effectiveness in reducing collisions in agent racing simulations.
Authors:Nusrat Munia, Junfeng Zhu, Olfa Nasraoui, Abdullah-Al-Zubaer Imran
Abstract:
Social networks can be a valuable source of information during crisis events. In particular, users can post a stream of multimodal data that can be critical for real-time humanitarian response. However, effectively extracting meaningful information from this large and noisy data stream and effectively integrating heterogeneous data remains a formidable challenge. In this work, we explore vision language models (VLMs) and advanced fusion strategies to enhance the classification of crisis data in three different tasks. We incorporate LLaVA-generated text to improve text-image alignment. Additionally, we leverage Contrastive Language-Image Pretraining (CLIP)-based vision and text embeddings, which, without task-specific fine-tuning, outperform traditional models. To further refine multimodal fusion, we employ Guided Cross Attention (Guided CA) and combine it with the Differential Attention mechanism to enhance feature alignment by emphasizing critical information while filtering out irrelevant content. Our results show that while Differential Attention improves classification performance, Guided CA remains highly effective in aligning multimodal features. Extensive experiments on the CrisisMMD benchmark data set demonstrate that the combination of pretrained VLMs, enriched textual descriptions, and adaptive fusion strategies consistently outperforms state-of-the-art models in classification accuracy, contributing to more reliable and interpretable models for three different tasks that are crucial for disaster response. Our code is available at https://github.com/Munia03/Multimodal_Crisis_Event.
中文: 本研究通过融合视觉语言模型与先进的多模态融合策略,显著提升了危机数据的分类性能,在灾害应对任务中实现了更优的准确性和可解释性。
English: This study enhances crisis data classification by integrating vision-language models with advanced fusion strategies, achieving superior accuracy and interpretability for disaster response tasks.
Authors:Nicholas Chivaran, Jianbing Ni
Abstract:
The recent proliferation of photorealistic AI-generated images (AIGI) has raised urgent concerns about their potential misuse, particularly on social media platforms. Current state-of-the-art AIGI detection methods typically rely on large, deep neural architectures, creating significant computational barriers to real-time, large-scale deployment on platforms like social media. To challenge this reliance on computationally intensive models, we introduce LAID, the first framework -- to our knowledge -- that benchmarks and evaluates the detection performance and efficiency of off-the-shelf lightweight neural networks. In this framework, we comprehensively train and evaluate selected models on a representative subset of the GenImage dataset across spatial, spectral, and fusion image domains. Our results demonstrate that lightweight models can achieve competitive accuracy, even under adversarial conditions, while incurring substantially lower memory and computation costs compared to current state-of-the-art methods. This study offers valuable insight into the trade-off between efficiency and performance in AIGI detection and lays a foundation for the development of practical, scalable, and trustworthy detection systems. The source code of LAID can be found at: https://github.com/nchivar/LAID.
中文摘要:LAID框架研究表明,轻量级神经网络能以显著降低的计算成本实现与现有方法相媲美的AI生成图像检测精度,为社交媒体平台的可扩展部署提供了实用解决方案。
English summary: The LAID framework demonstrates that lightweight neural networks can achieve competitive accuracy in detecting AI-generated images with significantly lower computational costs, offering a practical solution for scalable deployment on social media platforms.
Authors:Yingyu Yang, Qianye Yang, Kangning Cui, Can Peng, Elena D'Alberti, Netzahualcoyotl Hernandez-Cruz, Olga Patey, Aris T. Papageorghiou, J. Alison Noble
Abstract:
The identification of cardiac phase is an essential step for analysis and diagnosis of cardiac function. Automatic methods, especially data-driven methods for cardiac phase detection, typically require extensive annotations, which is time-consuming and labor-intensive. In this paper, we present an unsupervised framework for end-diastole (ED) and end-systole (ES) detection through self-supervised learning of latent cardiac motion trajectories from 4-chamber-view echocardiography videos. Our method eliminates the need for manual annotations, including ED and ES indices, segmentation, or volumetric measurements, by training a reconstruction model to encode interpretable spatiotemporal motion patterns. Evaluated on the EchoNet-Dynamic benchmark, the approach achieves mean absolute error (MAE) of 3 frames (58.3 ms) for ED and 2 frames (38.8 ms) for ES detection, matching state-of-the-art supervised methods. Extended to fetal echocardiography, the model demonstrates robust performance with MAE 1.46 frames (20.7 ms) for ED and 1.74 frames (25.3 ms) for ES, despite the fact that the fetal heart model is built using non-standardized heart views due to fetal heart positioning variability. Our results demonstrate the potential of the proposed latent motion trajectory strategy for cardiac phase detection in adult and fetal echocardiography. This work advances unsupervised cardiac motion analysis, offering a scalable solution for clinical populations lacking annotated data. Code will be released at https://github.com/YingyuYyy/CardiacPhase.
中文: 本文提出了一种无监督框架,通过自监督学习从超声心动图视频中检测心脏相位而无需人工标注,在成人和胎儿应用中均实现了与有监督方法相当的性能。
English: This paper introduces an unsupervised framework using self-supervised learning to detect cardiac phases from echocardiography videos without manual annotations, achieving performance comparable to supervised methods in both adult and fetal applications.
Authors:Aadi Srivastava, Vignesh Natarajkumar, Utkarsh Bheemanaboyna, Devisree Akashapu, Nagraj Gaonkar, Archit Joshi
Abstract:
The widespread and rapid adoption of AI-generated content, created by models such as Generative Adversarial Networks (GANs) and Diffusion Models, has revolutionized the digital media landscape by allowing efficient and creative content generation. However, these models also blur the difference between real images and AI-generated synthetic images, raising concerns regarding content authenticity and integrity. While many existing solutions to detect fake images focus solely on classification and higher-resolution images, they often lack transparency in their decision-making, making it difficult for users to understand why an image is classified as fake. In this paper, we present VERITAS, a comprehensive framework that not only accurately detects whether a small (32x32) image is AI-generated but also explains why it was classified that way through artifact localization and semantic reasoning. VERITAS produces human-readable explanations that describe key artifacts in synthetic images. We show that this architecture offers clear explanations of the basis of zero-shot synthetic image detection tasks. Code and relevant prompts can be found at https://github.com/V-i-g-n-e-s-h-N/VERITAS .
AI生成内容虽革新了数字媒体,却引发真实性担忧,为此提出VERITAS框架,通过定位伪影和生成可读解释,实现对合成图像的检测与归因分析。
AI-generated content has transformed digital media but challenges authenticity, prompting the development of VERITAS, a framework that detects and explains synthetic images through artifact localization and human-readable reasoning.
Authors:Juyi Lin, Amir Taherin, Arash Akbari, Arman Akbari, Lei Lu, Guangyu Chen, Taskin Padir, Xiaomeng Yang, Weiwei Chen, Yiqian Li, Xue Lin, David Kaeli, Pu Zhao, Yanzhi Wang
Abstract:
Recent large-scale Vision Language Action (VLA) models have shown superior performance in robotic manipulation tasks guided by natural language. However, current VLA models suffer from two drawbacks: (i) generation of massive tokens leading to high inference latency and increased training cost, and (ii) insufficient utilization of generated actions resulting in potential performance loss. To address these issues, we develop a training framework to finetune VLA models for generating significantly fewer action tokens with high parallelism, effectively reducing inference latency and training cost. Furthermore, we introduce an inference optimization technique with a novel voting-based ensemble strategy to combine current and previous action predictions, improving the utilization of generated actions and overall performance. Our results demonstrate that we achieve superior performance compared with state-of-the-art VLA models, achieving significantly higher success rates and 39$\times$ faster inference than OpenVLA with 46 Hz throughput on edge platforms, demonstrating practical deployability. The code is available at https://github.com/LukeLIN-web/VOTE.
中文摘要:本研究提出的训练框架和推理优化技术显著减少了视觉语言动作模型的行动令牌与延迟,同时通过改进行动利用提升了性能,实现了比现有最优方法更快的推理速度和更高的成功率。
English summary: This study introduces a training framework and inference optimization technique that significantly reduces action tokens and latency while enhancing performance in Vision Language Action models, achieving faster inference and higher success rates than state-of-the-art methods.
Authors:Juyi Lin, Amir Taherin, Arash Akbari, Arman Akbari, Lei Lu, Guangyu Chen, Taskin Padir, Xiaomeng Yang, Weiwei Chen, Yiqian Li, Xue Lin, David Kaeli, Pu Zhao, Yanzhi Wang
Abstract:
Recent large-scale Vision Language Action (VLA) models have shown superior performance in robotic manipulation tasks guided by natural language. However, current VLA models suffer from two drawbacks: (i) generation of massive tokens leading to high inference latency and increased training cost, and (ii) insufficient utilization of generated actions resulting in potential performance loss. To address these issues, we develop a training framework to finetune VLA models for generating significantly fewer action tokens with high parallelism, effectively reducing inference latency and training cost. Furthermore, we introduce an inference optimization technique with a novel voting-based ensemble strategy to combine current and previous action predictions, improving the utilization of generated actions and overall performance. Our results demonstrate that we achieve superior performance compared with state-of-the-art VLA models, achieving significantly higher success rates and 39$\times$ faster inference than OpenVLA with 46 Hz throughput on edge platforms, demonstrating practical deployability. The code is available at https://github.com/LukeLIN-web/VOTE.
中文摘要:本研究提出的训练框架和推理优化技术显著减少了视觉语言动作模型的行动令牌与延迟,同时通过改进行动利用提升了性能,实现了比现有最优方法更快的推理速度和更高的成功率。
English summary: This study introduces a training framework and inference optimization technique that significantly reduces action tokens and latency while enhancing performance in Vision Language Action models, achieving faster inference and higher success rates than state-of-the-art methods.
Authors:Binyan Xu, Fan Yang, Xilin Dai, Di Tang, Kehuan Zhang
Abstract:
Deep Neural Networks (DNNs) are susceptible to backdoor attacks, where adversaries poison training data to implant backdoor into the victim model. Current backdoor defenses on poisoned data often suffer from high computational costs or low effectiveness against advanced attacks like clean-label and clean-image backdoors. To address them, we introduce CLIP-Guided backdoor Defense (CGD), an efficient and effective method that mitigates various backdoor attacks. CGD utilizes a publicly accessible CLIP model to identify inputs that are likely to be clean or poisoned. It then retrains the model with these inputs, using CLIP's logits as a guidance to effectively neutralize the backdoor. Experiments on 4 datasets and 11 attack types demonstrate that CGD reduces attack success rates (ASRs) to below 1% while maintaining clean accuracy (CA) with a maximum drop of only 0.3%, outperforming existing defenses. Additionally, we show that clean-data-based defenses can be adapted to poisoned data using CGD. Also, CGD exhibits strong robustness, maintaining low ASRs even when employing a weaker CLIP model or when CLIP itself is compromised by a backdoor. These findings underscore CGD's exceptional efficiency, effectiveness, and applicability for real-world backdoor defense scenarios. Code: https://github.com/binyxu/CGD.
Chinese: 提出的CLIP引导后门防御方法(CGD)利用公开CLIP模型区分并重新训练干净与中毒输入,在多种数据集和攻击类型中实现了接近零的攻击成功率,同时对准确率影响极小。
English: The proposed CLIP-Guided backdoor Defense (CGD) effectively mitigates various backdoor attacks by leveraging a public CLIP model to distinguish and retrain on clean versus poisoned inputs, achieving near-zero attack success rates with minimal impact on accuracy across multiple datasets and attack types.
Authors:Yuyi Zhang, Peirong Zhang, Zhenhua Yang, Pengyu Yan, Yongxin Shi, Pengwei Liu, Fengjun Guo, Lianwen Jin
Abstract:
Historical documents represent an invaluable cultural heritage, yet have undergone significant degradation over time through tears, water erosion, and oxidation. Existing Historical Document Restoration (HDR) methods primarily focus on single modality or limited-size restoration, failing to meet practical needs. To fill this gap, we present a full-page HDR dataset (FPHDR) and a novel automated HDR solution (AutoHDR). Specifically, FPHDR comprises 1,633 real and 6,543 synthetic images with character-level and line-level locations, as well as character annotations in different damage grades. AutoHDR mimics historians' restoration workflows through a three-stage approach: OCR-assisted damage localization, vision-language context text prediction, and patch autoregressive appearance restoration. The modular architecture of AutoHDR enables seamless human-machine collaboration, allowing for flexible intervention and optimization at each restoration stage. Experiments demonstrate AutoHDR's remarkable performance in HDR. When processing severely damaged documents, our method improves OCR accuracy from 46.83% to 84.05%, with further enhancement to 94.25% through human-machine collaboration. We believe this work represents a significant advancement in automated historical document restoration and contributes substantially to cultural heritage preservation. The model and dataset are available at https://github.com/SCUT-DLVCLab/AutoHDR.
中文: 本文提出新型自动化历史文档修复系统(AutoHDR)和完整数据集(FPHDR),通过三阶段工作流程和人机协作显著提升OCR识别准确率,在文化遗产保护领域实现重要突破。
English: This paper introduces a novel automated historical document restoration system (AutoHDR) and a comprehensive dataset (FPHDR), which significantly enhances OCR accuracy through a three-stage workflow and human-machine collaboration, representing a major advancement in cultural heritage preservation.
Authors:Xinzhe Zheng, Hao Du, Fanding Xu, Jinzhe Li, Zhiyuan Liu, Wenkang Wang, Tao Chen, Wanli Ouyang, Stan Z. Li, Yan Lu, Nanqing Dong, Yang Zhang
Abstract:
Deep learning-based computational methods have achieved promising results in predicting protein-protein interactions (PPIs). However, existing benchmarks predominantly focus on isolated pairwise evaluations, overlooking a model's capability to reconstruct biologically meaningful PPI networks, which is crucial for biology research. To address this gap, we introduce PRING, the first comprehensive benchmark that evaluates protein-protein interaction prediction from a graph-level perspective. PRING curates a high-quality, multi-species PPI network dataset comprising 21,484 proteins and 186,818 interactions, with well-designed strategies to address both data redundancy and leakage. Building on this golden-standard dataset, we establish two complementary evaluation paradigms: (1) topology-oriented tasks, which assess intra and cross-species PPI network construction, and (2) function-oriented tasks, including protein complex pathway prediction, GO module analysis, and essential protein justification. These evaluations not only reflect the model's capability to understand the network topology but also facilitate protein function annotation, biological module detection, and even disease mechanism analysis. Extensive experiments on four representative model categories, consisting of sequence similarity-based, naive sequence-based, protein language model-based, and structure-based approaches, demonstrate that current PPI models have potential limitations in recovering both structural and functional properties of PPI networks, highlighting the gap in supporting real-world biological applications. We believe PRING provides a reliable platform to guide the development of more effective PPI prediction models for the community. The dataset and source code of PRING are available at https://github.com/SophieSarceau/PRING.
中文: PRING是首个从图层面评估蛋白质相互作用预测的综合基准,通过多物种网络拓扑和功能特性评估,揭示了现有模型在支持实际生物应用方面的局限性。
English: PRING is the first comprehensive benchmark that evaluates protein-protein interaction prediction from a graph-level perspective, addressing limitations in current models by assessing both network topology and functional properties across multiple species.
Authors:Hongyao Yu, Yixiang Qiu, Yiheng Yang, Hao Fang, Tianqu Zhuang, Jiaxin Hong, Bin Chen, Hao Wu, Shu-Tao Xia
Abstract:
Autoregressive image generation has witnessed rapid advancements, with prominent models such as scale-wise visual auto-regression pushing the boundaries of visual synthesis. However, these developments also raise significant concerns regarding data privacy and copyright. In response, training data detection has emerged as a critical task for identifying unauthorized data usage in model training. To better understand the vulnerability of autoregressive image generative models to such detection, we conduct the first study applying membership inference to this domain. Our approach comprises two key components: implicit classification and an adaptive score aggregation strategy. First, we compute the implicit token-wise classification score within the query image. Then we propose an adaptive score aggregation strategy to acquire a final score, which places greater emphasis on the tokens with lower scores. A higher final score indicates that the sample is more likely to be involved in the training set. To validate the effectiveness of our method, we adapt existing detection algorithms originally designed for LLMs to visual autoregressive models. Extensive experiments demonstrate the superiority of our method in both class-conditional and text-to-image scenarios. Moreover, our approach exhibits strong robustness and generalization under various data transformations. Furthermore, sufficient experiments suggest two novel key findings: (1) A linear scaling law on membership inference, exposing the vulnerability of large foundation models. (2) Training data from scale-wise visual autoregressive models is easier to detect than other autoregressive paradigms.Our code is available at https://github.com/Chrisqcwx/ImageAR-MIA.
中文: 本研究首次针对自回归图像生成模型提出成员推理方法,通过隐式分类与自适应分数聚合策略有效检测训练数据滥用,并揭示大型基础模型的脆弱性及不同范式间的检测差异。
English: This study introduces the first membership inference method for autoregressive image generative models, combining implicit classification with adaptive score aggregation to effectively detect unauthorized training data usage and revealing vulnerabilities in large foundation models.
Authors:Jan Carreras Boada, Rao Muhammad Umer, Carsten Marr
Abstract:
Biomedical datasets are often constrained by stringent privacy requirements and frequently suffer from severe class imbalance. These two aspects hinder the development of accurate machine learning models. While generative AI offers a promising solution, producing synthetic images of sufficient quality for training robust classifiers remains challenging. This work addresses the classification of individual white blood cells, a critical task in diagnosing hematological malignancies such as acute myeloid leukemia (AML). We introduce CytoDiff, a stable diffusion model fine-tuned with LoRA weights and guided by few-shot samples that generates high-fidelity synthetic white blood cell images. Our approach demonstrates substantial improvements in classifier performance when training data is limited. Using a small, highly imbalanced real dataset, the addition of 5,000 synthetic images per class improved ResNet classifier accuracy from 27\% to 78\% (+51\%). Similarly, CLIP-based classification accuracy increased from 62\% to 77\% (+15\%). These results establish synthetic image generation as a valuable tool for biomedical machine learning, enhancing data coverage and facilitating secure data sharing while preserving patient privacy. Paper code is publicly available at https://github.com/JanCarreras24/CytoDiff.
中文摘要:生物医学数据集常受隐私限制和类别不平衡的困扰,影响机器学习准确性,而CytoDiff这一稳定扩散模型通过生成高质量合成白细胞图像,显著提升了分类器性能,将ResNet准确率提高51%、CLIP提高15%。
English Summary: Biomedical datasets face privacy constraints and class imbalance, hindering accurate machine learning, but CytoDiff, a stable diffusion model, generates high-fidelity synthetic white blood cell images that significantly boost classifier performance, improving ResNet accuracy by 51% and CLIP by 15%.
Authors:Katarina C. Poole, Julie Meyer, Vincent Martin, Rapolas Daugintis, Nils Marggraf-Turley, Jack Webb, Ludovic Pirard, Nicola La Magna, Oliver Turvey, Lorenzo Picinali
Abstract:
Headphone-based spatial audio uses head-related transfer functions (HRTFs) to simulate real-world acoustic environments. HRTFs are unique to everyone, due to personal morphology, shaping how sound waves interact with the body before reaching the eardrums. Here we present the extended SONICOM HRTF dataset which expands on the previous version released in 2023. The total number of measured subjects has now been increased to 300, with demographic information for a subset of the participants, providing context for the dataset's population and relevance. The dataset incorporates synthesised HRTFs for 200 of the 300 subjects, generated using Mesh2HRTF, alongside pre-processed 3D scans of the head and ears, optimised for HRTF synthesis. This rich dataset facilitates rapid and iterative optimisation of HRTF synthesis algorithms, allowing the automatic generation of large data. The optimised scans enable seamless morphological modifications, providing insights into how anatomical changes impact HRTFs, and the larger sample size enhances the effectiveness of machine learning approaches. To support analysis, we also introduce the Spatial Audio Metrics (SAM) Toolbox, a Python package designed for efficient analysis and visualisation of HRTF data, offering customisable tools for advanced research. Together, the extended dataset and toolbox offer a comprehensive resource for advancing personalised spatial audio research and development.
中文:扩展的SONICOM HRTF数据集将测量对象增至300人,包含合成HRTF和3D扫描数据,支持快速算法优化与解剖特征影响研究,同时新推出的SAM工具箱提供基于Python的分析工具,共同推动个性化空间音频研发。
English: The extended SONICOM HRTF dataset expands to 300 subjects with synthesized HRTFs and 3D scans, enabling rapid algorithm optimization and anatomical impact studies, while the new SAM Toolbox provides Python-based analysis tools for advancing personalized spatial audio research.
Authors:Ricardo Cardoso, Plinio Moreno
Abstract:
Inertial mass plays a crucial role in robotic applications such as object grasping, manipulation, and simulation, providing a strong prior for planning and control. Accurately estimating an object's mass before interaction can significantly enhance the performance of various robotic tasks. However, mass estimation using only vision sensors is a relatively underexplored area. This paper proposes a novel approach combining sparse point-cloud data from depth images with RGB images to estimate the mass of objects. We evaluate a range of point-cloud processing architectures, alongside RGB-only methods. To overcome the limited availability of training data, we create a synthetic dataset using ShapeNetSem 3D models, simulating RGBD images via a Kinect camera. This synthetic data is used to train an image generation model for estimating dense depth maps, which we then use to augment an existing dataset of images paired with mass values. Our approach significantly outperforms existing benchmarks across all evaluated metrics. The data generation (https://github.com/RavineWindteer/ShapenetSem-to-RGBD) as well as the training of the depth estimator (https://github.com/RavineWindteer/GLPDepth-Edited) and the mass estimator (https://github.com/RavineWindteer/Depth-mass-estimator) are available online.
Chinese: 本文提出了一种结合深度图像稀疏点云与RGB图像的新方法,通过合成数据克服训练数据不足,实现了物体质量的精确估计,并在各项评估指标上显著优于现有基准。
English: This paper introduces a novel method that combines sparse point-cloud data from depth images with RGB images to accurately estimate object mass, significantly outperforming existing benchmarks by using synthetic data to overcome training data limitations.
Authors:Soham Walimbe, Britty Baby, Vinkle Srivastav, Nicolas Padoy
Abstract:
Surgical AI often involves multiple tasks within a single procedure, like phase recognition or assessing the Critical View of Safety in laparoscopic cholecystectomy. Traditional models, built for one task at a time, lack flexibility, requiring a separate model for each. To address this, we introduce MML-SurgAdapt, a unified multi-task framework with Vision-Language Models (VLMs), specifically CLIP, to handle diverse surgical tasks through natural language supervision. A key challenge in multi-task learning is the presence of partial annotations when integrating different tasks. To overcome this, we employ Single Positive Multi-Label (SPML) learning, which traditionally reduces annotation burden by training models with only one positive label per instance. Our framework extends this approach to integrate data from multiple surgical tasks within a single procedure, enabling effective learning despite incomplete or noisy annotations. We demonstrate the effectiveness of our model on a combined dataset consisting of Cholec80, Endoscapes2023, and CholecT50, utilizing custom prompts. Extensive evaluation shows that MML-SurgAdapt performs comparably to task-specific benchmarks, with the added advantage of handling noisy annotations. It also outperforms the existing SPML frameworks for the task. By reducing the required labels by 23%, our approach proposes a more scalable and efficient labeling process, significantly easing the annotation burden on clinicians. To our knowledge, this is the first application of SPML to integrate data from multiple surgical tasks, presenting a novel and generalizable solution for multi-task learning in surgical computer vision. Implementation is available at: https://github.com/CAMMA-public/MML-SurgAdapt
中文:MML-SurgAdapt提出了一种基于视觉语言模型的统一多任务框架,通过自然语言监督处理多种外科手术任务,采用单正例多标签学习方法有效应对部分标注问题,在减少23%标注需求的同时保持了与专用模型相当的性能。
English: MML-SurgAdapt introduces a unified multi-task framework using Vision-Language Models to handle diverse surgical tasks with natural language supervision, effectively addressing partial annotations through Single Positive Multi-Label learning and reducing labeling requirements by 23% while maintaining performance comparable to task-specific models.
Authors:Britty Baby, Vinkle Srivastav, Pooja P. Jain, Kun Yuan, Pietro Mascagni, Nicolas Padoy
Abstract:
The Critical View of Safety (CVS) is crucial for safe laparoscopic cholecystectomy, yet assessing CVS criteria remains a complex and challenging task, even for experts. Traditional models for CVS recognition depend on vision-only models learning with costly, labor-intensive spatial annotations. This study investigates how text can be harnessed as a powerful tool for both training and inference in multi-modal surgical foundation models to automate CVS recognition. Unlike many existing multi-modal models, which are primarily adapted for multi-class classification, CVS recognition requires a multi-label framework. Zero-shot evaluation of existing multi-modal surgical models shows a significant performance gap for this task. To address this, we propose CVS-AdaptNet, a multi-label adaptation strategy that enhances fine-grained, binary classification across multiple labels by aligning image embeddings with textual descriptions of each CVS criterion using positive and negative prompts. By adapting PeskaVLP, a state-of-the-art surgical foundation model, on the Endoscapes-CVS201 dataset, CVS-AdaptNet achieves 57.6 mAP, improving over the ResNet50 image-only baseline (51.5 mAP) by 6 points. Our results show that CVS-AdaptNet's multi-label, multi-modal framework, enhanced by textual prompts, boosts CVS recognition over image-only methods. We also propose text-specific inference methods, that helps in analysing the image-text alignment. While further work is needed to match state-of-the-art spatial annotation-based methods, this approach highlights the potential of adapting generalist models to specialized surgical tasks. Code: https://github.com/CAMMA-public/CVS-AdaptNet
中文: 本研究提出CVS-AdaptNet多模态框架,通过图像嵌入与文本提示的对齐,显著提升了腹腔镜胆囊切除术中安全临界视图的识别能力,相比纯图像方法将mAP提高了6个百分点。
English: This study introduces CVS-AdaptNet, a multi-modal framework that enhances Critical View of Safety recognition in laparoscopic cholecystectomy by aligning image embeddings with textual prompts, achieving a 6-point mAP improvement over image-only methods.
Authors:Qinkai Yu, Jianyang Xie, Yitian Zhao, Cheng Chen, Lijun Zhang, Liming Chen, Jun Cheng, Lu Liu, Yalin Zheng, Yanda Meng
Abstract:
Multimodal ophthalmic imaging-based diagnosis integrates color fundus image with optical coherence tomography (OCT) to provide a comprehensive view of ocular pathologies. However, the uneven global distribution of healthcare resources often results in real-world clinical scenarios encountering incomplete multimodal data, which significantly compromises diagnostic accuracy. Existing commonly used pipelines, such as modality imputation and distillation methods, face notable limitations: 1)Imputation methods struggle with accurately reconstructing key lesion features, since OCT lesions are localized, while fundus images vary in style. 2)distillation methods rely heavily on fully paired multimodal training data. To address these challenges, we propose a novel multimodal alignment and fusion framework capable of robustly handling missing modalities in the task of ophthalmic diagnostics. By considering the distinctive feature characteristics of OCT and fundus images, we emphasize the alignment of semantic features within the same category and explicitly learn soft matching between modalities, allowing the missing modality to utilize existing modality information, achieving robust cross-modal feature alignment under the missing modality. Specifically, we leverage the Optimal Transport for multi-scale modality feature alignment: class-wise alignment through predicted class prototypes and feature-wise alignment via cross-modal shared feature transport. Furthermore, we propose an asymmetric fusion strategy that effectively exploits the distinct characteristics of OCT and fundus modalities. Extensive evaluations on three large ophthalmic multimodal datasets demonstrate our model's superior performance under various modality-incomplete scenarios, achieving Sota performance in both complete modality and inter-modality incompleteness conditions. Code is available at https://github.com/Qinkaiyu/RIMA
中文总结:本文提出了一种新颖的多模态对齐融合框架,通过最优传输实现特征对齐和采用非对称融合策略,能够稳健处理眼科诊断中OCT或眼底图像缺失的情况,在各种不完整数据场景下均展现出最优性能。
English Summary: This paper introduces a novel multimodal alignment and fusion framework that robustly handles missing OCT or fundus image modalities in ophthalmic diagnostics by employing optimal transport for feature alignment and asymmetric fusion strategies, demonstrating state-of-the-art performance across various incomplete data scenarios.
Authors:Qinkai Yu, Wei Zhou, Hantao Liu, Yanyu Xu, Meng Wang, Yitian Zhao, Huazhu Fu, Xujiong Ye, Yalin Zheng, Yanda Meng
Abstract:
As a long-term complication of diabetes, diabetic retinopathy (DR) progresses slowly, potentially taking years to threaten vision. An accurate and robust evaluation of its severity is vital to ensure prompt management and care. Ordinal regression leverages the underlying inherent order between categories to achieve superior performance beyond traditional classification. However, there exist challenges leading to lower DR classification performance: 1) The uneven distribution of DR severity levels, characterized by a long-tailed pattern, adds complexity to the grading process. 2)The ambiguity in defining category boundaries introduces additional challenges, making the classification process more complex and prone to inconsistencies. This work proposes a novel autoregressive ordinal regression method called AOR-DR to address the above challenges by leveraging the clinical knowledge of inherent ordinal information in DR grading dataset settings. Specifically, we decompose the DR grading task into a series of ordered steps by fusing the prediction of the previous steps with extracted image features as conditions for the current prediction step. Additionally, we exploit the diffusion process to facilitate conditional probability modeling, enabling the direct use of continuous global image features for autoregression without relearning contextual information from patch-level features. This ensures the effectiveness of the autoregressive process and leverages the capabilities of pre-trained large-scale foundation models. Extensive experiments were conducted on four large-scale publicly available color fundus datasets, demonstrating our model's effectiveness and superior performance over six recent state-of-the-art ordinal regression methods. The implementation code is available at https://github.com/Qinkaiyu/AOR-DR.
Chinese: 本研究提出AOR-DR自回归序数回归方法,通过利用糖尿病视网膜病变分级中的临床序数信息和扩散过程,解决了类别不平衡和边界模糊问题,在多个公开数据集上展现出优越性能。
English: This study introduces AOR-DR, an autoregressive ordinal regression method that addresses challenges in diabetic retinopathy grading by leveraging clinical ordinal information and diffusion processes to improve classification accuracy across imbalanced datasets.
Authors:Yingshan Liang, Keyu Fan, Zhicheng Du, Yiran Wang, Qingyang Shi, Xinyu Zhang, Jiasheng Lu, Peiwu Qin
Abstract:
Video-to-audio (V2A) generation shows great potential in fields such as film production. Despite significant advances, current V2A methods relying on global video information struggle with complex scenes and generating audio tailored to specific objects. To address these limitations, we introduce Hear-Your-Click, an interactive V2A framework enabling users to generate sounds for specific objects by clicking on the frame. To achieve this, we propose Object-aware Contrastive Audio-Visual Fine-tuning (OCAV) with a Mask-guided Visual Encoder (MVE) to obtain object-level visual features aligned with audio. Furthermore, we tailor two data augmentation strategies, Random Video Stitching (RVS) and Mask-guided Loudness Modulation (MLM), to enhance the model's sensitivity to segmented objects. To measure audio-visual correspondence, we designed a new evaluation metric, the CAV score. Extensive experiments demonstrate that our framework offers more precise control and improves generation performance across various metrics. Project Page: https://github.com/SynapGrid/Hear-Your-Click
中文:Hear-Your-Click框架通过让用户点击视频帧中的特定对象,结合对象感知的视觉特征和定制化数据增强,实现了交互式的视频到音频生成,从而提升了控制的精确性和生成效果。
English: The Hear-Your-Click framework enables interactive video-to-audio generation by allowing users to click on specific objects in a frame, utilizing object-aware visual features and tailored data augmentation to improve precision and performance.
Authors:Kefan Tang, Lihuo He, Jisheng Dang, Xinbo Gao
Abstract:
Temporal Sentence Grounding (TSG) aims to identify relevant moments in an untrimmed video that semantically correspond to a given textual query. Despite existing studies having made substantial progress, they often overlook the issue of spurious correlations between video and textual queries. These spurious correlations arise from two primary factors: (1) inherent biases in the textual data, such as frequent co-occurrences of specific verbs or phrases, and (2) the model's tendency to overfit to salient or repetitive patterns in video content. Such biases mislead the model into associating textual cues with incorrect visual moments, resulting in unreliable predictions and poor generalization to out-of-distribution examples. To overcome these limitations, we propose a novel TSG framework, causal intervention and counterfactual reasoning that utilizes causal inference to eliminate spurious correlations and enhance the model's robustness. Specifically, we first formulate the TSG task from a causal perspective with a structural causal model. Then, to address unobserved confounders reflecting textual biases toward specific verbs or phrases, a textual causal intervention is proposed, utilizing do-calculus to estimate the causal effects. Furthermore, visual counterfactual reasoning is performed by constructing a counterfactual scenario that focuses solely on video features, excluding the query and fused multi-modal features. This allows us to debias the model by isolating and removing the influence of the video from the overall effect. Experiments on public datasets demonstrate the superiority of the proposed method. The code is available at https://github.com/Tangkfan/CICR.
中文摘要:时序语句定位(TSG)旨在将视频片段与文本查询相匹配,但现有模型因文本偏见和视频过拟合而产生虚假关联,而本文提出的因果干预与反事实推理框架通过消除这些干扰因素显著提升了模型的泛化能力。
English Summary: Temporal Sentence Grounding (TSG) aims to match video moments with text queries, but current models suffer from spurious correlations caused by textual biases and video overfitting, which this new framework addresses through causal intervention and counterfactual reasoning to improve robustness.
Authors:Thinh Dao, Dung Thuy Nguyen, Khoa D Doan, Kok-Seng Wong
Abstract:
Federated Learning (FL) systems are vulnerable to backdoor attacks, where adversaries train their local models on poisoned data and submit poisoned model updates to compromise the global model. Despite numerous proposed attacks and defenses, divergent experimental settings, implementation errors, and unrealistic assumptions hinder fair comparisons and valid conclusions about their effectiveness in real-world scenarios. To address this, we introduce BackFed - a comprehensive benchmark suite designed to standardize, streamline, and reliably evaluate backdoor attacks and defenses in FL, with a focus on practical constraints. Our benchmark offers key advantages through its multi-processing implementation that significantly accelerates experimentation and the modular design that enables seamless integration of new methods via well-defined APIs. With a standardized evaluation pipeline, we envision BackFed as a plug-and-play environment for researchers to comprehensively and reliably evaluate new attacks and defenses. Using BackFed, we conduct large-scale studies of representative backdoor attacks and defenses across both Computer Vision and Natural Language Processing tasks with diverse model architectures and experimental settings. Our experiments critically assess the performance of proposed attacks and defenses, revealing unknown limitations and modes of failures under practical conditions. These empirical insights provide valuable guidance for the development of new methods and for enhancing the security of FL systems. Our framework is openly available at https://github.com/thinh-dao/BackFed.
中文摘要:BackFed基准套件标准化并加速了联邦学习中后门攻击与防御的评估,通过系统性实验揭示现有方法的局限性,为提升联邦学习安全性提供重要指导。
English Summary: The BackFed benchmark suite standardizes and accelerates the evaluation of backdoor attacks and defenses in Federated Learning, enabling comprehensive assessments that reveal limitations and guide future security improvements.
Authors:Johannes Künzel, Anna Hilsmann, Peter Eisert
Abstract:
We introduce RIPE, an innovative reinforcement learning-based framework for weakly-supervised training of a keypoint extractor that excels in both detection and description tasks. In contrast to conventional training regimes that depend heavily on artificial transformations, pre-generated models, or 3D data, RIPE requires only a binary label indicating whether paired images represent the same scene. This minimal supervision significantly expands the pool of training data, enabling the creation of a highly generalized and robust keypoint extractor.
RIPE utilizes the encoder's intermediate layers for the description of the keypoints with a hyper-column approach to integrate information from different scales. Additionally, we propose an auxiliary loss to enhance the discriminative capability of the learned descriptors.
Comprehensive evaluations on standard benchmarks demonstrate that RIPE simplifies data preparation while achieving competitive performance compared to state-of-the-art techniques, marking a significant advancement in robust keypoint extraction and description. To support further research, we have made our code publicly available at https://github.com/fraunhoferhhi/RIPE.
中文: RIPE是一种基于强化学习的框架,仅需二元场景标签即可弱监督训练关键点提取器,通过超列方法和辅助损失函数在检测与描述任务中实现优异性能,显著简化数据准备并提升鲁棒性。
English: RIPE is a reinforcement learning framework that trains a keypoint extractor with minimal supervision using only binary scene labels, achieving robust performance in detection and description tasks through a hyper-column approach and auxiliary loss.
Authors:Abiao Li, Chenlei Lv, Yuming Fang, Yifan Zuo, Jian Zhang, Guofeng Mei
Abstract:
Most masked point cloud modeling (MPM) methods follow a regression paradigm to reconstruct the coordinate or feature of masked regions. However, they tend to over-constrain the model to learn the details of the masked region, resulting in failure to capture generalized features. To address this limitation, we propose \textbf{\textit{PointGAC}}, a novel clustering-based MPM method that aims to align the feature distribution of masked regions. Specially, it features an online codebook-guided teacher-student framework. Firstly, it presents a geometry-aware partitioning strategy to extract initial patches. Then, the teacher model updates a codebook via online k-means based on features extracted from the complete patches. This procedure facilitates codebook vectors to become cluster centers. Afterward, we assigns the unmasked features to their corresponding cluster centers, and the student model aligns the assignment for the reconstructed masked features. This strategy focuses on identifying the cluster centers to which the masked features belong, enabling the model to learn more generalized feature representations. Benefiting from a proposed codebook maintenance mechanism, codebook vectors are actively updated, which further increases the efficiency of semantic feature learning. Experiments validate the effectiveness of the proposed method on various downstream tasks. Code is available at https://github.com/LAB123-tech/PointGAC
中文摘要:PointGAC提出了一种基于聚类的掩码点云建模方法,通过在线码本引导的师生框架对齐特征分布,避免了过度约束重建细节的问题,从而学习到更具泛化性的特征表示。
English Summary: PointGAC introduces a clustering-based masked point cloud modeling method that aligns feature distributions through an online codebook-guided teacher-student framework, enabling more generalized feature learning without over-constraining reconstruction details.
Authors:Josep Domingo-Ferrer, Najeeb Jebreel, David Sánchez
Abstract:
Privacy protection laws, such as the GDPR, grant individuals the right to request the forgetting of their personal data not only from databases but also from machine learning (ML) models trained on them. Machine unlearning has emerged as a practical means to facilitate model forgetting of data instances seen during training. Although some existing machine unlearning methods guarantee exact forgetting, they are typically costly in computational terms. On the other hand, more affordable methods do not offer forgetting guarantees and are applicable only to specific ML models. In this paper, we present \emph{efficient unlearning with privacy guarantees} (EUPG), a novel machine unlearning framework that offers formal privacy guarantees to individuals whose data are being unlearned. EUPG involves pre-training ML models on data protected using privacy models, and it enables {\em efficient unlearning with the privacy guarantees offered by the privacy models in use}. Through empirical evaluation on four heterogeneous data sets protected with $k$-anonymity and $ε$-differential privacy as privacy models, our approach demonstrates utility and forgetting effectiveness comparable to those of exact unlearning methods, while significantly reducing computational and storage costs. Our code is available at https://github.com/najeebjebreel/EUPG.
中文:EUPG框架通过使用隐私保护数据预训练模型,实现了具有正式隐私保障的高效机器遗忘,在保持与精确遗忘方法相当的效用和遗忘效果的同时,显著降低了计算和存储成本。
English: The EUPG framework enables efficient machine unlearning with formal privacy guarantees by pre-training models on privacy-protected data, achieving comparable utility and forgetting effectiveness to exact methods while significantly reducing computational and storage costs.
Authors:Seyedarmin Azizi, Erfan Baghaei Potraghloo, Massoud Pedram
Abstract:
Large language models (LLMs) excel at complex reasoning when they include intermediate steps, known as "chains of thought" (CoTs). However, these rationales are often overly verbose, even for simple problems, leading to wasted context, increased latency, and higher energy consumption. We observe that verbose, English-heavy CoTs and concise, math-centric CoTs occupy distinct regions in the model's residual-stream activation space. By extracting and injecting a "steering vector" to transition between these modes, we can reliably shift generation toward more concise reasoning, effectively compressing CoTs without retraining. We formalize this approach as Activation-Steered Compression (ASC), an inference-time technique that shortens reasoning traces by directly modifying hidden representations. In addition, we provide a theoretical analysis of the impact of ASC on the output distribution, derived from a closed-form KL-divergence-bounded constraint to regulate steering strength. Using only 100 paired verbose and concise examples, ASC achieves up to 67.43% reduction in CoT length on MATH500 and GSM8K datasets, while maintaining accuracy across 7B, 8B, and 32B parameter models. As a training-free method, ASC introduces negligible runtime overhead and, on MATH500, delivers an average 2.73x speedup in end-to-end reasoning wall-clock time on an 8B model. This makes ASC a practical and efficient tool for streamlining the deployment of reasoning-capable LLMs in latency- or cost-sensitive settings. The code is available at: https://github.com/ArminAzizi98/ASC
中文: 激活导向压缩(ASC)是一种无需训练的技术,通过修改隐藏表征来缩短大型语言模型中冗长的思维链,在保持准确性的同时显著减少推理长度并提高速度。
English: Activation-Steered Compression (ASC) is a training-free technique that shortens verbose chains of thought in large language models by modifying hidden representations, achieving significant length reduction and faster reasoning while maintaining accuracy.
Authors:Anbang Wang, Marawan Elbatel, Keyuan Liu, Lizhuo Lin, Meng Lan, Yanqi Yang, Xiaomeng Li
Abstract:
Accurate detection of anatomic landmarks is essential for assessing alveolar bone and root conditions, thereby optimizing clinical outcomes in orthodontics, periodontics, and implant dentistry. Manual annotation of landmarks on cone-beam computed tomography (CBCT) by dentists is time-consuming, labor-intensive, and subject to inter-observer variability. Deep learning-based automated methods present a promising approach to streamline this process efficiently. However, the scarcity of training data and the high cost of expert annotations hinder the adoption of conventional deep learning techniques. To overcome these challenges, we introduce GeoSapiens, a novel few-shot learning framework designed for robust dental landmark detection using limited annotated CBCT of anterior teeth. Our GeoSapiens framework comprises two key components: (1) a robust baseline adapted from Sapiens, a foundational model that has achieved state-of-the-art performance in human-centric vision tasks, and (2) a novel geometric loss function that improves the model's capacity to capture critical geometric relationships among anatomical structures. Experiments conducted on our collected dataset of anterior teeth landmarks revealed that GeoSapiens surpassed existing landmark detection methods, outperforming the leading approach by an 8.18% higher success detection rate at a strict 0.5 mm threshold-a standard widely recognized in dental diagnostics. Code is available at: https://github.com/xmed-lab/GeoSapiens.
Chinese: GeoSapiens是一种新颖的小样本学习框架,通过采用稳健的基线模型和几何损失函数,在CBCT扫描中提升了牙科标志点检测的准确性,以严格的0.5毫米阈值计算,其成功检测率比现有领先方法高出8.18%。
English: GeoSapiens is a novel few-shot learning framework that enhances dental landmark detection on CBCT scans by leveraging a robust baseline model and a geometric loss function, achieving an 8.18% higher success rate at a strict 0.5 mm threshold compared to leading methods.
Authors:Wanchang Yu, Qing Zhang, Rongjia Zheng, Wei-Shi Zheng
Abstract:
We present a diffusion-based portrait shadow removal approach that can robustly produce high-fidelity results. Unlike previous methods, we cast shadow removal as diffusion-based inpainting. To this end, we first train a shadow-independent structure extraction network on a real-world portrait dataset with various synthetic lighting conditions, which allows to generate a shadow-independent structure map including facial details while excluding the unwanted shadow boundaries. The structure map is then used as condition to train a structure-guided inpainting diffusion model for removing shadows in a generative manner. Finally, to restore the fine-scale details (e.g., eyelashes, moles and spots) that may not be captured by the structure map, we take the gradients inside the shadow regions as guidance and train a detail restoration diffusion model to refine the shadow removal result. Extensive experiments on the benchmark datasets show that our method clearly outperforms existing methods, and is effective to avoid previously common issues such as facial identity tampering, shadow residual, color distortion, structure blurring, and loss of details. Our code is available at https://github.com/wanchang-yu/Structure-Guided-Diffusion-for-Portrait-Shadow-Removal.
Chinese: 我们提出的基于扩散的肖像阴影去除方法,通过提取无阴影结构图并采用引导修复和细节恢复扩散模型,有效消除阴影且避免身份篡改、残留阴影等常见问题。
English: Our diffusion-based approach removes portrait shadows by first extracting a shadow-independent structure map and then using guided inpainting and detail restoration diffusion models to achieve high-fidelity results without common artifacts.
Authors:Changsong Lei, Yaqian Liang, Shaofeng Wang, Jiajia Dai, Yong-Jin Liu
Abstract:
Digital orthodontics represents a prominent and critical application of computer vision technology in the medical field. So far, the labor-intensive process of collecting clinical data, particularly in acquiring paired 3D orthodontic teeth models, constitutes a crucial bottleneck for developing tooth arrangement neural networks. Although numerous general 3D shape generation methods have been proposed, most of them focus on single-object generation and are insufficient for generating anatomically structured teeth models, each comprising 24-32 segmented teeth. In this paper, we propose TeethGenerator, a novel two-stage framework designed to synthesize paired 3D teeth models pre- and post-orthodontic, aiming to facilitate the training of downstream tooth arrangement networks. Specifically, our approach consists of two key modules: (1) a teeth shape generation module that leverages a diffusion model to learn the distribution of morphological characteristics of teeth, enabling the generation of diverse post-orthodontic teeth models; and (2) a teeth style generation module that synthesizes corresponding pre-orthodontic teeth models by incorporating desired styles as conditional inputs. Extensive qualitative and quantitative experiments demonstrate that our synthetic dataset aligns closely with the distribution of real orthodontic data, and promotes tooth alignment performance significantly when combined with real data for training. The code and dataset are available at https://github.com/lcshhh/teeth_generator.
中文摘要:数字正畸面临训练神经网络所需配对3D牙齿模型数据收集的瓶颈,而TeethGenerator通过两阶段框架生成治疗前后的配对3D牙齿模型,结合真实数据显著提升了牙齿排列性能。
English Summary: Digital orthodontics faces a bottleneck in data collection for training neural networks, which TeethGenerator addresses by using a two-stage framework to synthesize paired 3D teeth models before and after treatment, enhancing tooth alignment performance when combined with real data.
Authors:Maolin Wang, Tianshuo Wei, Sheng Zhang, Ruocheng Guo, Wanyu Wang, Shanshan Ye, Lixin Zou, Xuetao Wei, Xiangyu Zhao
Abstract:
Neural Architecture Search (NAS) has emerged as a powerful approach for automating neural network design. However, existing NAS methods face critical limitations in real-world deployments: architectures lack adaptability across scenarios, each deployment context requires costly separate searches, and performance consistency across diverse platforms remains challenging. We propose DANCE (Dynamic Architectures with Neural Continuous Evolution), which reformulates architecture search as a continuous evolution problem through learning distributions over architectural components. DANCE introduces three key innovations: a continuous architecture distribution enabling smooth adaptation, a unified architecture space with learned selection gates for efficient sampling, and a multi-stage training strategy for effective deployment optimization. Extensive experiments across five datasets demonstrate DANCE's effectiveness. Our method consistently outperforms state-of-the-art NAS approaches in terms of accuracy while significantly reducing search costs. Under varying computational constraints, DANCE maintains robust performance while smoothly adapting architectures to different hardware requirements. The code and appendix can be found at https://github.com/Applied-Machine-Learning-Lab/DANCE.
中文:DANCE通过将架构搜索转化为连续演化问题,实现了跨场景自适应的高效网络设计,在显著降低搜索成本的同时,在不同硬件约束下均能保持优异的性能表现。
English: DANCE introduces a continuous evolution approach to neural architecture search, enabling adaptive and efficient network design with reduced costs and robust performance across diverse scenarios and hardware constraints.
Authors:Maolin Wang, Yutian Xiao, Binhao Wang, Sheng Zhang, Shanshan Ye, Wanyu Wang, Hongzhi Yin, Ruocheng Guo, Zenglin Xu
Abstract:
Modern recommendation systems face significant challenges in processing multimodal sequential data, particularly in temporal dynamics modeling and information flow coordination. Traditional approaches struggle with distribution discrepancies between heterogeneous features and noise interference in multimodal signals. We propose \textbf{FindRec}~ (\textbf{F}lexible unified \textbf{in}formation \textbf{d}isentanglement for multi-modal sequential \textbf{Rec}ommendation), introducing a novel "information flow-control-output" paradigm. The framework features two key innovations: (1) A Stein kernel-based Integrated Information Coordination Module (IICM) that theoretically guarantees distribution consistency between multimodal features and ID streams, and (2) A cross-modal expert routing mechanism that adaptively filters and combines multimodal features based on their contextual relevance. Our approach leverages multi-head subspace decomposition for routing stability and RBF-Stein gradient for unbiased distribution alignment, enhanced by linear-complexity Mamba layers for efficient temporal modeling. Extensive experiments on three real-world datasets demonstrate FindRec's superior performance over state-of-the-art baselines, particularly in handling long sequences and noisy multimodal inputs. Our framework achieves both improved recommendation accuracy and enhanced model interpretability through its modular design. The implementation code is available anonymously online for easy reproducibility~\footnote{https://github.com/Applied-Machine-Learning-Lab/FindRec}.
中文总结:FindRec提出了一种灵活的多模态序列推荐框架,通过信息流控制输出范式,结合分布一致性保证和跨模态自适应路由机制,有效解决了多模态数据中的分布差异和噪声问题,显著提升了推荐准确性和可解释性。
English Summary: FindRec introduces a flexible framework for multimodal sequential recommendation, addressing distribution discrepancies and noise through an information flow-control-output paradigm with innovations in distribution alignment and adaptive feature routing, achieving superior accuracy and interpretability.
Authors:Tuan Dang, Manfred Huber
Abstract:
Navigation is a fundamental capacity for mobile robots, enabling them to operate autonomously in complex and dynamic environments. Conventional approaches use probabilistic models to localize robots and build maps simultaneously using sensor observations. Recent approaches employ human-inspired learning, such as imitation and reinforcement learning, to navigate robots more effectively. However, these methods suffer from high computational costs, global map inconsistency, and poor generalization to unseen environments. This paper presents a novel method inspired by how humans perceive and navigate themselves effectively in novel environments. Specifically, we first build local frames that mimic how humans represent essential spatial information in the short term. Points in local frames are hybrid representations, including spatial information and learned features, so-called spatial-implicit local frames. Then, we integrate spatial-implicit local frames into the global topological map represented as a factor graph. Lastly, we developed a novel navigation algorithm based on Rapid-Exploring Random Tree Star (RRT*) that leverages spatial-implicit local frames and the topological map to navigate effectively in environments. To validate our approach, we conduct extensive experiments in real-world datasets and in-lab environments. We open our source code at https://github.com/tuantdang/simn}{https://github.com/tuantdang/simn.
中文摘要:本文提出了一种新颖的机器人导航方法,通过模拟人类感知建立空间隐式局部框架并整合到全局拓扑图中,在真实世界和实验室环境中进行了广泛实验验证。
English Summary: This paper introduces a novel navigation method for mobile robots that mimics human perception by creating spatial-implicit local frames and integrating them into a global topological map, validated through extensive experiments in real-world and lab environments.
Authors:Daqi Huang, Zhehao Cai, Yuzhi Hao, Zechen Li, Chee-Meng Chew
Abstract:
Robust imitation learning for robot manipulation requires comprehensive 3D perception, yet many existing methods struggle in cluttered environments. Fixed camera view approaches are vulnerable to perspective changes, and 3D point cloud techniques often limit themselves to keyframes predictions, reducing their efficacy in dynamic, contact-intensive tasks. To address these challenges, we propose PRISM, designed as an end-to-end framework that directly learns from raw point cloud observations and robot states, eliminating the need for pretrained models or external datasets. PRISM comprises three main components: a segmentation embedding unit that partitions the raw point cloud into distinct object clusters and encodes local geometric details; a cross-attention component that merges these visual features with processed robot joint states to highlight relevant targets; and a diffusion module that translates the fused representation into smooth robot actions. With training on 100 demonstrations per task, PRISM surpasses both 2D and 3D baseline policies in accuracy and efficiency within our simulated environments, demonstrating strong robustness in complex, object-dense scenarios. Code and some demos are available on https://github.com/czknuaa/PRISM.
Chinese: PRISM 是一种端到端的模仿学习框架,通过处理原始点云和机器人状态生成流畅动作,在仅需每任务100次演示的情况下,于复杂密集场景中超越了现有方法的性能。
English: PRISM is an end-to-end imitation learning framework that processes raw point clouds and robot states to generate smooth actions, outperforming existing methods in cluttered environments with only 100 demonstrations per task.
Authors:Yun Wang, Longguang Wang, Chenghao Zhang, Yongjian Zhang, Zhanjie Zhang, Ao Ma, Chenyou Fan, Tin Lun Lam, Junjie Hu
Abstract:
Recently, learning-based stereo matching networks have advanced significantly. However, they often lack robustness and struggle to achieve impressive cross-domain performance due to domain shifts and imbalanced disparity distributions among diverse datasets. Leveraging Vision Foundation Models (VFMs) can intuitively enhance the model's robustness, but integrating such a model into stereo matching cost-effectively to fully realize their robustness remains a key challenge. To address this, we propose SMoEStereo, a novel framework that adapts VFMs for stereo matching through a tailored, scene-specific fusion of Low-Rank Adaptation (LoRA) and Mixture-of-Experts (MoE) modules. SMoEStereo introduces MoE-LoRA with adaptive ranks and MoE-Adapter with adaptive kernel sizes. The former dynamically selects optimal experts within MoE to adapt varying scenes across domains, while the latter injects inductive bias into frozen VFMs to improve geometric feature extraction. Importantly, to mitigate computational overhead, we further propose a lightweight decision network that selectively activates MoE modules based on input complexity, balancing efficiency with accuracy. Extensive experiments demonstrate that our method exhibits state-of-the-art cross-domain and joint generalization across multiple benchmarks without dataset-specific adaptation. The code is available at \textcolor{red}{https://github.com/cocowy1/SMoE-Stereo}.
中文:提出的SMoEStereo框架通过将视觉基础模型与自适应MoE-LoRA模块及轻量决策网络相结合,无需针对特定数据集调整即可实现最先进的跨域性能,显著提升了立体匹配的鲁棒性。
English: The proposed SMoEStereo framework enhances stereo matching robustness by integrating Vision Foundation Models with adaptive MoE-LoRA modules and a lightweight decision network, achieving state-of-the-art cross-domain performance without dataset-specific tuning.
Authors:Shengli Zhou, Yang Liu, Feng Zheng
Abstract:
3D Visual Question Answering (3D VQA) is crucial for enabling models to perceive the physical world and perform spatial reasoning. In 3D VQA, the free-form nature of answers often leads to improper annotations that can confuse or mislead models when training on the entire dataset. While other text generation tasks can mitigate this issue by learning on large-scale datasets, the scarcity of 3D scene data enlarges the negative effect of misleading annotations. Although active learning strategies can select valuable instances for training, they fail to identify and resolve misleading labels, which the oracle inevitably provides in practice. To address this issue, we propose a multi-turn interactive active learning strategy. This strategy selects data based on models' semantic uncertainty to form a solid knowledge foundation more effectively and actively requests reannotation from an oracle to resolve potentially misleading labels. For uncertainty assessment, we utilize a variance-based metric that takes semantic relationships between terms into consideration, thus avoiding the uniform inter-class similarity assumption of previous assessment metrics. Extensive experiments exhibit better model performance and a substantial reduction in training costs, with a halving of training costs for achieving relatively high accuracy. The code is available at https://github.com/fz-zsl/AQuA.
Chinese: 该研究提出了一种多轮交互式主动学习策略,通过基于语义不确定性选择数据并请求重新标注来解决误导性标签问题,显著提升了3D视觉问答的模型性能并减半了训练成本。
English: The study introduces a multi-turn interactive active learning strategy for 3D Visual Question Answering that selects data based on semantic uncertainty and requests reannotation to address misleading labels, significantly improving model performance and halving training costs.
Authors:Jinpeng Chen, Jianxiang He, Huan Li, Senzhang Wang, Yuan Cao, Kaimin Wei, Zhenye Yang, Ye Ji
Abstract:
Session-based Recommendation (SBR) aims to predict the next item a user will likely engage with, using their interaction sequence within an anonymous session. Existing SBR models often focus only on single-session information, ignoring inter-session relationships and valuable cross-session insights. Some methods try to include inter-session data but struggle with noise and irrelevant information, reducing performance. Additionally, most models rely on item ID co-occurrence and overlook rich semantic details, limiting their ability to capture fine-grained item features. To address these challenges, we propose a novel hierarchical intent-guided optimization approach with pluggable LLM-driven semantic learning for session-based recommendations, called HIPHOP. First, we introduce a pluggable embedding module based on large language models (LLMs) to generate high-quality semantic representations, enhancing item embeddings. Second, HIPHOP utilizes graph neural networks (GNNs) to model item transition relationships and incorporates a dynamic multi-intent capturing module to address users' diverse interests within a session. Additionally, we design a hierarchical inter-session similarity learning module, guided by user intent, to capture global and local session relationships, effectively exploring users' long-term and short-term interests. To mitigate noise, an intent-guided denoising strategy is applied during inter-session learning. Finally, we enhance the model's discriminative capability by using contrastive learning to optimize session representations. Experiments on multiple datasets show that HIPHOP significantly outperforms existing methods, demonstrating its effectiveness in improving recommendation quality. Our code is available: https://github.com/hjx159/HIPHOP.
中文: HIPHOP模型通过引入大语言模型语义嵌入、动态多意图捕捉及分层去噪的会话间学习,有效提升了基于会话的推荐性能,显著优于现有方法。
English: The proposed HIPHOP model enhances session-based recommendations by integrating LLM-driven semantic embeddings, dynamic multi-intent capturing, and hierarchical inter-session learning with denoising, significantly outperforming existing methods.
Authors:Mostafa Elhoushi, Jeff Johnson
Abstract:
We present any4, a learned 4-bit weight quantization solution for large language models (LLMs) providing arbitrary numeric representations without requiring pre-processing of weights or activations. any4 yields higher accuracy compared to other related 4-bit numeric representation types: int4, fp4 and nf4, as evaluated on a range of model sizes, generations and families (Llama 2, Llama 3, Mistral and Mixtral). While any4 does not require preprocessing of weights or activations, it is also competitive with orthogonal techniques that require such preprocessing (e.g., AWQ and GPTQ). We also experiment with any3 and any2 and show competitiveness at lower bits. Additionally, we show that we can calibrate using a single curated diverse sample rather than hundreds of samples from a dataset as done in most quantization approaches. We also open source tinygemm, a latency optimized GPU matrix multiplication library for LLMs, that implements any4 using a GPU-efficient lookup table strategy along with other common quantization methods. We open source our code at https://github.com/facebookresearch/any4 .
中文:any4是一种针对大语言模型的学习型4位量化方法,无需权重或激活预处理即可在不同模型上实现更高精度,同时提供了优化的GPU库和高效的单样本校准方案。
English: any4 is a learned 4-bit quantization method for LLMs that achieves superior accuracy across various models without requiring weight or activation preprocessing, while also introducing an optimized GPU library and efficient single-sample calibration.
Authors:Zien Wang, Xiucheng Wang, Nan Cheng, Wenchao Xu, Wei Quan, Ruijin Sun, Conghao Zhou
Abstract:
The exponential growth of multimedia data traffic in 6G networks poses unprecedented challenges for immersive communication, where ultra-high-definition, multi-quality streaming must be delivered on demand while minimizing network operational costs. Traditional routing approaches, such as shortest-path algorithms, fail to optimize flow multiplexing across multiple destinations, while conventional Steiner tree methods cannot accommodate heterogeneous quality-of-service (QoS) requirements-a critical need for 6G's personalized services. In this paper, we address a fundamental but unsolved challenge: the minimum flow problem (MFP) with multi-destination, heterogeneous outflow demands, which is pivotal for efficient multimedia distribution such as adaptive-resolution video streaming. To overcome the limitations of existing methods, we propose a two-stage dynamic programming-enhanced On-demand Steiner Tree (OST) algorithm, the first approach that jointly optimizes flow aggregation and QoS-aware path selection for arbitrary outflow requirements. We rigorously prove the optimality of OST using mathematical induction, demonstrating that it guarantees the minimum-cost multicast flow under differentiated service constraints. Extensive experiments in 6G-like multimedia transmission scenarios show that OST reduces total network flow by over 10% compared to state-of-the-art methods while ensuring on-demand QoS fulfillment. The complete code is available at https://github.com/UNIC-Lab/OST.
中文: 本文提出了一种动态规划增强的按需斯坦纳树(OST)算法,解决了6G网络中多目的地异构流出需求的最小流问题,在保证服务质量的同时将网络总流量降低了10%以上。
English: The paper introduces a dynamic programming-enhanced On-demand Steiner Tree (OST) algorithm to solve the minimum flow problem with multi-destination heterogeneous outflow demands in 6G networks, achieving over 10% network flow reduction while ensuring quality-of-service requirements.
Authors:Rushil Thareja, Preslav Nakov, Praneeth Vepakomma, Nils Lukas
Abstract:
Large language models (LLMs) can leak sensitive information from their context through generated outputs, either accidentally or when prompted adversarially. Existing defenses that aim to preserve context privacy during inference either lack formal guarantees or suffer from a poor utility/privacy trade-off. We propose DP-Fusion, a token-level Differentially Private Inference (DPI) mechanism that provably bounds how much an LLM's outputs reveal about sensitive tokens in its context. We demonstrate DPI through the task of document privatization, where the goal is to paraphrase documents so that sensitive content (e.g., Personally Identifiable Information, PII) cannot be reliably inferred, while still preserving the overall utility of the text. This is controlled by a parameter $ε$: $ε=0$ hides PII entirely, while higher values trade off privacy for improved paraphrase quality. DP-Fusion works as follows: (i) partition sensitive tokens into disjoint privacy groups, (ii) run the LLM once per group, and (iii) blend the output distributions so that the final output remains within a fixed statistical distance of the baseline distribution produced when no privacy group is revealed. This approach allows fine-grained control over the privacy/utility trade-off but requires multiple LLM forward passes.
中文: DP-Fusion是一种差分隐私推理机制,通过将敏感令牌分组并融合其输出分布来保护大语言模型中的敏感信息,以多次模型前向传递为代价实现可调节的隐私与效用平衡。
English: DP-Fusion is a differentially private inference mechanism that protects sensitive information in LLM outputs by partitioning tokens into privacy groups and blending their distributions, offering adjustable privacy-utility trade-offs through multiple model passes.
Authors:Xinhua Lu, Runhe Lai, Yanqi Wu, Kanghao Chen, Wei-Shi Zheng, Ruixuan Wang
Abstract:
Pre-trained vision-language models (VLMs) have advanced out-of-distribution (OOD) detection recently. However, existing CLIP-based methods often focus on learning OOD-related knowledge to improve OOD detection, showing limited generalization or reliance on external large-scale auxiliary datasets. In this study, instead of delving into the intricate OOD-related knowledge, we propose an innovative CLIP-based framework based on Forced prompt leArning (FA), designed to make full use of the In-Distribution (ID) knowledge and ultimately boost the effectiveness of OOD detection. Our key insight is to learn a prompt (i.e., forced prompt) that contains more diversified and richer descriptions of the ID classes beyond the textual semantics of class labels. Specifically, it promotes better discernment for ID images, by forcing more notable semantic similarity between ID images and the learnable forced prompt. Moreover, we introduce a forced coefficient, encouraging the forced prompt to learn more comprehensive and nuanced descriptions of the ID classes. In this way, FA is capable of achieving notable improvements in OOD detection, even when trained without any external auxiliary datasets, while maintaining an identical number of trainable parameters as CoOp. Extensive empirical evaluations confirm our method consistently outperforms current state-of-the-art methods. Code is available at https://github.com/0xFAFA/FA.
中文: 本研究提出了一种强制提示学习框架,通过利用多样化的类别描述来增强分布外检测,无需外部数据集即可实现卓越性能。
English: This study introduces a forced prompt learning framework that enhances out-of-distribution detection by leveraging in-distribution knowledge through diversified class descriptions, achieving superior performance without external datasets.
Authors:Xinhua Lu, Runhe Lai, Yanqi Wu, Kanghao Chen, Wei-Shi Zheng, Ruixuan Wang
Abstract:
Pre-trained vision-language models (VLMs) have advanced out-of-distribution (OOD) detection recently. However, existing CLIP-based methods often focus on learning OOD-related knowledge to improve OOD detection, showing limited generalization or reliance on external large-scale auxiliary datasets. In this study, instead of delving into the intricate OOD-related knowledge, we propose an innovative CLIP-based framework based on Forced prompt leArning (FA), designed to make full use of the In-Distribution (ID) knowledge and ultimately boost the effectiveness of OOD detection. Our key insight is to learn a prompt (i.e., forced prompt) that contains more diversified and richer descriptions of the ID classes beyond the textual semantics of class labels. Specifically, it promotes better discernment for ID images, by forcing more notable semantic similarity between ID images and the learnable forced prompt. Moreover, we introduce a forced coefficient, encouraging the forced prompt to learn more comprehensive and nuanced descriptions of the ID classes. In this way, FA is capable of achieving notable improvements in OOD detection, even when trained without any external auxiliary datasets, while maintaining an identical number of trainable parameters as CoOp. Extensive empirical evaluations confirm our method consistently outperforms current state-of-the-art methods. Code is available at https://github.com/0xFAFA/FA.
中文: 本研究提出了一种强制提示学习框架,通过利用多样化的类别描述来增强分布外检测,无需外部数据集即可实现卓越性能。
English: This study introduces a forced prompt learning framework that enhances out-of-distribution detection by leveraging in-distribution knowledge through diversified class descriptions, achieving superior performance without external datasets.
Authors:Yikang Zhao, Feng Gao, Xuepeng Jin, Junyu Dong, Qian Du
Abstract:
Multi-source data classification is a critical yet challenging task for remote sensing image interpretation. Existing methods lack adaptability to diverse land cover types when modeling frequency domain features. To this end, we propose a Dynamic Frequency Feature Fusion Network (DFFNet) for hyperspectral image (HSI) and Synthetic Aperture Radar (SAR) / Light Detection and Ranging (LiDAR) data joint classification. Specifically, we design a dynamic filter block to dynamically learn the filter kernels in the frequency domain by aggregating the input features. The frequency contextual knowledge is injected into frequency filter kernels. Additionally, we propose spectral-spatial adaptive fusion block for cross-modal feature fusion. It enhances the spectral and spatial attention weight interactions via channel shuffle operation, thereby providing comprehensive cross-modal feature fusion. Experiments on two benchmark datasets show that our DFFNet outperforms state-of-the-art methods in multi-source data classification. The codes will be made publicly available at https://github.com/oucailab/DFFNet.
Chinese: 提出的动态频率特征融合网络(DFFNet)通过动态学习频域滤波器和融合跨模态特征,显著提升了多源遥感数据的分类性能,在基准数据集上超越了现有先进方法。
English: The proposed Dynamic Frequency Feature Fusion Network (DFFNet) enhances multi-source remote sensing classification by dynamically learning frequency domain filters and integrating cross-modal features, demonstrating superior performance over existing methods.
Authors:Xujia Wang, Yunjia Qi, Bin Xu
Abstract:
Parameter-Efficient Fine-Tuning (PEFT) methods, such as LoRA, significantly reduce the number of trainable parameters by introducing low-rank decomposition matrices. However, existing methods perform extensive matrix multiplications in domain specialization tasks, resulting in computational inefficiency and sub-optimal fine-tuning performance. Hence, we propose LoSiA(Low-Resources Subnet Integration Adaptation), an innovative method that dynamically localizes and optimizes critical parameters during the training process. Specifically, it identifies a sub-network using gradient sparsity analysis and optimizes it as the trainable target. This design enables effective high-rank adaptation by updating only the sub-network parameters, reducing the additional matrix multiplication. We also present LoSiA-Pro, a faster implementation of LoSiA, which reduces the training latency by about $27\%$ compared to LoRA. Extensive evaluations show that our method achieves minimal performance drop compared to full fine-tuning, while requiring the least training time across domain specialization and common-sense reasoning tasks. Further analysis shows that LoSiA also reduces forgetting during continued training. The source code is available at https://github.com/KlozeWang/LoSiA.
中文摘要:LoSiA提出了一种创新的参数高效微调方法,通过梯度稀疏分析动态优化关键子网络,在降低计算成本和训练时间的同时实现了接近全参数微调的性能。
English Summary: LoSiA introduces a novel parameter-efficient fine-tuning approach that dynamically optimizes critical sub-networks through gradient sparsity analysis, achieving near-full fine-tuning performance with reduced computational cost and training time.
Authors:Ashish Bastola, Mert D. Pesé, Long Cheng, Jonathon Smereka, Abolfazl Razi
Abstract:
Anomaly detection plays a critical role in Autonomous Vehicles (AVs) by identifying unusual behaviors through perception systems that could compromise safety and lead to hazardous situations. Current approaches, which often rely on predefined thresholds or supervised learning paradigms, exhibit reduced efficacy when confronted with unseen scenarios, sensor noise, and occlusions, leading to potential safety-critical failures. Moreover, supervised methods require large annotated datasets, limiting their real-world feasibility. To address these gaps, we propose an anomaly detection framework based on Inverse Reinforcement Learning (IRL) to infer latent driving intentions from sequential perception data, thus enabling robust identification. Specifically, we present Trajectory-Reward Guided Adaptive Pre-training (TRAP), a novel IRL framework for anomaly detection, to address two critical limitations of existing methods: noise robustness and generalization to unseen scenarios. Our core innovation is implicitly learning temporal credit assignments via reward and worst-case supervision. We leverage pre-training with variable-horizon sampling to maximize time-to-consequence, resulting in early detection of behavior deviation. Experiments on 14,000+ simulated trajectories demonstrate state-of-the-art performance, achieving 0.90 AUC and 82.2\% F1-score - outperforming similarly trained supervised and unsupervised baselines by 39\% on Recall and 12\% on F1-score, respectively. Similar performance is achieved while exhibiting robustness to various noise types and generalization to unseen anomaly types. Our code will be available at: https://github.com/abastola0/TRAP.git
中文: 提出的TRAP框架利用逆向强化学习从感知数据中推断驾驶意图,实现了对自动驾驶汽车异常的鲁棒检测,以0.90的AUC值和更强的噪声适应性展现出卓越性能。
English: The proposed TRAP framework utilizes Inverse Reinforcement Learning to robustly detect anomalies in autonomous vehicles by inferring driving intentions from perception data, achieving superior performance with 0.90 AUC and enhanced noise resilience.
Authors:Feiyue Wu, Tianxing Wu, Shenqi Jing
Abstract:
Medication recommendation is a crucial task in healthcare, especially for patients with complex medical conditions. However, existing methods often struggle to effectively balance the reuse of historical medications with the introduction of new drugs in response to the changing patient conditions. In order to address this challenge, we propose an Adaptively Responsive network for Medication Recommendation (ARMR), a new method which incorporates 1) a piecewise temporal learning component that distinguishes between recent and distant patient history, enabling more nuanced temporal understanding, and 2) an adaptively responsive mechanism that dynamically adjusts attention to new and existing drugs based on the patient's current health state and medication history. Experiments on the MIMIC-III and MIMIC-IV datasets indicate that ARMR has better performance compared with the state-of-the-art baselines in different evaluation metrics, which contributes to more personalized and accurate medication recommendations. The source code is publicly avaiable at: https://github.com/seucoin/armr2.
中文: 提出的ARMR方法通过时序学习和自适应响应机制,动态平衡历史用药与新药引入,在标准数据集上展现出更优的个性化药物推荐性能。
English: The proposed ARMR method enhances medication recommendations by dynamically balancing historical and new drug considerations through temporal learning and adaptive mechanisms, demonstrating superior performance on benchmark datasets.
Authors:Zexin Deng, Zhenhui Yuan, Longhao Zou
Abstract:
Telerobotic technologies are becoming increasingly essential in fields such as remote surgery, nuclear decommissioning, and space exploration. Reliable datasets and testbeds are essential for evaluating telerobotic system performance prior to real-world deployment. However, there is a notable lack of datasets that capture the impact of network delays, as well as testbeds that realistically model the communication link between the operator and the robot. This paper introduces TeleSim, a network-aware teleoperation dataset and testbed designed to assess the performance of telerobotic applications under diverse network conditions. TeleSim systematically collects performance data from fine manipulation tasks executed under three predefined network quality tiers: High, Medium, and Low. Each tier is characterized through controlled settings of bandwidth, latency, jitter, and packet loss. Using OMNeT++ for precise network simulation, we record a wide range of metrics, including completion time, success rates, video quality indicators (Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM)), and quality of service (QoS) parameters. TeleSim comprises 300 experimental trials, providing a robust benchmark for evaluating teleoperation systems across heterogeneous network scenarios. In the worst network condition, completion time increases by 221.8% and success rate drops by 64%. Our findings reveal that network degradation leads to compounding negative impacts, notably reduced video quality and prolonged task execution, highlighting the need for adaptive, resilient teleoperation protocols. The full dataset and testbed software are publicly available on our GitHub repository: https://github.com/ConnectedRoboticsLab and YouTube channel: https://youtu.be/Fz_1iOYe104.
中文: 本文介绍了TeleSim这一网络感知数据集与测试平台,系统评估了不同网络条件下远程机器人操作的性能表现,揭示了网络延迟导致的严重性能下降,为开发适应性远程操作系统提供了基准。
English: This paper introduces TeleSim, a network-aware dataset and testbed that systematically evaluates telerobotic performance under varying network conditions, revealing significant performance degradation with network delays and providing a benchmark for developing resilient teleoperation protocols.
Authors:Xiuying Wei, Anunay Yadav, Razvan Pascanu, Caglar Gulcehre
Abstract:
Transformers have become the cornerstone of modern large-scale language models, but their reliance on softmax attention poses a computational bottleneck at both training and inference. Recurrent models offer high efficiency, but compressing the full sequence into a fixed-size and holistic representation suffers from memory degradation in long contexts and limits fine-grained retrieval. To address this, we propose RAT, an intermediate design that bridges the efficiency of RNNs and capacity of attention. RAT partitions the input into chunks, applies recurrence within each chunk for local dependencies, and softmax-based attention across chunks for long-range interactions. This design mitigates memory degradation and enables direct access to distant tokens, while retaining computational efficiency. Empirically, with a chunk size of 16, the RAT block achieves a 7x improvement in training speed with 100K token sequences and 9x in generation at the 4K position, while maintaining similar performance compared to standard attention. We demonstrate this by training 1.3B parameter models from scratch and performing large-scale evaluations, including short- and long-context benchmarks, as well as supervised fine-tuning~(SFT). We further propose a hybrid architecture that interleaves RAT with local attention. By combining efficient long-range modeling with strong local interactions, this hybrid design not only improves inference speed and reduces cache memory usage, but also consistently enhances performance and shows the overall best results. Code is available at https://github.com/CLAIRE-Labo/RAT.
中文: 提出的RAT模型通过分块内使用循环处理局部依赖、分块间使用注意力处理长程交互,结合了RNN的高效性和注意力的强大能力,在保持性能的同时显著提升了处理速度。
English: The proposed RAT model combines the efficiency of RNNs and the capacity of attention by using recurrence within chunks for local dependencies and softmax attention across chunks for long-range interactions, achieving significant speed improvements while maintaining performance.
Authors:You Zhou, Lijiang Chen, Guangxia Cui, Wenpei Bai, Yu Guo, Shuchang Lyu, Guangliang Cheng, Qi Zhao
Abstract:
Ovarian tumor, as a common gynecological disease, can rapidly deteriorate into serious health crises when undetected early, thus posing significant threats to the health of women. Deep neural networks have the potential to identify ovarian tumors, thereby reducing mortality rates, but limited public datasets hinder its progress. To address this gap, we introduce a vital ovarian tumor pathological recognition dataset called \textbf{ViTaL} that contains \textbf{V}isual, \textbf{T}abular and \textbf{L}inguistic modality data of 496 patients across six pathological categories. The ViTaL dataset comprises three subsets corresponding to different patient data modalities: visual data from 2216 two-dimensional ultrasound images, tabular data from medical examinations of 496 patients, and linguistic data from ultrasound reports of 496 patients. It is insufficient to merely distinguish between benign and malignant ovarian tumors in clinical practice. To enable multi-pathology classification of ovarian tumor, we propose a ViTaL-Net based on the Triplet Hierarchical Offset Attention Mechanism (THOAM) to minimize the loss incurred during feature fusion of multi-modal data. This mechanism could effectively enhance the relevance and complementarity between information from different modalities. ViTaL-Net serves as a benchmark for the task of multi-pathology, multi-modality classification of ovarian tumors. In our comprehensive experiments, the proposed method exhibited satisfactory performance, achieving accuracies exceeding 90\% on the two most common pathological types of ovarian tumor and an overall performance of 85\%. Our dataset and code are available at https://github.com/GGbond-study/vitalnet.
中文: 为解决卵巢肿瘤检测公共数据集不足的问题,本研究推出了ViTaL数据集,整合了496名患者的视觉、表格和语言数据,并提出基于三重分层偏移注意力机制的ViTaL-Net模型,在多病理分类中实现了高精度。
English: To address the limited public datasets for ovarian tumor detection, this study introduces the ViTaL dataset, which integrates visual, tabular, and linguistic data from 496 patients, and proposes ViTaL-Net, a model using a Triplet Hierarchical Offset Attention Mechanism to achieve high accuracy in multi-pathology classification.
Authors:Zhipeng Li, Kegang Wang, Hanguang Xiao, Xingyue Liu, Feizhong Zhou, Jiaxin Jiang, Tianqi Liu
Abstract:
Remote photoplethysmography (rPPG) is a non-contact technique for measuring human physiological signals. Due to its convenience and non-invasiveness, it has demonstrated broad application potential in areas such as health monitoring and emotion recognition. In recent years, the release of numerous public datasets has significantly advanced the performance of rPPG algorithms under ideal lighting conditions. However, the effectiveness of current rPPG methods in realistic nighttime scenarios with dynamic lighting variations remains largely unknown. Moreover, there is a severe lack of datasets specifically designed for such challenging environments, which has substantially hindered progress in this area of research. To address this gap, we present and release a large-scale rPPG dataset collected under dynamic lighting conditions at night, named DLCN. The dataset comprises approximately 13 hours of video data and corresponding synchronized physiological signals from 98 participants, covering four representative nighttime lighting scenarios. DLCN offers high diversity and realism, making it a valuable resource for evaluating algorithm robustness in complex conditions. Built upon the proposed Happy-rPPG Toolkit, we conduct extensive experiments and provide a comprehensive analysis of the challenges faced by state-of-the-art rPPG methods when applied to DLCN. The dataset and code are publicly available at https://github.com/dalaoplan/Happp-rPPG-Toolkit.
Chinese: 远程光电容积描记术(rPPG)是一种有前景的非接触式生理监测方法,但由于缺乏专门的数据集,其在动态夜间条件下的有效性尚不明确,因此我们发布了DLCN数据集以评估算法的鲁棒性。
English: Remote photoplethysmography (rPPG) is a promising non-contact method for physiological monitoring, but its effectiveness in dynamic nighttime conditions remains unclear due to a lack of specialized datasets, prompting the release of the DLCN dataset to evaluate algorithm robustness.
Authors:Roy Uziel, Irit Chelly, Oren Freifeld, Ari Pakman
Abstract:
Diffusion models, widely recognized for their success in generative tasks, have not yet been applied to clustering. We introduce Clustering via Diffusion (CLUDI), a self-supervised framework that combines the generative power of diffusion models with pre-trained Vision Transformer features to achieve robust and accurate clustering. CLUDI is trained via a teacher-student paradigm: the teacher uses stochastic diffusion-based sampling to produce diverse cluster assignments, which the student refines into stable predictions. This stochasticity acts as a novel data augmentation strategy, enabling CLUDI to uncover intricate structures in high-dimensional data. Extensive evaluations on challenging datasets demonstrate that CLUDI achieves state-of-the-art performance in unsupervised classification, setting new benchmarks in clustering robustness and adaptability to complex data distributions. Our code is available at https://github.com/BGU-CS-VIL/CLUDI.
中文: CLUDI是一种自监督聚类框架,通过教师-学生范式结合扩散模型和视觉Transformer特征,在无监督分类中实现了最先进的性能,能够揭示复杂的数据结构。
English: CLUDI is a self-supervised clustering framework that leverages diffusion models and Vision Transformer features through a teacher-student paradigm, achieving state-of-the-art performance in unsupervised classification by uncovering complex data structures.
Authors:Gwok-Waa Wan, Shengchu Su, Ruihu Wang, Qixiang Chen, Sam-Zaak Wong, Mengnv Xing, Hefei Feng, Yubo Wang, Yinan Zhu, Jingyi Zhang, Jianmin Ye, Xinlai Wan, Tao Ni, Qiang Xu, Nan Guan, Zhe Jiang, Xi Wang, Yang Jun
Abstract:
Despite the transformative potential of Large Language Models (LLMs) in hardware design, a comprehensive evaluation of their capabilities in design verification remains underexplored. Current efforts predominantly focus on RTL generation and basic debugging, overlooking the critical domain of functional verification, which is the primary bottleneck in modern design methodologies due to the rapid escalation of hardware complexity. We present FIXME, the first end-to-end, multi-model, and open-source evaluation framework for assessing LLM performance in hardware functional verification (FV) to address this crucial gap. FIXME introduces a structured three-level difficulty hierarchy spanning six verification sub-domains and 180 diverse tasks, enabling in-depth analysis across the design lifecycle. Leveraging a collaborative AI-human approach, we construct a high-quality dataset using 100% silicon-proven designs, ensuring comprehensive coverage of real-world challenges. Furthermore, we enhance the functional coverage by 45.57% through expert-guided optimization. By rigorously evaluating state-of-the-art LLMs such as GPT-4, Claude3, and LlaMA3, we identify key areas for improvement and outline promising research directions to unlock the full potential of LLM-driven automation in hardware design verification. The benchmark is available at https://github.com/ChatDesignVerification/FIXME.
中文: 本文提出首个开源大语言模型硬件功能验证评估框架FIXME,通过构建多层级任务体系和专家优化的数据集显著提升功能覆盖率,并指明模型改进的关键方向。
English: This paper introduces FIXME, the first open-source framework for evaluating LLMs in hardware functional verification, featuring a multi-level task hierarchy and expert-optimized dataset that significantly improves coverage while identifying key areas for model enhancement.
Authors:Mohammadreza Sharifi, Ahad Harati
Abstract:
Effective data curation is essential for optimizing neural network training. In this paper, we present the Guided Spectrally Tuned Data Selection (GSTDS) algorithm, which dynamically adjusts the subset of data points used for training using an off-the-shelf pre-trained reference model. Based on a pre-scheduled filtering ratio, GSTDS effectively reduces the number of data points processed per batch. The proposed method ensures an efficient selection of the most informative data points for training while avoiding redundant or less beneficial computations. Preserving data points in each batch is performed based on spectral analysis. A Fiedler vector-based scoring mechanism removes the filtered portion of the batch, lightening the resource requirements of the learning. The proposed data selection approach not only streamlines the training process but also promotes improved generalization and accuracy. Extensive experiments on standard image classification benchmarks, including CIFAR-10, Oxford-IIIT Pet, and Oxford-Flowers, demonstrate that GSTDS outperforms standard training scenarios and JEST, a recent state-of-the-art data curation method, on several key factors. It is shown that GSTDS achieves notable reductions in computational requirements, up to four times, without compromising performance. GSTDS exhibits a considerable growth in terms of accuracy under the limited computational resource usage, in contrast to other methodologies. These promising results underscore the potential of spectral-based data selection as a scalable solution for resource-efficient deep learning and motivate further exploration into adaptive data curation strategies. You can find the code at https://github.com/rezasharifi82/GSTDS.
中文: GSTDS算法通过谱分析动态选择神经网络训练中最具信息量的数据点,在多个基准测试中将计算需求显著降低达四倍,同时提高了准确性和泛化能力。
English: The GSTDS algorithm dynamically selects the most informative data points for neural network training using spectral analysis, significantly reducing computational requirements by up to four times while improving accuracy and generalization across multiple benchmarks.
Authors:Liwen Xiao, Zhiyu Pan, Zhicheng Wang, Zhiguo Cao, Wei Li
Abstract:
Accurate prediction of multi-agent future trajectories is crucial for autonomous driving systems to make safe and efficient decisions. Trajectory refinement has emerged as a key strategy to enhance prediction accuracy. However, existing refinement methods often overlook the topological relationships between trajectories, which are vital for improving prediction precision. Inspired by braid theory, we propose a novel trajectory refinement approach, Soft-Braid Refiner (SRefiner), guided by the soft-braid topological structure of trajectories using Soft-Braid Attention. Soft-Braid Attention captures spatio-temporal topological relationships between trajectories by considering both spatial proximity and vehicle motion states at ``soft intersection points". Additionally, we extend this approach to model interactions between trajectories and lanes, further improving the prediction accuracy. SRefiner is a multi-iteration, multi-agent framework that iteratively refines trajectories, incorporating topological information to enhance interactions within traffic scenarios. SRefiner achieves significant performance improvements over four baseline methods across two datasets, establishing a new state-of-the-art in trajectory refinement. Code is here https://github.com/Liwen-Xiao/SRefiner.
中文: 提出的Soft-Braid Refiner(SRefiner)通过Soft-Braid注意力捕捉轨迹间的时空拓扑关系,显著提升了多智能体轨迹预测精度,在多个数据集上实现了最先进的性能。
English: The proposed Soft-Braid Refiner (SRefiner) enhances multi-agent trajectory prediction accuracy by capturing spatio-temporal topological relationships through Soft-Braid Attention, achieving state-of-the-art performance across multiple datasets.
Authors:Qiang Heng, Caixing Wang
Abstract:
First-order methods in convex optimization offer low per-iteration cost but often suffer from slow convergence, while second-order methods achieve fast local convergence at the expense of costly Hessian inversions. In this paper, we highlight a middle ground: minimizing a quadratic majorant with fixed curvature at each iteration. This strategy strikes a balance between per-iteration cost and convergence speed, and crucially allows the reuse of matrix decompositions, such as Cholesky or spectral decompositions, across iterations and varying regularization parameters. We introduce the Quadratic Majorization Minimization with Extrapolation (QMME) framework and establish its sequential convergence properties under standard assumptions. The new perspective of our analysis is to center the arguments around the induced norm of the curvature matrix $H$. To demonstrate practical advantages, we apply QMME to large-scale kernel regularized learning problems. In particular, we propose a novel Sylvester equation modelling technique for kernel multinomial regression. In Julia-based experiments, QMME compares favorably against various established first- and second-order methods. Furthermore, we demonstrate that our algorithms complement existing kernel approximation techniques through more efficiently handling sketching matrices with large projection dimensions. Our numerical experiments and real data analysis are available and fully reproducible at https://github.com/qhengncsu/QMME.jl.
Chinese: 本文提出外推二次主化最小化(QMME)框架,通过固定曲率最小化二次主化函数,在单步计算成本与收敛速度间取得平衡,并支持跨迭代和正则化参数的矩阵分解重用。
English: This paper introduces the Quadratic Majorization Minimization with Extrapolation (QMME) framework, which balances low per-iteration cost and fast convergence by minimizing quadratic majorants with fixed curvature, enabling efficient reuse of matrix decompositions across iterations and regularization parameters.
Authors:Xinbo Wang, Wenju Xu, Qing Zhang, Wei-Shi Zheng
Abstract:
This paper presents a portrait style transfer method that generalizes well to various different domains while enabling high-quality semantic-aligned stylization on regions including hair, eyes, eyelashes, skins, lips, and background. To this end, we propose to establish dense semantic correspondence between the given input and reference portraits based on a pre-trained model and a semantic adapter, with which we obtain a warped reference semantically aligned with the input. To ensure effective yet controllable style transfer, we devise an AdaIN-Wavelet transform to balance content preservation and stylization by blending low-frequency information of the warped reference with high-frequency information of the input in the latent space. A style adapter is also designed to provide style guidance from the warped reference. With the stylized latent from AdaIN-Wavelet transform, we employ a dual-conditional diffusion model that integrates a ControlNet recording high-frequency information and the style guidance to generate the final result. Extensive experiments demonstrate the superiority of our method. Our code and trained model are available at https://github.com/wangxb29/DGPST.
中文: 本文提出一种人像风格迁移方法,通过建立密集语义对应和使用AdaIN小波变换,在多个面部区域实现高质量语义对齐的风格化,同时有效平衡内容保持与风格化效果。
English: This paper introduces a portrait style transfer method that achieves high-quality, semantically aligned stylization across multiple facial regions by establishing dense semantic correspondence and using an AdaIN-Wavelet transform to balance content preservation with style application.
Authors:Md Rashidunnabi, Fahmida Faiza Ananna, Kailash Hambarde, Bruno Gabriel Nascimento Andrade, Dean Venables, Hugo Proenca
Abstract:
Air pollution poses a critical health threat in cities worldwide, with nitrogen dioxide levels in Cork, Ireland exceeding World Health Organization safety standards by up to $278\%$. This study leverages artificial intelligence to predict air pollution with unprecedented accuracy, analyzing nearly ten years of data from five monitoring stations combined with 30 years of weather records. We evaluated 17 machine learning algorithms, with Extra Trees emerging as the optimal solution, achieving $77\%$ prediction accuracy and significantly outperforming traditional forecasting methods. Our analysis reveals that meteorological conditions particularly temperature, wind speed, and humidity are the primary drivers of pollution levels, while traffic patterns and seasonal changes create predictable pollution cycles. Pollution exhibits dramatic seasonal variations, with winter levels nearly double those of summer, and daily rush-hour peaks reaching $120\%$ above normal levels. While Cork's air quality shows concerning violations of global health standards, our models detected an encouraging $31\%$ improvement from 2014 to 2022. This research demonstrates that intelligent forecasting systems can provide city planners and environmental officials with powerful prediction tools, enabling life-saving early warning systems and informed urban planning decisions. The technology exists today to transform urban air quality management. All research materials and code are freely available at: https://github.com/MdRashidunnabi/Air-Pollution-Analysis.git
中文摘要:本研究利用人工智能精确预测爱尔兰科克市的空气污染,确定气象条件为主要驱动因素,并显示2014至2022年间空气质量改善31%,为城市规划提供了重要工具。
English Summary: This study uses AI to accurately predict air pollution in Cork, Ireland, identifying weather conditions as key drivers and revealing a 31% air quality improvement from 2014 to 2022, providing vital tools for urban planning.
Authors:Costas Mavromatis, Soji Adeshina, Vassilis N. Ioannidis, Zhen Han, Qi Zhu, Ian Robinson, Bryan Thompson, Huzefa Rangwala, George Karypis
Abstract:
Knowledge graph question answering (KGQA) presents significant challenges due to the structural and semantic variations across input graphs. Existing works rely on Large Language Model (LLM) agents for graph traversal and retrieval; an approach that is sensitive to traversal initialization, as it is prone to entity linking errors and may not generalize well to custom ("bring-your-own") KGs. We introduce BYOKG-RAG, a framework that enhances KGQA by synergistically combining LLMs with specialized graph retrieval tools. In BYOKG-RAG, LLMs generate critical graph artifacts (question entities, candidate answers, reasoning paths, and OpenCypher queries), and graph tools link these artifacts to the KG and retrieve relevant graph context. The retrieved context enables the LLM to iteratively refine its graph linking and retrieval, before final answer generation. By retrieving context from different graph tools, BYOKG-RAG offers a more general and robust solution for QA over custom KGs. Through experiments on five benchmarks spanning diverse KG types, we demonstrate that BYOKG-RAG outperforms the second-best graph retrieval method by 4.5% points while showing better generalization to custom KGs. BYOKG-RAG framework is open-sourced at https://github.com/awslabs/graphrag-toolkit.
Chinese: BYOKG-RAG框架通过结合大语言模型与图检索工具来生成和优化图谱元素,提供了一个更稳健的解决方案,在自定义知识图谱上表现优于现有方法4.5%,并展现出更好的泛化能力。
English: BYOKG-RAG enhances KGQA by integrating LLMs with graph tools to generate and refine graph artifacts, offering a robust solution that outperforms existing methods by 4.5% and generalizes better to custom knowledge graphs.
Authors:Linshen Liu, Boyan Su, Junyue Jiang, Guanlin Wu, Cong Guo, Ceyu Xu, Hao Frank Yang
Abstract:
This paper presents Edge-based Mixture of Experts (MoE) Collaborative Computing (EMC2), an optimal computing system designed for autonomous vehicles (AVs) that simultaneously achieves low-latency and high-accuracy 3D object detection. Unlike conventional approaches, EMC2 incorporates a scenario-aware MoE architecture specifically optimized for edge platforms. By effectively fusing LiDAR and camera data, the system leverages the complementary strengths of sparse 3D point clouds and dense 2D images to generate robust multimodal representations. To enable this, EMC2 employs an adaptive multimodal data bridge that performs multi-scale preprocessing on sensor inputs, followed by a scenario-aware routing mechanism that dynamically dispatches features to dedicated expert models based on object visibility and distance. In addition, EMC2 integrates joint hardware-software optimizations, including hardware resource utilization optimization and computational graph simplification, to ensure efficient and real-time inference on resource-constrained edge devices. Experiments on open-source benchmarks clearly show the EMC2 advancements as an end-to-end system. On the KITTI dataset, it achieves an average accuracy improvement of 3.58% and a 159.06% inference speedup compared to 15 baseline methods on Jetson platforms, with similar performance gains on the nuScenes dataset, highlighting its capability to advance reliable, real-time 3D object detection tasks for AVs. The official implementation is available at https://github.com/LinshenLiu622/EMC2.
中文: EMC2是一种面向自动驾驶车辆的边缘优化计算系统,通过场景感知专家混合架构融合激光雷达与相机数据,结合软硬件协同优化,实现了高精度三维物体检测与低延迟性能。
English: EMC2 is an edge-optimized computing system for autonomous vehicles that achieves high-accuracy 3D object detection with low latency by fusing LiDAR and camera data through a scenario-aware mixture of experts architecture and joint hardware-software optimizations.
Authors:Ziming Hong, Runnan Chen, Zengmao Wang, Bo Han, Bo Du, Tongliang Liu
Abstract:
Data-free knowledge distillation (DFKD) transfers knowledge from a teacher to a student without access the real in-distribution (ID) data. Its common solution is to use a generator to synthesize fake data and use them as a substitute for real ID data. However, existing works typically assume teachers are trustworthy, leaving the robustness and security of DFKD from untrusted teachers largely unexplored. In this work, we conduct the first investigation into distilling non-transferable learning (NTL) teachers using DFKD, where the transferability from an ID domain to an out-of-distribution (OOD) domain is prohibited. We find that NTL teachers fool DFKD through divert the generator's attention from the useful ID knowledge to the misleading OOD knowledge. This hinders ID knowledge transfer but prioritizes OOD knowledge transfer. To mitigate this issue, we propose Adversarial Trap Escaping (ATEsc) to benefit DFKD by identifying and filtering out OOD-like synthetic samples. Specifically, inspired by the evidence that NTL teachers show stronger adversarial robustness on OOD samples than ID samples, we split synthetic samples into two groups according to their robustness. The fragile group is treated as ID-like data and used for normal knowledge distillation, while the robust group is seen as OOD-like data and utilized for forgetting OOD knowledge. Extensive experiments demonstrate the effectiveness of ATEsc for improving DFKD against NTL teachers. Code is released at https://github.com/tmllab/2025_ICML_ATEsc.
中文: 本研究提出对抗性陷阱逃逸方法,通过识别和过滤分布外合成样本,有效应对不可迁移学习教师,提升无数据知识蒸馏的鲁棒性和知识传递效果。
English: This study introduces Adversarial Trap Escaping (ATEsc) to enhance data-free knowledge distillation by identifying and filtering out-of-distribution synthetic samples, effectively countering non-transferable learning teachers and improving knowledge transfer robustness.
Authors:Wenyang Liu, Chen Cai, Jianjun Gao, Kejun Wu, Yi Wang, Kim-Hui Yap, Lap-Pui Chau
Abstract:
Although the lightweight Vision Transformer has significantly advanced image super-resolution (SR), it faces the inherent challenge of a limited receptive field due to the window-based self-attention modeling. The quadratic computational complexity relative to window size restricts its ability to use a large window size for expanding the receptive field while maintaining low computational costs. To address this challenge, we propose PromptSR, a novel prompt-empowered lightweight image SR method. The core component is the proposed cascade prompting block (CPB), which enhances global information access and local refinement via three cascaded prompting layers: a global anchor prompting layer (GAPL) and two local prompting layers (LPLs). The GAPL leverages downscaled features as anchors to construct low-dimensional anchor prompts (APs) through cross-scale attention, significantly reducing computational costs. These APs, with enhanced global perception, are then used to provide global prompts, efficiently facilitating long-range token connections. The two LPLs subsequently combine category-based self-attention and window-based self-attention to refine the representation in a coarse-to-fine manner. They leverage attention maps from the GAPL as additional global prompts, enabling them to perceive features globally at different granularities for adaptive local refinement. In this way, the proposed CPB effectively combines global priors and local details, significantly enlarging the receptive field while maintaining the low computational costs of our PromptSR. The experimental results demonstrate the superiority of our method, which outperforms state-of-the-art lightweight SR methods in quantitative, qualitative, and complexity evaluations. Our code will be released at https://github.com/wenyang001/PromptSR.
Chinese: PromptSR提出了一种级联提示块,通过全局锚点提示和局部提示层在保持低计算成本的同时有效扩大感受野,在定量、定性和复杂度评估上均优于现有轻量级图像超分辨率方法。
English: PromptSR introduces a cascade prompting block that uses global anchor prompts and local prompting layers to expand the receptive field efficiently while maintaining low computational costs, outperforming existing lightweight image super-resolution methods.
Authors:Xiaohan Zhang, Tavis Shore, Chen Chen, Oscar Mendez, Simon Hadfield, Safwan Wshah
Abstract:
In this paper, we present a high-performing solution to the UAVM 2025 Challenge, which focuses on matching narrow FOV street-level images to corresponding satellite imagery using the University-1652 dataset. As panoramic Cross-View Geo-Localisation nears peak performance, it becomes increasingly important to explore more practical problem formulations. Real-world scenarios rarely offer panoramic street-level queries; instead, queries typically consist of limited-FOV images captured with unknown camera parameters. Our work prioritises discovering the highest achievable performance under these constraints, pushing the limits of existing architectures. Our method begins by retrieving candidate satellite image embeddings for a given query, followed by a re-ranking stage that selectively enhances retrieval accuracy within the top candidates. This two-stage approach enables more precise matching, even under the significant viewpoint and scale variations inherent in the task. Through experimentation, we demonstrate that our approach achieves competitive results -specifically attaining R@1 and R@10 retrieval rates of \topone\% and \topten\% respectively. This underscores the potential of optimised retrieval and re-ranking strategies in advancing practical geo-localisation performance. Code is available at https://github.com/tavisshore/VICI.
中文: 本文提出了一种两阶段方法,通过优化的嵌入检索和选择性重排序,实现了窄视场街景图像与卫星图像的精确匹配,获得了具有竞争力的检索率。
English: This paper introduces a two-stage method for matching narrow field-of-view street images to satellite imagery, achieving competitive retrieval rates through optimized embedding retrieval and selective re-ranking.
Authors:StanisÅaw Pawlak, BartÅomiej Twardowski, Tomasz TrzciÅski, Joost van de Weijer
Abstract:
Our research addresses the overlooked security concerns related to data poisoning in continual learning (CL). Data poisoning - the intentional manipulation of training data to affect the predictions of machine learning models - was recently shown to be a threat to CL training stability. While existing literature predominantly addresses scenario-dependent attacks, we propose to focus on a more simple and realistic single-task poison (STP) threats. In contrast to previously proposed poisoning settings, in STP adversaries lack knowledge and access to the model, as well as to both previous and future tasks. During an attack, they only have access to the current task within the data stream. Our study demonstrates that even within these stringent conditions, adversaries can compromise model performance using standard image corruptions. We show that STP attacks are able to strongly disrupt the whole continual training process: decreasing both the stability (its performance on past tasks) and plasticity (capacity to adapt to new tasks) of the algorithm. Finally, we propose a high-level defense framework for CL along with a poison task detection method based on task vectors. The code is available at https://github.com/stapaw/STP.git .
中文摘要:本研究揭示了持续学习中单任务数据投毒的威胁,攻击者仅利用简单图像干扰即可破坏模型稳定性与可塑性,并提出了包含投毒检测的防御框架。
English Summary: This study highlights the threat of single-task poisoning in continual learning, where adversaries can degrade model stability and plasticity using simple image corruptions, and proposes a defense framework with poison detection methods.
Authors:Hanghui Guo, Weijie Shi, Mengze Li, Juncheng Li, Hao Chen, Yue Cui, Jiajie Xu, Jia Zhu, Jiawei Shen, Zhangze Chen, Sirui Han
Abstract:
Short-video misinformation detection has attracted wide attention in the multi-modal domain, aiming to accurately identify the misinformation in the video format accompanied by the corresponding audio. Despite significant advancements, current models in this field, trained on particular domains (source domains), often exhibit unsatisfactory performance on unseen domains (target domains) due to domain gaps. To effectively realize such domain generalization on the short-video misinformation detection task, we propose deep insights into the characteristics of different domains: (1) The detection on various domains may mainly rely on different modalities (i.e., mainly focusing on videos or audios). To enhance domain generalization, it is crucial to achieve optimal model performance on all modalities simultaneously. (2) For some domains focusing on cross-modal joint fraud, a comprehensive analysis relying on cross-modal fusion is necessary. However, domain biases located in each modality (especially in each frame of videos) will be accumulated in this fusion process, which may seriously damage the final identification of misinformation. To address these issues, we propose a new DOmain generalization model via ConsisTency and invariance learning for shORt-video misinformation detection (named DOCTOR), which contains two characteristic modules: (1) We involve the cross-modal feature interpolation to map multiple modalities into a shared space and the interpolation distillation to synchronize multi-modal learning; (2) We design the diffusion model to add noise to retain core features of multi modal and enhance domain invariant features through cross-modal guided denoising. Extensive experiments demonstrate the effectiveness of our proposed DOCTOR model. Our code is public available at https://github.com/ghh1125/DOCTOR.
中文: 该摘要提出了DOCTOR模型,通过跨模态一致性学习和基于扩散的去噪方法,在共享空间中同步多模态特征并强化域不变特征,有效解决了短视频虚假信息检测中因领域差异导致的泛化性能下降问题。
English: This abstract introduces DOCTOR, a novel domain generalization model for short-video misinformation detection that employs cross-modal consistency learning and diffusion-based denoising to overcome performance degradation across unseen domains by synchronizing multi-modal features and enhancing domain-invariant representations.
Authors:Hanghui Guo, Weijie Shi, Mengze Li, Juncheng Li, Hao Chen, Yue Cui, Jiajie Xu, Jia Zhu, Jiawei Shen, Zhangze Chen, Sirui Han
Abstract:
Short-video misinformation detection has attracted wide attention in the multi-modal domain, aiming to accurately identify the misinformation in the video format accompanied by the corresponding audio. Despite significant advancements, current models in this field, trained on particular domains (source domains), often exhibit unsatisfactory performance on unseen domains (target domains) due to domain gaps. To effectively realize such domain generalization on the short-video misinformation detection task, we propose deep insights into the characteristics of different domains: (1) The detection on various domains may mainly rely on different modalities (i.e., mainly focusing on videos or audios). To enhance domain generalization, it is crucial to achieve optimal model performance on all modalities simultaneously. (2) For some domains focusing on cross-modal joint fraud, a comprehensive analysis relying on cross-modal fusion is necessary. However, domain biases located in each modality (especially in each frame of videos) will be accumulated in this fusion process, which may seriously damage the final identification of misinformation. To address these issues, we propose a new DOmain generalization model via ConsisTency and invariance learning for shORt-video misinformation detection (named DOCTOR), which contains two characteristic modules: (1) We involve the cross-modal feature interpolation to map multiple modalities into a shared space and the interpolation distillation to synchronize multi-modal learning; (2) We design the diffusion model to add noise to retain core features of multi modal and enhance domain invariant features through cross-modal guided denoising. Extensive experiments demonstrate the effectiveness of our proposed DOCTOR model. Our code is public available at https://github.com/ghh1125/DOCTOR.
中文: 该摘要提出了DOCTOR模型,通过跨模态一致性学习和基于扩散的去噪方法,在共享空间中同步多模态特征并强化域不变特征,有效解决了短视频虚假信息检测中因领域差异导致的泛化性能下降问题。
English: This abstract introduces DOCTOR, a novel domain generalization model for short-video misinformation detection that employs cross-modal consistency learning and diffusion-based denoising to overcome performance degradation across unseen domains by synchronizing multi-modal features and enhancing domain-invariant representations.
Authors:Jianwei Tang, Jiangxin Sun, Xiaotong Lin, Lifang Zhang, Wei-Shi Zheng, Jian-Fang Hu
Abstract:
Human Motion Prediction (HMP) aims to predict future poses at different moments according to past motion sequences. Previous approaches have treated the prediction of various moments equally, resulting in two main limitations: the learning of short-term predictions is hindered by the focus on long-term predictions, and the incorporation of prior information from past predictions into subsequent predictions is limited. In this paper, we introduce a novel multi-stage training framework called Temporal Continual Learning (TCL) to address the above challenges. To better preserve prior information, we introduce the Prior Compensation Factor (PCF). We incorporate it into the model training to compensate for the lost prior information. Furthermore, we derive a more reasonable optimization objective through theoretical derivation. It is important to note that our TCL framework can be easily integrated with different HMP backbone models and adapted to various datasets and applications. Extensive experiments on four HMP benchmark datasets demonstrate the effectiveness and flexibility of TCL. The code is available at https://github.com/hyqlat/TCL.
Chinese: 本文提出了一种时序持续学习(TCL)框架,通过引入先验补偿因子来保留历史信息并优化训练目标,有效解决了人体运动预测中的现有局限,在多个基准数据集上验证了其有效性。
English: This paper introduces a Temporal Continual Learning (TCL) framework with a Prior Compensation Factor to address limitations in Human Motion Prediction by preserving prior information and optimizing training objectives, demonstrating effectiveness across multiple datasets.
Authors:Christopher Wiedeman, Anastasiia Sarmakeeva, Elena Sizikova, Daniil Filienko, Miguel Lago, Jana G. Delfino, Aldo Badano
Abstract:
One of the key impediments for developing and assessing robust medical imaging algorithms is limited access to large-scale datasets with suitable annotations. Synthetic data generated with plausible physical and biological constraints may address some of these data limitations. We propose the use of physics simulations to generate synthetic images with pixel-level segmentation annotations, which are notoriously difficult to obtain. Specifically, we apply this approach to breast imaging analysis and release T-SYNTH, a large-scale open-source dataset of paired 2D digital mammography (DM) and 3D digital breast tomosynthesis (DBT) images. Our initial experimental results indicate that T-SYNTH images show promise for augmenting limited real patient datasets for detection tasks in DM and DBT. Our data and code are publicly available at https://github.com/DIDSR/tsynth-release.
中文摘要:开发稳健医学影像算法的主要障碍在于缺乏大规模标注数据集,而基于物理模拟生成的合成数据可缓解此问题;T-SYNTH乳腺影像数据集的实验表明,其能有效增强真实数据在检测任务中的表现。
English Summary: The development of robust medical imaging algorithms is hindered by limited annotated datasets, which can be addressed through physics-based synthetic data generation, as demonstrated by the T-SYNTH dataset for breast imaging that shows potential for augmenting real data in detection tasks.
Authors:Jingwei Shi, Zeyu Zhang, Biao Wu, Yanjie Liang, Meng Fang, Ling Chen, Yang Zhao
Abstract:
We present PresentAgent, a multimodal agent that transforms long-form documents into narrated presentation videos. While existing approaches are limited to generating static slides or text summaries, our method advances beyond these limitations by producing fully synchronized visual and spoken content that closely mimics human-style presentations. To achieve this integration, PresentAgent employs a modular pipeline that systematically segments the input document, plans and renders slide-style visual frames, generates contextual spoken narration with large language models and Text-to-Speech models, and seamlessly composes the final video with precise audio-visual alignment. Given the complexity of evaluating such multimodal outputs, we introduce PresentEval, a unified assessment framework powered by Vision-Language Models that comprehensively scores videos across three critical dimensions: content fidelity, visual clarity, and audience comprehension through prompt-based evaluation. Our experimental validation on a curated dataset of 30 document-presentation pairs demonstrates that PresentAgent approaches human-level quality across all evaluation metrics. These results highlight the significant potential of controllable multimodal agents in transforming static textual materials into dynamic, effective, and accessible presentation formats. Code will be available at https://github.com/AIGeeksGroup/PresentAgent.
中文:PresentAgent是一种将长篇文档转化为同步视觉与语音叙述视频的多模态系统,其通过新型评估框架PresentEval验证,在各项关键指标上均展现出接近人类水平的演示质量。
English: PresentAgent is a multimodal system that converts long documents into synchronized narrated videos with human-like presentation quality, evaluated by the novel PresentEval framework which confirms its near-human performance across key metrics.
Authors:Andrii Kliachkin, Jana LepÅ¡ová, Gilles Bareilles, Jakub MareÄek
Abstract:
The ability to train Deep Neural Networks (DNNs) with constraints is instrumental in improving the fairness of modern machine-learning models. Many algorithms have been analysed in recent years, and yet there is no standard, widely accepted method for the constrained training of DNNs. In this paper, we provide a challenging benchmark of real-world large-scale fairness-constrained learning tasks, built on top of the US Census (Folktables). We point out the theoretical challenges of such tasks and review the main approaches in stochastic approximation algorithms. Finally, we demonstrate the use of the benchmark by implementing and comparing three recently proposed, but as-of-yet unimplemented, algorithms both in terms of optimization performance, and fairness improvement. We release the code of the benchmark as a Python package at https://github.com/humancompatible/train.
中文:本文基于美国人口普查数据,提出了一个具有挑战性的公平约束深度神经网络训练基准,并比较了三种未实施算法在优化性能和公平性改进方面的表现。
English: This paper introduces a challenging benchmark for training Deep Neural Networks with fairness constraints, based on the US Census data, and compares three unimplemented algorithms to evaluate their optimization and fairness performance.
Authors:Nayeon Kim, Eojin Jeon, Jun-Hyung Park, SangKeun Lee
Abstract:
In this study, we introduce KOPL, a novel framework for handling Korean OOV words with Phoneme representation Learning. Our work is based on the linguistic property of Korean as a phonemic script, the high correlation between phonemes and letters. KOPL incorporates phoneme and word representations for Korean OOV words, facilitating Korean OOV word representations to capture both text and phoneme information of words. We empirically demonstrate that KOPL significantly improves the performance on Korean Natural Language Processing (NLP) tasks, while being readily integrated into existing static and contextual Korean embedding models in a plug-and-play manner. Notably, we show that KOPL outperforms the state-of-the-art model by an average of 1.9%. Our code is available at https://github.com/jej127/KOPL.git.
Chinese: KOPL是一种创新框架,通过融合韩语词汇的音素和词表示来处理未登录词,显著提升了韩语自然语言处理任务的性能,平均优于现有最优模型1.9%,并能以即插即用方式兼容现有嵌入模型。
English: KOPL is a novel framework that enhances Korean NLP tasks by integrating phoneme and word representations for out-of-vocabulary words, achieving a 1.9% average improvement over state-of-the-art models and offering plug-and-play compatibility with existing embedding models.
Authors:Ziyang Miao, Qiyu Sun, Jingyuan Wang, Yuchen Gong, Yaowei Zheng, Shiqi Li, Richong Zhang
Abstract:
Large language models (LLMs) have shown impressive performance on general-purpose tasks, yet adapting them to specific domains remains challenging due to the scarcity of high-quality domain data. Existing data synthesis tools often struggle to extract reliable fine-tuning data from heterogeneous documents effectively. To address this limitation, we propose Easy Dataset, a unified framework for synthesizing fine-tuning data from unstructured documents via an intuitive graphical user interface (GUI). Specifically, Easy Dataset allows users to easily configure text extraction models and chunking strategies to transform raw documents into coherent text chunks. It then leverages a persona-driven prompting approach to generate diverse question-answer pairs using public-available LLMs. Throughout the pipeline, a human-in-the-loop visual interface facilitates the review and refinement of intermediate outputs to ensure data quality. Experiments on a financial question-answering task show that fine-tuning LLMs on the synthesized dataset significantly improves domain-specific performance while preserving general knowledge. The source code and installable package are available at https://github.com/ConardLi/easy-dataset and have garnered over 9,000 GitHub stars.
中文: Easy Dataset框架通过直观的图形界面,让用户能配置文本提取模型和分块策略,将非结构化文档转化为连贯文本块,并采用角色驱动提示方法生成多样化问答对,结合人工监督确保数据质量,实验表明基于该合成数据微调的大语言模型在特定领域任务中性能显著提升。
English: The Easy Dataset framework addresses the challenge of domain adaptation for large language models by providing a unified GUI tool that synthesizes high-quality fine-tuning data from unstructured documents through configurable extraction models and persona-driven prompting, with human oversight ensuring data quality and experimental results showing significant performance improvements in domain-specific tasks.
Authors:Seungjin Jung, Kanghee Lee, Yonghyun Jeong, Haeun Noh, Jungmin Lee, Jongwon Choi
Abstract:
Domain Generalizable Face Anti-Spoofing (DGFAS) methods effectively capture domain-invariant features by aligning the directions (weights) of local decision boundaries across domains. However, the bias terms associated with these boundaries remain misaligned, leading to inconsistent classification thresholds and degraded performance on unseen target domains. To address this issue, we propose a novel DGFAS framework that jointly aligns weights and biases through Feature Orthogonal Decomposition (FOD) and Group-wise Scaling Risk Minimization (GS-RM). Specifically, GS-RM facilitates bias alignment by balancing group-wise losses across multiple domains. FOD employs the Gram-Schmidt orthogonalization process to decompose the feature space explicitly into domain-invariant and domain-specific subspaces. By enforcing orthogonality between domain-specific and domain-invariant features during training using domain labels, FOD ensures effective weight alignment across domains without negatively impacting bias alignment. Additionally, we introduce Expected Calibration Error (ECE) as a novel evaluation metric for quantitatively assessing the effectiveness of our method in aligning bias terms across domains. Extensive experiments on benchmark datasets demonstrate that our approach achieves state-of-the-art performance, consistently improving accuracy, reducing bias misalignment, and enhancing generalization stability on unseen target domains.
中文摘要:该研究提出的DGFAS框架通过特征正交分解和组间缩放风险最小化技术,联合校准权重与偏置项,有效解决了跨域人脸活体检测中的偏置失准问题,在多个基准数据集上实现了最优性能。
English Summary: The proposed DGFAS framework addresses bias misalignment in face anti-spoofing systems by jointly aligning weights and biases through Feature Orthogonal Decomposition and Group-wise Scaling Risk Minimization, achieving state-of-the-art performance across domains.
Authors:Siyu Li, Fei Teng, Yihong Cao, Kailun Yang, Zhiyong Li, Yaonan Wang
Abstract:
Birds' Eye View (BEV) semantic segmentation is an indispensable perception task in end-to-end autonomous driving systems. Unsupervised and semi-supervised learning for BEV tasks, as pivotal for real-world applications, underperform due to the homogeneous distribution of the labeled data. In this work, we explore the potential of synthetic data from driving world models to enhance the diversity of labeled data for robustifying BEV segmentation. Yet, our preliminary findings reveal that generation noise in synthetic data compromises efficient BEV model learning. To fully harness the potential of synthetic data from world models, this paper proposes NRSeg, a noise-resilient learning framework for BEV semantic segmentation. Specifically, a Perspective-Geometry Consistency Metric (PGCM) is proposed to quantitatively evaluate the guidance capability of generated data for model learning. This metric originates from the alignment measure between the perspective road mask of generated data and the mask projected from the BEV labels. Moreover, a Bi-Distribution Parallel Prediction (BiDPP) is designed to enhance the inherent robustness of the model, where the learning process is constrained through parallel prediction of multinomial and Dirichlet distributions. The former efficiently predicts semantic probabilities, whereas the latter adopts evidential deep learning to realize uncertainty quantification. Furthermore, a Hierarchical Local Semantic Exclusion (HLSE) module is designed to address the non-mutual exclusivity inherent in BEV semantic segmentation tasks. Experimental results demonstrate that NRSeg achieves state-of-the-art performance, yielding the highest improvements in mIoU of 13.8% and 11.4% in unsupervised and semi-supervised BEV segmentation tasks, respectively. The source code will be made publicly available at https://github.com/lynn-yu/NRSeg.
中文: 本文提出NRSeg框架,通过一致性度量和鲁棒学习模块利用合成数据增强鸟瞰图语义分割,在无监督和半监督任务中实现了最先进的性能提升。
English: This paper introduces NRSeg, a noise-resilient framework that enhances BEV semantic segmentation by leveraging synthetic data through a consistency metric and robust learning modules, achieving state-of-the-art performance improvements in unsupervised and semi-supervised tasks.
Authors:Kai Ye, Tianyi Chen, Zhen Wang
Abstract:
With the increasing adoption of diffusion models for image generation and personalization, concerns regarding privacy breaches and content misuse have become more pressing. In this study, we conduct a comprehensive comparison of eight perturbation based protection methods: AdvDM, ASPL, FSGM, MetaCloak, Mist, PhotoGuard, SDS, and SimAC--across both portrait and artwork domains. These methods are evaluated under varying perturbation budgets, using a range of metrics to assess visual imperceptibility and protective efficacy. Our results offer practical guidance for method selection. Code is available at: https://github.com/vkeilo/DiffAdvPerturbationBench.
中文摘要:本研究全面评估了八种基于扰动的扩散模型保护方法,以应对隐私泄露和内容滥用问题,为在肖像和艺术作品领域选择有效技术提供了实用指导。
English Summary: This study comprehensively evaluates eight perturbation-based protection methods for diffusion models to address privacy and misuse concerns, providing practical guidance for selecting effective techniques across portrait and artwork domains.
Authors:Elian Neppel, Ashutosh Mishra, Shamistan Karimov, Kentaro Uno, Shreya Santra, Kazuya Yoshida
Abstract:
Modular robotics holds immense potential for space exploration, where reliability, repairability, and reusability are critical for cost-effective missions. Coordination between heterogeneous units is paramount for precision tasks -- whether in manipulation, legged locomotion, or multi-robot interaction. Such modular systems introduce challenges far exceeding those in monolithic robot architectures. This study presents a robust method for synchronizing the trajectories of multiple heterogeneous actuators, adapting dynamically to system variations with minimal system knowledge. This design makes it inherently robot-agnostic, thus highly suited for modularity. To ensure smooth trajectory adherence, the multidimensional state is constrained within a hypersphere representing the allowable deviation. The distance metric can be adapted hence, depending on the task and system under control, deformation of the constraint region is possible. This approach is compatible with a wide range of robotic platforms and serves as a core interface for Motion-Stack, our new open-source universal framework for limb coordination (available at https://github.com/2lian/Motion-Stack ). The method is validated by synchronizing the end-effectors of six highly heterogeneous robotic limbs, evaluating both trajectory adherence and recovery from significant external disturbances.
中文摘要:本研究提出了一种机器人无关的异构模块机器人同步方法,能动态适应系统变化并将运动约束在可调偏差范围内,已通过协调六个高度异构的机器人肢体得到验证。
English Summary: This study introduces a robot-agnostic synchronization method for heterogeneous modular robots that dynamically adapts to system variations while constraining motion within adjustable deviation limits, validated through coordination of six diverse robotic limbs.
Authors:Ha-Hieu Pham, Nguyen Lan Vi Vu, Thanh-Huy Nguyen, Ulas Bagci, Min Xu, Trung-Nghia Le, Huy-Hieu Pham
Abstract:
Accurate gland segmentation in histopathology images is essential for cancer diagnosis and prognosis. However, significant variability in Hematoxylin and Eosin (H&E) staining and tissue morphology, combined with limited annotated data, poses major challenges for automated segmentation. To address this, we propose Color-Structure Dual-Student (CSDS), a novel semi-supervised segmentation framework designed to learn disentangled representations of stain appearance and tissue structure. CSDS comprises two specialized student networks: one trained on stain-augmented inputs to model chromatic variation, and the other on structure-augmented inputs to capture morphological cues. A shared teacher network, updated via Exponential Moving Average (EMA), supervises both students through pseudo-labels. To further improve label reliability, we introduce stain-aware and structure-aware uncertainty estimation modules that adaptively modulate the contribution of each student during training. Experiments on the GlaS and CRAG datasets show that CSDS achieves state-of-the-art performance in low-label settings, with Dice score improvements of up to 1.2% on GlaS and 0.7% on CRAG at 5% labeled data, and 0.7% and 1.4% at 10%. Our code and pre-trained models are available at https://github.com/hieuphamha19/CSDS.
中文:提出的颜色-结构双学生(CSDS)框架通过两个专门的学生网络分别学习染色和结构特征,解决了组织病理学中腺体分割的难题,在有限标注数据下取得了最优性能。
English: The proposed Color-Structure Dual-Student (CSDS) framework addresses gland segmentation challenges in histopathology by using two specialized student networks to disentangle stain and structural features, achieving state-of-the-art results with limited labeled data.
Authors:Ikuya Yamada, Ryokan Ri, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo
Abstract:
Dense retrievers often struggle with queries involving less-frequent entities due to their limited entity knowledge. We propose the Knowledgeable Passage Retriever (KPR), a BERT-based retriever enhanced with a context-entity attention layer and dynamically updatable entity embeddings. This design enables KPR to incorporate external entity knowledge without retraining. Experiments on three datasets demonstrate that KPR consistently improves retrieval accuracy, with particularly large gains on the EntityQuestions dataset. When built on the off-the-shelf bge-base retriever, KPR achieves state-of-the-art performance among similarly sized models on two datasets. Models and code are released at https://github.com/knowledgeable-embedding/knowledgeable-embedding.
Chinese: 知识化段落检索器(KPR)通过上下文实体注意力层和动态更新的实体嵌入,有效融入外部实体知识,无需重新训练即可显著提升检索精度,并在多个数据集上实现最优性能。
English: The Knowledgeable Passage Retriever (KPR) enhances dense retrieval by integrating external entity knowledge through a context-entity attention layer and dynamic embeddings, achieving state-of-the-art performance without retraining.
Authors:Shubin Ma, Liang Zhao, Mingdong Lu, Yifan Guo, Bo Xu
Abstract:
Multimodal representation is faithful and highly effective in describing real-world data samples' characteristics by describing their complementary information. However, the collected data often exhibits incomplete and misaligned characteristics due to factors such as inconsistent sensor frequencies and device malfunctions. Existing research has not effectively addressed the issue of filling missing data in scenarios where multiview data are both imbalanced and misaligned. Instead, it relies on class-level alignment of the available data. Thus, it results in some data samples not being well-matched, thereby affecting the quality of data fusion. In this paper, we propose the Consistency-Aware Padding for Incomplete Multimodal Alignment Clustering Based on Self-Repellent Greedy Anchor Search(CAPIMAC) to tackle the problem of filling imbalanced and misaligned data in multimodal datasets. Specifically, we propose a self-repellent greedy anchor search module(SRGASM), which employs a self-repellent random walk combined with a greedy algorithm to identify anchor points for re-representing incomplete and misaligned multimodal data. Subsequently, based on noise-contrastive learning, we design a consistency-aware padding module (CAPM) to effectively interpolate and align imbalanced and misaligned data, thereby improving the quality of multimodal data fusion. Experimental results demonstrate the superiority of our method over benchmark datasets. The code will be publicly released at https://github.com/Autism-mm/CAPIMAC.git.
Chinese: 本文提出CAPIMAC方法,通过自排斥贪婪锚点搜索和一致性感知填充技术,有效处理多模态数据的不完整与未对齐问题,提升数据融合质量并在基准数据集上表现优越。
English: The paper introduces CAPIMAC, a method that uses a self-repellent greedy anchor search and consistency-aware padding to address incomplete and misaligned multimodal data, enhancing fusion quality and outperforming benchmarks.
Authors:Yifan Jiang, Yibo Xue, Yukun Kang, Pin Zheng, Jian Peng, Feiran Wu, Changliang Xu
Abstract:
Slide animations, such as fade-in, fly-in, and wipe, are critical for audience engagement, efficient information delivery, and vivid visual expression. However, most AI-driven slide-generation tools still lack native animation support, and existing vision-language models (VLMs) struggle with animation tasks due to the absence of public datasets and limited temporal-reasoning capabilities. To address this gap, we release the first public dataset for slide-animation modeling: 12,000 triplets of natural-language descriptions, animation JSON files, and rendered videos, collectively covering every built-in PowerPoint effect. Using this resource, we fine-tune Qwen-2.5-VL-7B with Low-Rank Adaptation (LoRA) and achieve consistent improvements over GPT-4.1 and Gemini-2.5-Pro in BLEU-4, ROUGE-L, SPICE, and our Coverage-Order-Detail Assessment (CODA) metric, which evaluates action coverage, temporal order, and detail fidelity. On a manually created test set of slides, the LoRA model increases BLEU-4 by around 60%, ROUGE-L by 30%, and shows significant improvements in CODA-detail. This demonstrates that low-rank adaptation enables reliable temporal reasoning and generalization beyond synthetic data. Overall, our dataset, LoRA-enhanced model, and CODA metric provide a rigorous benchmark and foundation for future research on VLM-based dynamic slide generation.
Chinese: 本研究发布了首个幻灯片动画建模公共数据集,并通过LoRA微调Qwen-2.5-VL-7B模型,在时序推理和动画生成质量上显著超越主流模型,为动态幻灯片生成领域建立了新基准。
English: This work introduces the first public dataset for slide animation modeling and demonstrates that fine-tuning Qwen-2.5-VL-7B with LoRA significantly outperforms leading models in temporal reasoning and animation generation, establishing a new benchmark for dynamic slide creation.
Authors:Ishan Khurjekar, Indrashish Saha, Lori Graham-Brady, Somdatta Goswami
Abstract:
Systems governed by partial differential equations (PDEs) require computationally intensive numerical solvers to predict spatiotemporal field evolution. While machine learning (ML) surrogates offer faster solutions, autoregressive inference with ML models suffer from error accumulation over successive predictions, limiting their long-term accuracy. We propose a deep ensemble framework to address this challenge, where multiple ML surrogate models with random weight initializations are trained in parallel and aggregated during inference. This approach leverages the diversity of model predictions to mitigate error propagation while retaining the autoregressive strategies ability to capture the system's time dependent relations. We validate the framework on three PDE-driven dynamical systems - stress evolution in heterogeneous microstructures, Gray-Scott reaction-diffusion, and planetary-scale shallow water system - demonstrating consistent reduction in error accumulation over time compared to individual models. Critically, the method requires only a few time steps as input, enabling full trajectory predictions with inference times significantly faster than numerical solvers. Our results highlight the robustness of ensemble methods in diverse physical systems and their potential as efficient and accurate alternatives to traditional solvers. The codes for this work are available on GitHub (https://github.com/Graham-Brady-Research-Group/AutoregressiveEnsemble_SpatioTemporal_Evolution).
中文: 作者提出了一种深度集成框架,通过并行训练多个机器学习代理模型并聚合其预测,有效减少了偏微分方程系统中自回归推理的误差累积,在三个物理应用中验证了其准确性和效率的提升。
English: The authors propose a deep ensemble framework that trains multiple machine learning surrogates in parallel and aggregates their predictions to reduce error accumulation in autoregressive inference for PDE-based systems, demonstrating improved accuracy and efficiency across three physical applications.
Authors:Jiaqi Zhang, Juntuo Wang, Zhixin Sun, John Zou, Randall Balestriero
Abstract:
Large-scale vision foundation models such as DINOv2 boast impressive performances by leveraging massive architectures and training datasets. But numerous scenarios require practitioners to reproduce those pre-training solutions, such as on private data, new modalities, or simply for scientific questioning--which is currently extremely demanding computation-wise. We thus propose a novel pre-training strategy for DINOv2 that simultaneously accelerates convergence--and strengthens robustness to common corruptions as a by-product. Our approach involves a frequency filtering curriculum--low-frequency being seen first--and the Gaussian noise patching augmentation. Applied to a ViT-B/16 backbone trained on ImageNet-1K, while pre-training time and FLOPs are reduced by 1.6x and 2.25x, our method still achieves matching robustness in corruption benchmarks (ImageNet-C) and maintains competitive linear probing performance compared with baseline. This dual benefit of efficiency and robustness makes large-scale self-supervised foundation modeling more attainable, while opening the door to novel exploration around data curriculum and augmentation as means to improve self-supervised learning models robustness. The code is available at https://github.com/KevinZ0217/fast_dinov2
Chinese: 本文提出了一种新颖的DINOv2预训练策略,通过频率过滤课程和高斯噪声增强技术,在加速收敛的同时提升了模型鲁棒性,在显著减少计算成本的情况下,仍能在ImageNet-C基准测试中保持相当的鲁棒性表现和线性探测性能。
English: This paper introduces a novel pre-training strategy for DINOv2 that accelerates convergence and enhances robustness through frequency filtering and Gaussian noise augmentation, achieving significant computational savings while maintaining competitive performance on corruption benchmarks and linear probing tasks.
Authors:Yansong Peng, Kai Zhu, Yu Liu, Pingyu Wu, Hebei Li, Xiaoyan Sun, Feng Wu
Abstract:
Continuous-time Consistency Models (CMs) promise efficient few-step generation but face significant challenges with training instability. We argue this instability stems from a fundamental conflict: by training a network to learn only a shortcut across a probability flow, the model loses its grasp on the instantaneous velocity field that defines the flow. Our solution is to explicitly anchor the model in the underlying flow during training. We introduce the Flow-Anchored Consistency Model (FACM), a simple but effective training strategy that uses a Flow Matching (FM) task as an anchor for the primary CM shortcut objective. This Flow-Anchoring approach requires no architectural modifications and is broadly compatible with standard model architectures. By distilling a pre-trained LightningDiT model, our method achieves a state-of-the-art FID of 1.32 with two steps (NFE=2) and 1.76 with just one step (NFE=1) on ImageNet 256x256, significantly outperforming previous methods. This provides a general and effective recipe for building high-performance, few-step generative models. Our code and pretrained models: https://github.com/ali-vilab/FACM.
中文摘要:流锚定一致性模型(FACM)通过引入流匹配任务作为训练锚点,有效解决了连续性一致性模型的训练不稳定问题,在无需改变架构的情况下实现了最优的少步生成性能。
English Summary: The Flow-Anchored Consistency Model (FACM) resolves training instability in Consistency Models by incorporating a Flow Matching objective as an anchor, achieving state-of-the-art few-step generation performance without architectural changes.
Authors:Zhiling Yan, Sifan Song, Dingjie Song, Yiwei Li, Rong Zhou, Weixiang Sun, Zhennong Chen, Sekeun Kim, Hui Ren, Tianming Liu, Quanzheng Li, Xiang Li, Lifang He, Lichao Sun
Abstract:
Recent "segment anything" efforts show promise by learning from large-scale data, but adapting such models directly to medical images remains challenging due to the complexity of medical data, noisy annotations, and continual learning requirements across diverse modalities and anatomical structures. In this work, we propose SAMed-2, a new foundation model for medical image segmentation built upon the SAM-2 architecture. Specifically, we introduce a temporal adapter into the image encoder to capture image correlations and a confidence-driven memory mechanism to store high-certainty features for later retrieval. This memory-based strategy counters the pervasive noise in large-scale medical datasets and mitigates catastrophic forgetting when encountering new tasks or modalities. To train and evaluate SAMed-2, we curate MedBank-100k, a comprehensive dataset spanning seven imaging modalities and 21 medical segmentation tasks. Our experiments on both internal benchmarks and 10 external datasets demonstrate superior performance over state-of-the-art baselines in multi-task scenarios. The code is available at: https://github.com/ZhilingYan/Medical-SAM-Bench.
中文: SAMed-2是一种医学图像分割基础模型,通过时序适配器和置信度驱动记忆机制解决数据复杂性和噪声问题,在多项任务和数据集上展现出卓越性能。
English: SAMed-2 is a medical image segmentation foundation model that addresses data complexity and noise through a temporal adapter and confidence-driven memory, demonstrating superior performance across multiple tasks and datasets.
Authors:Ankit Sonthalia, Arnas Uselis, Seong Joon Oh
Abstract:
We study whether visual embedding models capture continuous, ordinal attributes along linear directions, which we term _rank axes_. We define a model as _rankable_ for an attribute if projecting embeddings onto such an axis preserves the attribute's order. Across 7 popular encoders and 9 datasets with attributes like age, crowd count, head pose, aesthetics, and recency, we find that many embeddings are inherently rankable. Surprisingly, a small number of samples, or even just two extreme examples, often suffice to recover meaningful rank axes, without full-scale supervision. These findings open up new use cases for image ranking in vector databases and motivate further study into the structure and learning of rankable embeddings. Our code is available at https://github.com/aktsonthalia/rankable-vision-embeddings.
中文: 研究表明视觉嵌入模型能通过线性排序轴自然捕捉有序属性,仅需少量样本即可在多数据集和编码器中实现有效排序,为向量数据库的图像排序应用开辟了新途径。
English: This research demonstrates that visual embedding models inherently capture ordinal attributes through linear rank axes, requiring minimal supervision to recover meaningful rankings across diverse datasets and encoders.
Authors:José A. Pardo, Tomás Bernal, Jaime Ãiguez, Ana Luisa Gil-MartÃnez, Laura Ibañez, José T. Palma, Juan A. BotÃa, Alicia Gómez-Pascual
Abstract:
Inconsistencies between clinical and omics data may arise within medical cohorts. The identification, annotation and explanation of anomalous omics-based patients or individuals may become crucial to better reshape the disease, e.g., by detecting early onsets signaled by the omics and undetectable from observable symptoms. Here, we developed MLASDO (Machine Learning based Anomalous Sample Detection on Omics), a new method and software tool to identify, characterize and automatically describe anomalous samples based on omics data. Its workflow is based on three steps: (1) classification of healthy and cases individuals using a support vector machine algorithm; (2) detection of anomalous samples within groups; (3) explanation of anomalous individuals based on clinical data and expert knowledge. We showcase MLASDO using transcriptomics data of 317 healthy controls (HC) and 465 Parkinson's disease (PD) cases from the Parkinson's Progression Markers Initiative. In this cohort, MLASDO detected 15 anomalous HC with a PD-like transcriptomic signature and PD-like clinical features, including a lower proportion of CD4/CD8 naive T-cells and CD4 memory T-cells compared to HC (P<3.5*10^-3). MLASDO also identified 22 anomalous PD cases with a transcriptomic signature more similar to that of HC and some clinical features more similar to HC, including a lower proportion of mature neutrophils compared to PD cases (P<6*10^-3). In summary, MLASDO is a powerful tool that can help the clinician to detect and explain anomalous HC and cases of interest to be followed up. MLASDO is an open-source R package available at: https://github.com/JoseAdrian3/MLASDO.
中文: MLASDO是一种基于机器学习的工具,能检测并解释组学数据中的异常样本,帮助临床医生识别具有非典型疾病特征的患者以供后续跟踪。
English: MLASDO is a machine learning tool that detects and explains anomalous samples in omics data, helping clinicians identify patients with atypical disease signatures for further follow-up.
Authors:Yingxu Wang, Siwei Liu, Jinyuan Fang, Zaiqiao Meng
Abstract:
Multi-agent systems (MAS) have emerged as a powerful paradigm for orchestrating large language models (LLMs) and specialized tools to collaboratively address complex tasks. However, existing MAS frameworks often require manual workflow configuration and lack native support for dynamic evolution and performance optimization. In addition, many MAS optimization algorithms are not integrated into a unified framework. In this paper, we present EvoAgentX, an open-source platform that automates the generation, execution, and evolutionary optimization of multi-agent workflows. EvoAgentX employs a modular architecture consisting of five core layers: the basic components, agent, workflow, evolving, and evaluation layers. Specifically, within the evolving layer, EvoAgentX integrates three MAS optimization algorithms, TextGrad, AFlow, and MIPRO, to iteratively refine agent prompts, tool configurations, and workflow topologies. We evaluate EvoAgentX on HotPotQA, MBPP, and MATH for multi-hop reasoning, code generation, and mathematical problem solving, respectively, and further assess it on real-world tasks using GAIA. Experimental results show that EvoAgentX consistently achieves significant performance improvements, including a 7.44% increase in HotPotQA F1, a 10.00% improvement in MBPP pass@1, a 10.00% gain in MATH solve accuracy, and an overall accuracy improvement of up to 20.00% on GAIA. The source code is available at: https://github.com/EvoAgentX/EvoAgentX
中文: EvoAgentX是一个开源平台,能自动生成、执行和进化优化多智能体工作流,集成了TextGrad、AFlow和MIPRO等算法,在推理、代码生成和数学解题任务中显著提升性能。
English: EvoAgentX is an open-source platform that automates the creation, execution, and evolutionary optimization of multi-agent workflows, integrating algorithms like TextGrad, AFlow, and MIPRO to enhance performance across reasoning, coding, and math tasks with significant improvements.
Authors:Yana Hasson, Pauline Luc, Liliane Momeni, Maks Ovsjanikov, Guillaume Le Moing, Alina Kuznetsova, Ira Ktena, Jennifer J. Sun, Skanda Koppula, Dilara Gokay, Joseph Heyward, Etienne Pot, Andrew Zisserman
Abstract:
In recent years, there has been a proliferation of spatiotemporal foundation models in different scientific disciplines. While promising, these models are often domain-specific and are only assessed within the particular applications for which they are designed. Given that many tasks can be represented as video modeling problems, video foundation models (ViFMs) hold considerable promise as general-purpose domain-agnostic approaches. However, it is not known whether the knowledge acquired on large-scale but potentially out-of-domain data can be effectively transferred across diverse scientific disciplines, and if a single, pretrained ViFM can be competitive with domain-specific baselines. To address this, we introduce SciVid, a comprehensive benchmark comprising five *Sci*entific *Vid*eo tasks, across medical computer vision, animal behavior, and weather forecasting. We adapt six leading ViFMs to SciVid using simple trainable readout modules, establishing strong baselines and demonstrating the potential for effective transfer learning. Specifically, we show that state-of-the-art results can be obtained in several applications by leveraging the general-purpose representations from ViFM backbones. Furthermore, our results reveal the limitations of existing ViFMs, and highlight opportunities for the development of generalizable models for high-impact scientific applications. We release our code at https://github.com/google-deepmind/scivid to facilitate further research in the development of ViFMs.
Chinese: 视频基础模型在科学应用中展现出作为通用工具的潜力,SciVid基准测试证明其通过迁移学习可获得先进成果,同时也揭示了现有模型的局限性。
English: Video foundation models show potential as general-purpose tools for scientific applications, with the SciVid benchmark demonstrating their ability to achieve state-of-the-art results through transfer learning while also revealing current limitations.
Authors:Eva Seidlmayer, Lukas Galke, Konrad U. Förstner
Abstract:
Disseminators of disinformation often seek to attract attention or evoke emotions - typically to gain influence or generate revenue - resulting in distinctive rhetorical patterns that can be exploited by machine learning models. In this study, we explore linguistic and rhetorical features as proxies for distinguishing disinformative texts from other health and life-science text genres, applying both large language models and classical machine learning classifiers. Given the limitations of existing datasets, which mainly focus on fact checking misinformation, we introduce Four Shades of Life Sciences (FSoLS): a novel, labeled corpus of 2,603 texts on 14 life-science topics, retrieved from 17 diverse sources and classified into four categories of life science publications. The source code for replicating, and updating the dataset is available on GitHub: https://github.com/EvaSeidlmayer/FourShadesofLifeSciences
中文摘要:本研究引入FSoLS语料库,通过大型语言模型和传统分类器分析虚假信息的修辞特征,以区分健康与生命科学领域的误导性文本。
English Summary: This study introduces the FSoLS corpus to analyze distinctive rhetorical patterns in disinformation, using both large language models and classical classifiers to distinguish deceptive texts in health and life sciences.
Authors:Gulcin Baykal, Abdullah Akgül, Manuel Haussmann, Bahareh Tasdighi, Nicklas Werge, Yi-Shan Wu, Melih Kandemir
Abstract:
ObjectRL is an open-source Python codebase for deep reinforcement learning (RL), designed for research-oriented prototyping with minimal programming effort. Unlike existing codebases, ObjectRL is built on Object-Oriented Programming (OOP) principles, providing a clear structure that simplifies the implementation, modification, and evaluation of new algorithms. ObjectRL lowers the entry barrier for deep RL research by organizing best practices into explicit, clearly separated components, making them easier to understand and adapt. Each algorithmic component is a class with attributes that describe key RL concepts and methods that intuitively reflect their interactions. The class hierarchy closely follows common ontological relationships, enabling data encapsulation, inheritance, and polymorphism, which are core features of OOP. We demonstrate the efficiency of ObjectRL's design through representative use cases that highlight its flexibility and suitability for rapid prototyping. The documentation and source code are available at https://objectrl.readthedocs.io and https://github.com/adinlab/objectrl .
ObjectRL是一个基于面向对象编程的开源Python深度强化学习框架,它通过清晰的组件分离简化算法实现与评估,显著降低了研究门槛并提升了开发效率。
ObjectRL is an open-source Python framework for deep reinforcement learning that employs object-oriented programming to streamline algorithm development and prototyping, making research more accessible and efficient.
Authors:Pablo Alonso-Jiménez, Pedro Ramoneda, R. Oguz Araz, Andrea Poltronieri, Dmitry Bogdanov
Abstract:
Developing open-source foundation models is essential for advancing research in music audio understanding and ensuring access to powerful, multipurpose representations for music information retrieval. We present OMAR-RQ, a model trained with self-supervision via masked token classification methodologies using a large-scale dataset with over 330,000 hours of music audio. We experiment with different input features and quantization options, and achieve state-of-the-art performance in music tagging, pitch estimation, chord recognition, beat tracking, segmentation, and difficulty estimation among open self-supervised models. We open-source our training and evaluation pipelines and model weights, available at https://github.com/mtg/omar-rq.
中文: OMAR-RQ是一个基于33万小时音乐音频训练的开源自监督模型,在多项音乐信息检索任务中实现最优性能,并公开了代码与模型权重。
English: OMAR-RQ is an open-source self-supervised model trained on over 330,000 hours of music audio, achieving state-of-the-art performance in multiple music information retrieval tasks and making its code and weights publicly available.
Authors:Zetian Feng, Juan Fu, Xuebin Zou, Hongsheng Ye, Hong Wu, Jianhua Zhou, Yi Wang
Abstract:
Prostate cancer (PCa) is a leading cause of cancer-related mortality in men, and accurate identification of clinically significant PCa (csPCa) is critical for timely intervention. Transrectal ultrasound (TRUS) is widely used for prostate biopsy; however, its low contrast and anisotropic spatial resolution pose diagnostic challenges. To address these limitations, we propose a novel hybrid-view attention (HVA) network for csPCa classification in 3D TRUS that leverages complementary information from transverse and sagittal views. Our approach integrates a CNN-transformer hybrid architecture, where convolutional layers extract fine-grained local features and transformer-based HVA models global dependencies. Specifically, the HVA comprises intra-view attention to refine features within a single view and cross-view attention to incorporate complementary information across views. Furthermore, a hybrid-view adaptive fusion module dynamically aggregates features along both channel and spatial dimensions, enhancing the overall representation. Experiments are conducted on an in-house dataset containing 590 subjects who underwent prostate biopsy. Comparative and ablation results prove the efficacy of our method. The code is available at https://github.com/mock1ngbrd/HVAN.
中文摘要:本研究提出一种混合视角注意力网络,通过结合卷积和变换器架构,利用横断面与矢状面的互补信息,对三维经直肠超声图像中的临床显著前列腺癌进行分类。
English Summary: This study introduces a hybrid-view attention network for classifying clinically significant prostate cancer in 3D TRUS images by combining convolutional and transformer architectures to leverage complementary transverse and sagittal view information.
Authors:Qing Li, Huifang Feng, Xun Gong, Yu-Shen Liu
Abstract:
Estimating normals for noisy point clouds is a persistent challenge in 3D geometry processing, particularly for end-to-end oriented normal estimation. Existing methods generally address relatively clean data and rely on supervised priors to fit local surfaces within specific neighborhoods. In this paper, we propose a novel approach for learning normals from noisy point clouds through local gradient-aware surface filtering. Our method projects noisy points onto the underlying surface by utilizing normals and distances derived from an implicit function constrained by local gradients. We start by introducing a distance measurement operator for global surface fitting on noisy data, which integrates projected distances along normals. Following this, we develop an implicit field-based filtering approach for surface point construction, adding projection constraints on these points during filtering. To address issues of over-smoothing and gradient degradation, we further incorporate local gradient consistency constraints, as well as local gradient orientation and aggregation. Comprehensive experiments on normal estimation, surface reconstruction, and point cloud denoising demonstrate the state-of-the-art performance of our method. The source code and trained models are available at https://github.com/LeoQLi/LGSF.
中文摘要:本文提出了一种通过局部梯度感知表面滤波从噪声点云学习法向量的新方法,利用隐函数约束的局部梯度将噪声点投影到潜在表面,有效解决了过平滑和梯度退化问题。
English Summary: This paper introduces a novel method for estimating normals from noisy point clouds using local gradient-aware surface filtering, which projects noisy points onto underlying surfaces through implicit functions constrained by local gradients to prevent over-smoothing and gradient degradation.
Authors:Yufan Zhou, Zhaobo Qi, Lingshuai Lin, Junqi Jing, Tingting Chai, Beichen Zhang, Shuhui Wang, Weigang Zhang
Abstract:
In this paper, we address the challenge of procedure planning in instructional videos, aiming to generate coherent and task-aligned action sequences from start and end visual observations. Previous work has mainly relied on text-level supervision to bridge the gap between observed states and unobserved actions, but it struggles with capturing intricate temporal relationships among actions. Building on these efforts, we propose the Masked Temporal Interpolation Diffusion (MTID) model that introduces a latent space temporal interpolation module within the diffusion model. This module leverages a learnable interpolation matrix to generate intermediate latent features, thereby augmenting visual supervision with richer mid-state details. By integrating this enriched supervision into the model, we enable end-to-end training tailored to task-specific requirements, significantly enhancing the model's capacity to predict temporally coherent action sequences. Additionally, we introduce an action-aware mask projection mechanism to restrict the action generation space, combined with a task-adaptive masked proximity loss to prioritize more accurate reasoning results close to the given start and end states over those in intermediate steps. Simultaneously, it filters out task-irrelevant action predictions, leading to contextually aware action sequences. Experimental results across three widely used benchmark datasets demonstrate that our MTID achieves promising action planning performance on most metrics. The code is available at https://github.com/WiserZhou/MTID.
中文: 本文提出掩码时序插值扩散(MTID)模型,通过生成中间潜在特征并结合任务特定监督,在指导性视频中实现连贯动作序列的规划,显著提升了动作预测的准确性。
English: This paper introduces the Masked Temporal Interpolation Diffusion (MTID) model, which enhances procedure planning in instructional videos by generating intermediate latent features and incorporating task-specific supervision to produce coherent action sequences.
Authors:Blaž Rolih, Matic FuÄka, Filip Wolf, Luka Äehovin Zajc
Abstract:
Remote sensing change detection aims to localize semantic changes between images of the same location captured at different times. In the past few years, newer methods have attributed enhanced performance to the additions of new and complex components to existing architectures. Most fail to measure the performance contribution of fundamental design choices such as backbone selection, pre-training strategies, and training configurations. We claim that such fundamental design choices often improve performance even more significantly than the addition of new architectural components. Due to that, we systematically revisit the design space of change detection models and analyse the full potential of a well-optimised baseline. We identify a set of fundamental design choices that benefit both new and existing architectures. Leveraging this insight, we demonstrate that when carefully designed, even an architecturally simple model can match or surpass state-of-the-art performance on six challenging change detection datasets. Our best practices generalise beyond our architecture and also offer performance improvements when applied to related methods, indicating that the space of fundamental design choices has been underexplored. Our guidelines and architecture provide a strong foundation for future methods, emphasizing that optimizing core components is just as important as architectural novelty in advancing change detection performance. Code: https://github.com/blaz-r/BTC-change-detection
中文: 本研究证明,优化主干网络选择与训练策略等基础设计要素能显著提升变化检测性能,其效果常超越复杂架构改进,该结论在六个数据集上得到验证。
English: This study demonstrates that optimizing fundamental design choices like backbone selection and training strategies can significantly enhance change detection performance, often surpassing the benefits of complex architectural additions, as validated across six datasets.
Authors:Mingzhuo Li, Guang Li, Jiafeng Mao, Linfeng Ye, Takahiro Ogawa, Miki Haseyama
Abstract:
To alleviate the reliance of deep neural networks on large-scale datasets, dataset distillation aims to generate compact, high-quality synthetic datasets that can achieve comparable performance to the original dataset. The integration of generative models has significantly advanced this field. However, existing approaches primarily focus on aligning the distilled dataset with the original one, often overlooking task-specific information that can be critical for optimal downstream performance. In this paper, focusing on the downstream task of classification, we propose a task-specific sampling strategy for generative dataset distillation that incorporates the concept of difficulty to consider the requirements of the target task better. The final dataset is sampled from a larger image pool with a sampling distribution obtained by matching the difficulty distribution of the original dataset. A logarithmic transformation is applied as a pre-processing step to correct for distributional bias. The results of extensive experiments demonstrate the effectiveness of our method and suggest its potential for enhancing performance on other downstream tasks. The code is available at https://github.com/SumomoTaku/DiffGuideSamp.
中文: 本文提出了一种针对生成式数据集蒸馏的任务特定采样策略,通过引入难度概念并匹配原始数据集的难度分布,有效提升了分类任务的性能。
English: This paper introduces a task-specific sampling strategy for generative dataset distillation that incorporates difficulty-based selection to enhance classification performance by aligning with the original dataset's difficulty distribution.
Authors:S. Lee, C. Myers, A. Yang, T. Zhang, S. J. L. Billinge
Abstract:
Scientific advancement relies on the ability to share and reproduce results. When data analysis or calculations are carried out using software written by scientists there are special challenges around code versions, quality and code sharing. scikit-package provides a roadmap to facilitate code reuse and sharing with minimal effort through tutorials coupled with automated and centralized reusable workflows. The goal of the project is to provide pedagogical and practical tools for scientists who are not professionally trained software engineers to write more reusable and maintainable software code. Code reuse can occur at multiple levels of complexity-from turning a code block into a function within a single script, to publishing a publicly installable, fully tested, and documented software package scikit-package provides a community maintained set of tools, and a roadmap, to help scientists bring their software higher levels of reproducibility and shareability.
中文摘要:科学进步依赖于可复现成果的共享,而scikit-package通过教程与自动化工作流程,帮助非专业程序员创建可复用、可维护的代码,涵盖从简单函数到完整软件包的不同复杂度层级。
English Summary: Scientific progress depends on sharing reproducible results, and scikit-package offers tutorials and automated workflows to help non-professional programmers create reusable, maintainable code across various complexity levels.
Authors:Peilin Tao, Hainan Cui, Diantao Tu, Shuhan Shen
Abstract:
Multi-camera systems are increasingly vital in the environmental perception of autonomous vehicles and robotics. Their physical configuration offers inherent fixed relative pose constraints that benefit Structure-from-Motion (SfM). However, traditional global SfM systems struggle with robustness due to their optimization framework. We propose a novel global motion averaging framework for multi-camera systems, featuring two core components: a decoupled rotation averaging module and a hybrid translation averaging module. Our rotation averaging employs a hierarchical strategy by first estimating relative rotations within rigid camera units and then computing global rigid unit rotations. To enhance the robustness of translation averaging, we incorporate both camera-to-camera and camera-to-point constraints to initialize camera positions and 3D points with a convex distance-based objective function and refine them with an unbiased non-bilinear angle-based objective function. Experiments on large-scale datasets show that our system matches or exceeds incremental SfM accuracy while significantly improving efficiency. Our framework outperforms existing global SfM methods, establishing itself as a robust solution for real-world multi-camera SfM applications. The code is available at https://github.com/3dv-casia/MGSfM/.
中文: 我们提出了一种针对多相机系统的鲁棒全局运动平均框架,通过分层旋转估计和混合平移优化,在精度和效率上均超越了现有方法。
English: We introduce a robust global motion averaging framework for multi-camera systems, featuring hierarchical rotation estimation and hybrid translation optimization that achieves superior accuracy and efficiency compared to existing methods.
Authors:Wooseok Shin, Jisu Kang, Hyeonki Jeong, Jin Sob Kim, Sung Won Han
Abstract:
In semi-supervised semantic segmentation, existing studies have shown promising results in academic settings with controlled splits of benchmark datasets. However, the potential benefits of leveraging significantly larger sets of unlabeled images remain unexplored. In real-world scenarios, abundant unlabeled images are often available from online sources (web-scraped images) or large-scale datasets. However, these images may have different distributions from those of the target dataset, a situation known as out-of-distribution (OOD). Using these images as unlabeled data in semi-supervised learning can lead to inaccurate pseudo-labels, potentially misguiding network training. In this paper, we propose a new semi-supervised semantic segmentation framework with an open-vocabulary segmentation model (SemiOVS) to effectively utilize unlabeled OOD images. Extensive experiments on Pascal VOC and Context datasets demonstrate two key findings: (1) using additional unlabeled images improves the performance of semi-supervised learners in scenarios with few labels, and (2) using the open-vocabulary segmentation (OVS) model to pseudo-label OOD images leads to substantial performance gains. In particular, SemiOVS outperforms existing PrevMatch and SemiVL methods by +3.5 and +3.0 mIoU, respectively, on Pascal VOC with a 92-label setting, achieving state-of-the-art performance. These findings demonstrate that our approach effectively utilizes abundant unlabeled OOD images for semantic segmentation tasks. We hope this work can inspire future research and real-world applications. The code is available at https://github.com/wooseok-shin/SemiOVS
中文: 本文提出的SemiOVS框架通过开放词汇分割有效利用分布外未标记图像,在基准数据集上实现了半监督语义分割的最先进性能。
English: This paper introduces SemiOVS, a semi-supervised semantic segmentation framework that leverages open-vocabulary segmentation to effectively utilize out-of-distribution unlabeled images, achieving state-of-the-art performance on benchmark datasets.
Authors:Zedong Peng, Zeju Li, Mingzhe Gao, Qiang Xu, Chen Zhang, Jieru Zhao
Abstract:
High-Level Synthesis (HLS) plays a crucial role in modern hardware design by transforming high-level code into optimized hardware implementations. However, progress in applying machine learning (ML) to HLS optimization has been hindered by a shortage of sufficiently large and diverse datasets. To bridge this gap, we introduce ForgeHLS, a large-scale, open-source dataset explicitly designed for ML-driven HLS research. ForgeHLS comprises over 400k diverse designs generated from 846 kernels covering a broad range of application domains, consuming over 200k CPU hours during dataset construction. Each kernel includes systematically automated pragma insertions (loop unrolling, pipelining, array partitioning), combined with extensive design space exploration using Bayesian optimization. Compared to existing datasets, ForgeHLS significantly enhances scale, diversity, and design coverage. We further define and evaluate representative downstream tasks in Quality of Result (QoR) prediction and automated pragma exploration, clearly demonstrating ForgeHLS utility for developing and improving ML-based HLS optimization methodologies. The dataset and code are public at https://github.com/zedong-peng/ForgeHLS.
中文: ForgeHLS数据集通过提供超过40万个多样化设计,解决了高层次综合中机器学习研究数据不足的问题,显著提升了结果质量预测和编译指示优化的能力。
English: The ForgeHLS dataset addresses the scarcity of large-scale data for machine learning in High-Level Synthesis by providing over 400,000 diverse designs and enabling improved QoR prediction and pragma optimization.
Authors:Liangyu Wang, Huanyi Xie, Di Wang
Abstract:
Fine-tuning large language models (LLMs) remains resource-intensive due to their sheer scale. While zeroth-order (ZO) optimization provides a memory-efficient alternative by eliminating backward passes, its application to multi-hundred-billion-parameter models is constrained by GPU memory and compute throughput. The ZO2 framework addresses the memory bottleneck by offloading model parameters to CPU memory and overlapping transformer block transfer with dual forward computation on a single GPU. However, ZO2 remains limited by its single-device execution and achieves modest throughput. In this work, we present DistZO2, a high-throughput, memory-efficient framework for distributed zeroth-order fine-tuning of LLMs. DistZO2 introduces three parallel strategies: (1) Perturbation Parallelism (PertP), which parallelizes the two perturbed forward passes across devices; (2) Distributed Data Parallelism (DDP), adapted to the scalar-gradient nature of ZO training; and (3) a unified 2D Parallelism design that combines PertP and DDP. To further mitigate communication bottlenecks introduced by parameter offloading, we propose a hardware-aware communication strategy that slices parameter blocks and redistributes them across GPUs via high-speed interconnects such as NVLink. DistZO2 scales zeroth-order fine-tuning to modern multi-GPU systems, preserving ZO2's memory efficiency while substantially improving training throughput. In our experiments on OPT-175B, DistZO2 achieves a 3x speedup over ZO2 with distributed computing. DistZO2's code has been open-sourced in https://github.com/liangyuwang/zo2.
中文:DistZO2是一个分布式框架,通过引入并行策略和硬件感知通信技术,在保持内存效率的同时显著提升了大型语言模型的零阶微调速度。
English: DistZO2 is a distributed framework that enhances zeroth-order fine-tuning of large language models by introducing parallel strategies and hardware-aware communication, achieving significant speed improvements while maintaining memory efficiency.
Authors:Kureha Yamaguchi, Benjamin Etheridge, Andy Arditi
Abstract:
Reasoning models generate chain-of-thought (CoT) tokens before their final output, but how this affects their vulnerability to jailbreak attacks remains unclear. While traditional language models make refusal decisions at the prompt-response boundary, we find evidence that DeepSeek-R1-Distill-Llama-8B makes these decisions within its CoT generation. We identify a linear direction in activation space during CoT token generation that predicts whether the model will refuse or comply -- termed the "caution" direction because it corresponds to cautious reasoning patterns in the generated text. Ablating this direction from model activations increases harmful compliance, effectively jailbreaking the model. We additionally show that intervening only on CoT token activations suffices to control final outputs, and that incorporating this direction into prompt-based attacks improves success rates. Our findings suggest that the chain-of-thought itself is a promising new target for adversarial manipulation in reasoning models. Code available at https://github.com/ky295/reasoning-manipulation.
中文摘要:本研究发现推理模型的思维链标记中存在决定拒绝行为的"谨慎"方向,通过操控该方向可有效实现越狱攻击,显著提高模型的有害服从率。
English Summary: This study reveals that reasoning models' chain-of-thought tokens contain a "caution" direction in activation space that governs refusal behavior, and manipulating this direction enables effective jailbreak attacks by increasing harmful compliance.
Authors:Oscar Dowson, Robert B Parker, Russel Bent
Abstract:
We present \texttt{MathOptAI.jl}, an open-source Julia library for embedding trained machine learning predictors into a JuMP model. \texttt{MathOptAI.jl} can embed a wide variety of neural networks, decision trees, and Gaussian Processes into a larger mathematical optimization model. In addition to interfacing a range of Julia-based machine learning libraries such as \texttt{Lux.jl} and \texttt{Flux.jl}, \texttt{MathOptAI.jl} uses Julia's Python interface to provide support for PyTorch models. When the PyTorch support is combined with \texttt{MathOptAI.jl}'s gray-box formulation, the function, Jacobian, and Hessian evaluations associated with the PyTorch model are offloaded to the GPU in Python, while the rest of the nonlinear oracles are evaluated on the CPU in Julia. \MathOptAI is available at https://github.com/lanl-ansi/MathOptAI.jl under a BSD-3 license.
中文: MathOptAI.jl 是一个开源 Julia 库,可将训练好的机器学习模型嵌入到 JuMP 优化模型中,支持多种预测器,并能利用 GPU 加速 PyTorch 模型的计算。
English: MathOptAI.jl is an open-source Julia library that integrates trained machine learning models into JuMP optimization models, supporting various predictors and leveraging GPU acceleration for PyTorch models.
Authors:Asad Aali, Vasiliki Bikia, Maya Varma, Nicole Chiou, Sophie Ostmeier, Arnav Singhvi, Magdalini Paschali, Ashwin Kumar, Andrew Johnston, Karimar Amador-Martinez, Eduardo Juan Perez Guerrero, Paola Naovi Cruz Rivera, Sergios Gatidis, Christian Bluethgen, Eduardo Pontes Reis, Eddy D. Zandee van Rilland, Poonam Laxmappa Hosamani, Kevin R Keet, Minjoung Go, Evelyn Ling, David B. Larson, Curtis Langlotz, Roxana Daneshjou, Jason Hom, Sanmi Koyejo, Emily Alsentzer, Akshay S. Chaudhari
Abstract:
With the growing use of language models (LMs) in clinical environments, there is an immediate need to evaluate the accuracy and safety of LM-generated medical text. Currently, such evaluation relies solely on manual physician review. However, detecting errors in LM-generated text is challenging because 1) manual review is costly and 2) expert-composed reference outputs are often unavailable in real-world settings. While the "LM-as-judge" paradigm (a LM evaluating another LM) offers scalable evaluation, even frontier LMs can miss subtle but clinically significant errors. To address these challenges, we propose MedVAL, a novel, self-supervised, data-efficient distillation method that leverages synthetic data to train evaluator LMs to assess whether LM-generated medical outputs are factually consistent with inputs, without requiring physician labels or reference outputs. To evaluate LM performance, we introduce MedVAL-Bench, a dataset of 840 physician-annotated outputs across 6 diverse medical tasks capturing real-world challenges. Across 10 state-of-the-art LMs spanning open-source and proprietary models, MedVAL distillation significantly improves (p < 0.001) alignment with physicians across seen and unseen tasks, increasing average F1 scores from 66% to 83%. Despite strong baseline performance, MedVAL improves the best-performing proprietary LM (GPT-4o) by 8% without training on physician-labeled data, demonstrating a performance statistically non-inferior to a single human expert (p < 0.001). To support a scalable, risk-aware pathway towards clinical integration, we open-source: 1) Codebase (https://github.com/StanfordMIMI/MedVAL), 2) MedVAL-Bench (https://huggingface.co/datasets/stanfordmimi/MedVAL-Bench), 3) MedVAL-4B (https://huggingface.co/stanfordmimi/MedVAL-4B). Our benchmark provides evidence of LMs approaching expert-level ability in validating AI-generated medical text.
Chinese Summary: MedVAL提出了一种无需医生标注的自监督蒸馏方法,通过合成数据训练语言模型评估医疗文本准确性,在多项临床任务中显著提升了与专家判断的一致性。
English Summary: MedVAL introduces a self-supervised distillation method that trains language models to evaluate medical text accuracy without physician labels, significantly improving alignment with expert assessments across diverse clinical tasks.
Authors:Xiangrui Liu, Man Luo, Agneet Chatterjee, Hua Wei, Yezhou Yang
Abstract:
Hallucination is a long-standing problem that has been actively investigated in Vision-Language Models (VLMs). Existing research commonly attributes hallucinations to technical limitations or sycophancy bias, where the latter means the models tend to generate incorrect answers to align with user expectations. However, these explanations primarily focus on technical or externally driven factors, may have neglected the possibility that hallucination behaviours might mirror cognitive biases observed in human psychology. In this work, we introduce a psychological taxonomy, categorizing VLMs' hallucination behaviours, including sycophancy, logical inconsistency, and a newly identified VLMs behaviour: authority bias. To systematically analyze these behaviours, we design AIpsych, a scalable benchmark that reveals psychological tendencies in model response patterns. Leveraging this benchmark, we investigate how variations in model architecture and parameter size influence model behaviour when responding to strategically manipulated questions. Our experiments reveal that as model size increases, VLMs exhibit stronger sycophantic tendencies but reduced authority bias, suggesting increasing competence but a potential erosion of response integrity. A human subject study further validates our hypotheses and highlights key behavioural differences between VLMs and human respondents. This work suggests a new perspective for understanding hallucination in VLMs and highlights the importance of integrating psychological principles into model evaluation.The benchmark is available at https://github.com/lxrswdd/AIpsych.
中文: 本研究提出视觉语言模型的幻觉可能源于心理偏见而非仅技术局限,通过新分类法和AIpsych基准分析奉承与权威偏见等行为,发现模型越大奉承倾向越强但权威偏见减弱。
English: This study proposes that hallucinations in Vision-Language Models (VLMs) may stem from psychological biases rather than just technical limitations, introducing a new taxonomy and benchmark called AIpsych to analyze behaviors like sycophancy and authority bias, revealing that larger models show increased sycophancy but reduced authority bias.
Authors:Peisong Wang, Ruotian Ma, Bang Zhang, Xingyu Chen, Zhiwei He, Kang Luo, Qingsong Lv, Qingxuan Jiang, Zheng Xie, Shanyi Wang, Yuan Li, Fanghua Ye, Jian Li, Yifan Yang, Zhaopeng Tu, Xiaolong Li
Abstract:
Large language models (LLMs) excel at logical and algorithmic reasoning, yet their emotional intelligence (EQ) still lags far behind their cognitive prowess. While reinforcement learning from verifiable rewards (RLVR) has advanced in other domains, its application to dialogue-especially for emotional intelligence-remains underexplored. In this work, we introduce RLVER, the first end-to-end reinforcement learning framework that leverages verifiable emotion rewards from simulated users to cultivate higher-order empathetic abilities in LLMs. Within this framework, self-consistent affective simulated users engage in dialogue rollouts and produce deterministic emotion scores during conversations, serving as reward signals to guide the LLM's learning. Fine-tuning publicly available Qwen2.5-7B-Instruct model with PPO boosts its Sentient-Benchmark score from 13.3 to 79.2 while largely preserving mathematical and coding competence. Extensive experiments reveal that: (i) RLVER consistently improves multiple dialogue capabilities; (ii) Thinking and non-thinking models show distinct trends--thinking models excel in empathy and insight, while non-thinking models favor action; (iii) GRPO often yields stable gains, while PPO can push certain capabilities to a higher ceiling; (iv) More challenging environments are not always better-moderate ones can yield stronger outcomes. Our results show that RLVER is a practical route toward emotionally intelligent and broadly capable language agents.
中文: 本文提出RLVER框架,通过模拟用户的可验证情感奖励来增强大语言模型的共情能力,在保持逻辑推理能力的同时将情感智能评分从13.3提升至79.2。
English: This paper introduces RLVER, the first reinforcement learning framework using verifiable emotion rewards from simulated users to significantly enhance empathetic abilities in large language models while preserving their logical reasoning capabilities.
Authors:Sergii Kavun
Abstract:
This paper introduces KZImputer, a novel adaptive imputation method for univariate time series designed for short to medium-sized missed points (gaps) (1-5 points and beyond) with tailored strategies for segments at the start, middle, or end of the series. KZImputer employs a hybrid strategy to handle various missing data scenarios. Its core mechanism differentiates between gaps at the beginning, middle, or end of the series, applying tailored techniques at each position to optimize imputation accuracy. The method leverages linear interpolation and localized statistical measures, adapting to the characteristics of the surrounding data and the gap size. The performance of KZImputer has been systematically evaluated against established imputation techniques, demonstrating its potential to enhance data quality for subsequent time series analysis. This paper describes the KZImputer methodology in detail and discusses its effectiveness in improving the integrity of time series data. Empirical analysis demonstrates that KZImputer achieves particularly strong performance for datasets with high missingness rates (around 50% or more), maintaining stable and competitive results across statistical and signal-reconstruction metrics. The method proves especially effective in high-sparsity regimes, where traditional approaches typically experience accuracy degradation.
中文: KZImputer是一种自适应单变量时间序列填补方法,针对不同位置的缺失段采用定制策略,尤其在高度缺失情况下表现卓越,优于传统方法。
English: KZImputer is an adaptive univariate time series imputation method that uses tailored strategies for gaps at different positions and excels particularly in high-missingness scenarios, outperforming traditional techniques.
Authors:Ana Vasilcoiu, Ivona Najdenkoska, Zeno Geradts, Marcel Worring
Abstract:
The rapid advancement of diffusion-based image generators has made it increasingly difficult to distinguish generated from real images. This can erode trust in digital media, making it critical to develop generalizable detectors for generated images. Recent methods leverage diffusion denoising cues, but mainly focus on single-step reconstruction errors, ignoring the inherent sequential nature of the denoising process. In this work, we propose LATTE - Latent Trajectory Embedding - a novel approach that models the evolution of latent embeddings across several denoising timesteps. By modeling the trajectory of such embeddings rather than single-step errors, LATTE captures subtle, discriminative patterns that distinguish real from generated images. Each latent is refined by employing our latent-visual feature refinement module and aggregated into a unified representation. Afterwards, it is fused with the visual features and finally passed into a lightweight classifier. Our experiments demonstrate that LATTE surpasses the baselines on several established benchmarks, such as GenImage and DiffusionFake. Moreover, it demonstrates strong performance in cross-generator and cross-datasets settings, highlighting the potential of using the trajectory of latent embeddings for generated image detection. The code is available on the following link: https://github.com/AnaMVasilcoiu/LATTE-Diffusion-Detector.
Chinese: LATTE提出了一种通过建模多步去噪过程中潜在嵌入轨迹的新方法,用于检测AI生成图像,在跨生成器和跨数据集的场景中表现出卓越性能。
English: LATTE introduces a novel method for detecting AI-generated images by modeling the trajectory of latent embeddings across multiple denoising steps, achieving superior performance in cross-generator and cross-dataset scenarios.
Authors:Ana Vasilcoiu, Ivona Najdenkoska, Zeno Geradts, Marcel Worring
Abstract:
The rapid advancement of diffusion-based image generators has made it increasingly difficult to distinguish generated from real images. This erodes trust in digital media, making it critical to develop generated image detectors that remain reliable across different generators. While recent approaches leverage diffusion denoising cues, they typically rely on single-step reconstruction errors and overlook the sequential nature of the denoising process. In this work, we propose LATTE - LATent Trajectory Embedding - a novel approach that models the evolution of latent embeddings across multiple denoising steps. Instead of treating each denoising step in isolation, LATTE captures the trajectory of these representations, revealing subtle and discriminative patterns that distinguish real from generated images. Experiments on several benchmarks, such as GenImage, Chameleon, and Diffusion Forensics, show that LATTE achieves superior performance, especially in challenging cross-generator and cross-dataset scenarios, highlighting the potential of latent trajectory modeling. The code is available on the following link: https://github.com/AnaMVasilcoiu/LATTE-Diffusion-Detector.
Chinese: LATTE提出了一种通过建模多步去噪过程中潜在嵌入轨迹的新方法,用于检测AI生成图像,在跨生成器和跨数据集的场景中表现出卓越性能。
English: LATTE introduces a novel method for detecting AI-generated images by modeling the trajectory of latent embeddings across multiple denoising steps, achieving superior performance in cross-generator and cross-dataset scenarios.
Authors:Yizhou Wang, Lingzhi Zhang, Yue Bai, Mang Tik Chiu, Zhengmian Hu, Mingyuan Zhang, Qihua Dong, Yu Yin, Sohrab Amirghodsi, Yun Fu
Abstract:
Next token prediction paradigm has been prevailing for autoregressive models in the era of LLMs. The current default sampling choice for popular LLMs is temperature scaling together with nucleus sampling to balance diversity and coherence. Nevertheless, such approach leads to inferior performance in various NLP tasks when the model is not certain about testing questions. To this end, we propose a brand new training-free decoding strategy, dubbed as Cautious Next Token Prediction (CNTP). In the decoding process, if the model has comparatively high prediction entropy at a certain step, we sample multiple trials starting from the step independently and stop when encountering any punctuation. Then we select the trial with the lowest perplexity score viewed as the most probable and reliable trial path given the model's capacity. The trial number is negatively correlated with the prediction confidence, i.e., the less confident the model is, the more trials it should sample. This is consistent with human beings' behaviour: when feeling uncertain or unconfident, one tends to think more creatively, exploring multiple thinking paths, to cautiously select the path one feels most confident about. Extensive experiments on both LLMs and MLLMs show that our proposed CNTP approach outperforms existing standard decoding strategies consistently by a clear margin. Moreover, the integration of CNTP with self consistency can further improve over vanilla self consistency. We believe our proposed CNTP has the potential to become one of the default choices for LLM decoding. Code is available at https://github.com/wyzjack/CNTP.
中文摘要:本文提出谨慎下一词预测(CNTP)这一无需训练的解码策略,当模型预测不确定性较高时并行采样多个候选路径,并依据困惑度选择最优路径,在各类大语言模型任务中显著优于现有标准解码方法。
English Summary: The paper introduces Cautious Next Token Prediction (CNTP), a training-free decoding strategy that samples multiple token paths when model uncertainty is high and selects the most reliable one based on perplexity, significantly outperforming standard decoding methods across various language models.
Authors:Zipeng Qiu
Abstract:
Open-domain table question answering traditionally relies on a two-stage pipeline: static table retrieval followed by a closed-domain answer. In contrast, we propose an end-to-end agentic framework that embeds multi-turn tool calls-using a BM25+-based search API and a SQLite SQL executor-directly into a large language model. To further adapt a compact 4B-parameter model, we introduce a two-stage fine-tuning process: supervised cold-start on easy questions, then Async GRPO reinforcement learning on harder cases with LoRA adapters and a rollout buffer. This unified approach enables the model to jointly retrieve, reason, and execute queries, yielding a dramatic accuracy improvement from single-digit zero-shot performance to over 0.86 exact match on a held-out test set. Our results underscore the effectiveness of integrating structured tool calls with targeted RL fine-tuning for scalable, accurate table QA. The code is available at https://github.com/TabibitoQZP/OpenTableR1.
中文: 该研究提出了一种端到端的智能框架,将结构化工具调用与两阶段微调过程结合到大型语言模型中,使表格问答准确率从接近零显著提升至超过0.86的精确匹配度。
English: The study introduces an end-to-end agentic framework that integrates structured tool calls and a two-stage fine-tuning process into a large language model, significantly improving table question answering accuracy from near-zero to over 0.86 exact match.
Authors:Rongxin Ouyang, Chang Chu, Zhikuang Xin, Xiangyao Ma
Abstract:
Language barriers in scientific documents hinder the diffusion and development of science and technologies. However, prior efforts in translating such documents largely overlooked the information in layouts. To bridge the gap, we introduce PDFMathTranslate, the world's first open-source software for translating scientific documents while preserving layouts. Leveraging the most recent advances in large language models and precise layout detection, we contribute to the community with key improvements in precision, flexibility, and efficiency. The work has been open-sourced at https://github.com/byaidu/pdfmathtranslate with more than 222k downloads.
中文: PDFMathTranslate 是首个开源软件,能在翻译科学文献时保持版面布局,弥补了以往忽视布局信息的问题,并利用先进语言模型和布局检测技术提升了精确性、灵活性和效率。
English: PDFMathTranslate is the first open-source software that translates scientific documents while preserving their layouts, addressing previous neglect of layout information and enhancing precision, flexibility, and efficiency through advanced language models and layout detection.
Authors:Haiqing Li, Yuzhi Guo, Feng Jiang, Thao M. Dang, Hehuan Ma, Qifeng Zhou, Jean Gao, Junzhou Huang
Abstract:
Early-stage scoliosis is often difficult to detect, particularly in adolescents, where delayed diagnosis can lead to serious health issues. Traditional X-ray-based methods carry radiation risks and rely heavily on clinical expertise, limiting their use in large-scale screenings. To overcome these challenges, we propose a Text-Guided Multi-Instance Learning Network (TG-MILNet) for non-invasive scoliosis detection using gait videos. To handle temporal misalignment in gait sequences, we employ Dynamic Time Warping (DTW) clustering to segment videos into key gait phases. To focus on the most relevant diagnostic features, we introduce an Inter-Bag Temporal Attention (IBTA) mechanism that highlights critical gait phases. Recognizing the difficulty in identifying borderline cases, we design a Boundary-Aware Model (BAM) to improve sensitivity to subtle spinal deviations. Additionally, we incorporate textual guidance from domain experts and large language models (LLM) to enhance feature representation and improve model interpretability. Experiments on the large-scale Scoliosis1K gait dataset show that TG-MILNet achieves state-of-the-art performance, particularly excelling in handling class imbalance and accurately detecting challenging borderline cases. The code is available at https://github.com/lhqqq/TG-MILNet
中文: 本文提出TG-MILNet,一种利用步态视频和文本引导的非侵入性脊柱侧弯检测方法,通过处理时间错位和类别不平衡问题,显著提升了对临界病例的检测灵敏度,实现了最优性能。
English: This paper introduces TG-MILNet, a non-invasive method using gait videos and textual guidance to detect early-stage scoliosis, which achieves state-of-the-art performance by addressing temporal misalignment and class imbalance while improving sensitivity to borderline cases.
Authors:Haiqing Li, Yuzhi Guo, Feng Jiang, Thao M. Dang, Hehuan Ma, Qifeng Zhou, Jean Gao, Junzhou Huang
Abstract:
Early-stage scoliosis is often difficult to detect, particularly in adolescents, where delayed diagnosis can lead to serious health issues. Traditional X-ray-based methods carry radiation risks and rely heavily on clinical expertise, limiting their use in large-scale screenings. To overcome these challenges, we propose a Text-Guided Multi-Instance Learning Network (TG-MILNet) for non-invasive scoliosis detection using gait videos. To handle temporal misalignment in gait sequences, we employ Dynamic Time Warping (DTW) clustering to segment videos into key gait phases. To focus on the most relevant diagnostic features, we introduce an Inter-Bag Temporal Attention (IBTA) mechanism that highlights critical gait phases. Recognizing the difficulty in identifying borderline cases, we design a Boundary-Aware Model (BAM) to improve sensitivity to subtle spinal deviations. Additionally, we incorporate textual guidance from domain experts and large language models (LLM) to enhance feature representation and improve model interpretability. Experiments on the large-scale Scoliosis1K gait dataset show that TG-MILNet achieves state-of-the-art performance, particularly excelling in handling class imbalance and accurately detecting challenging borderline cases. The code is available at https://github.com/lhqqq/TG-MILNet
中文: 本文提出TG-MILNet,一种利用步态视频和文本引导的非侵入性脊柱侧弯检测方法,通过处理时间错位和类别不平衡问题,显著提升了对临界病例的检测灵敏度,实现了最优性能。
English: This paper introduces TG-MILNet, a non-invasive method using gait videos and textual guidance to detect early-stage scoliosis, which achieves state-of-the-art performance by addressing temporal misalignment and class imbalance while improving sensitivity to borderline cases.
Authors:Huihui Xu, Yuanpeng Nie, Hualiang Wang, Ying Chen, Wei Li, Junzhi Ning, Lihao Liu, Hongqiu Wang, Lei Zhu, Jiyao Liu, Xiaomeng Li, Junjun He
Abstract:
Medical Image Grounding (MIG), which involves localizing specific regions in medical images based on textual descriptions, requires models to not only perceive regions but also deduce spatial relationships of these regions. Existing Vision-Language Models (VLMs) for MIG often rely on Supervised Fine-Tuning (SFT) with large amounts of Chain-of-Thought (CoT) reasoning annotations, which are expensive and time-consuming to acquire. Recently, DeepSeek-R1 demonstrated that Large Language Models (LLMs) can acquire reasoning abilities through Group Relative Policy Optimization (GRPO) without requiring CoT annotations. In this paper, we adapt the GRPO reinforcement learning framework to VLMs for Medical Image Grounding. We propose the Spatial-Semantic Rewarded Group Relative Policy Optimization to train the model without CoT reasoning annotations. Specifically, we introduce Spatial-Semantic Rewards, which combine spatial accuracy reward and semantic consistency reward to provide nuanced feedback for both spatially positive and negative completions. Additionally, we propose to use the Chain-of-Box template, which integrates visual information of referring bounding boxes into the reasoning process, enabling the model to explicitly reason about spatial regions during intermediate steps. Experiments on three datasets MS-CXR, ChestX-ray8, and M3D-RefSeg demonstrate that our method achieves state-of-the-art performance in Medical Image Grounding. Ablation studies further validate the effectiveness of each component in our approach. Code, checkpoints, and datasets are available at https://github.com/bio-mlhui/MedGround-R1
中文: 本文提出一种用于医学图像定位的强化学习方法,通过空间语义奖励和链式边界框推理模板避免了对昂贵思维链标注的依赖,在多个数据集上实现了最先进的性能。
English: This paper introduces a reinforcement learning method for Medical Image Grounding that eliminates the need for costly Chain-of-Thought annotations by using Spatial-Semantic Rewards and Chain-of-Box reasoning templates, achieving state-of-the-art performance across multiple datasets.
Authors:Seshu Tirupathi, Dhaval Salwala, Elizabeth Daly, Inge Vejsbjerg
Abstract:
As Large Language Models (LLMs) continue to be increasingly applied across various domains, their widespread adoption necessitates rigorous monitoring to prevent unintended negative consequences and ensure robustness. Furthermore, LLMs must be designed to align with human values, like preventing harmful content and ensuring responsible usage. The current automated systems and solutions for monitoring LLMs in production are primarily centered on LLM-specific concerns like hallucination etc, with little consideration given to the requirements of specific use-cases and user preferences. This paper introduces GAF-Guard, a novel agentic framework for LLM governance that places the user, the use-case, and the model itself at the center. The framework is designed to detect and monitor risks associated with the deployment of LLM based applications. The approach models autonomous agents that identify risks, activate risk detection tools, within specific use-cases and facilitate continuous monitoring and reporting to enhance AI safety, and user expectations. The code is available at https://github.com/IBM/risk-atlas-nexus-demos/tree/main/gaf-guard.
中文: GAF-Guard是一种创新的智能体框架,通过以用户需求、特定用例和模型风险为核心来加强大型语言模型的治理,确保人工智能的安全性和负责任的应用。
English: GAF-Guard is a novel agentic framework designed to enhance LLM governance by focusing on user needs, specific use-cases, and model risks to ensure AI safety and responsible deployment.
Authors:Wentao Tan, Qiong Cao, Yibing Zhan, Chao Xue, Changxing Ding
Abstract:
Achieving human-like reasoning capabilities in Multimodal Large Language Models (MLLMs) has long been a goal. Current methods primarily focus on synthesizing positive rationales, typically relying on manual annotations or complex systems. Moreover, they often overlook negative reasoning, which limits the model's generalization ability and robustness in multimodal inference. To address this gap, we propose a novel framework: \textbf{S}elf-Aligning \textbf{M}ultimodal Reasoning with \textbf{A}nswer-O\textbf{r}iented Chain-of-\textbf{T}hought (SMART). SMART employs an answer-oriented chain-of-thought (AoT) prompt to automatically construct high-quality data. Drawing inspiration from human proof-based strategies, AoT leverages both correct and incorrect answers to extract key visual information that links questions and answers. When provided with correct answers, the model produces strong positive rationales. Conversely, when correct answers are replaced with incorrect alternatives, the model generates an erroneous yet compelling reasoning path, serving as a form of discriminative negative rationale. Models trained with AoT-generated data outperform those trained on manually annotated datasets, demonstrating superior reasoning capabilities. Consequently, SMART establishes an iterative generation-optimization method that continually enhances the model's reasoning skills. Experiments indicate that the SMART framework significantly improves various MLLMs, regardless of model architecture, parameter size, or pre-training dataset. The code is available at https://github.com/WentaoTan/SMART.
中文摘要:SMART框架通过答案导向的思维链方法,利用正确和错误答案自动生成正负推理依据,无需人工标注即可显著提升多模态模型的推理能力。
English Summary: The SMART framework introduces an answer-oriented chain-of-thought approach that automatically generates both positive and negative rationales using correct and incorrect answers, significantly enhancing multimodal reasoning capabilities without manual annotations.
Authors:Zhiyi Hou, Enhui Ma, Fang Li, Zhiyi Lai, Kalok Ho, Zhanqian Wu, Lijun Zhou, Long Chen, Chitian Sun, Haiyang Sun, Bing Wang, Guang Chen, Hangjun Ye, Kaicheng Yu
Abstract:
Autonomous driving has seen significant progress, driven by extensive real-world data. However, in long-tail scenarios, accurately predicting the safety of the ego vehicle's future motion remains a major challenge due to uncertainties in dynamic environments and limitations in data coverage. In this work, we aim to explore whether it is possible to enhance the motion risk prediction capabilities of Vision-Language Models (VLM) by synthesizing high-risk motion data. Specifically, we introduce a Bird's-Eye View (BEV) based motion simulation method to model risks from three aspects: the ego-vehicle, other vehicles, and the environment. This allows us to synthesize plug-and-play, high-risk motion data suitable for VLM training, which we call DriveMRP-10K. Furthermore, we design a VLM-agnostic motion risk estimation framework, named DriveMRP-Agent. This framework incorporates a novel information injection strategy for global context, ego-vehicle perspective, and trajectory projection, enabling VLMs to effectively reason about the spatial relationships between motion waypoints and the environment. Extensive experiments demonstrate that by fine-tuning with DriveMRP-10K, our DriveMRP-Agent framework can significantly improve the motion risk prediction performance of multiple VLM baselines, with the accident recognition accuracy soaring from 27.13% to 88.03%. Moreover, when tested via zero-shot evaluation on an in-house real-world high-risk motion dataset, DriveMRP-Agent achieves a significant performance leap, boosting the accuracy from base_model's 29.42% to 68.50%, which showcases the strong generalization capabilities of our method in real-world scenarios.
中文: 本研究提出DriveMRP-10K合成高风险运动数据集和DriveMRP-Agent框架,通过增强视觉语言模型的运动风险预测能力,显著提升了事故识别准确率,并在真实场景中展现出强大的泛化性能。
English: This research introduces DriveMRP-10K, a synthesized high-risk motion dataset, and the DriveMRP-Agent framework to enhance Vision-Language Models' motion risk prediction, significantly improving accident recognition accuracy and demonstrating strong generalization in real-world scenarios.
Authors:Yuqi Li, Chuanguang Yang, Hansheng Zeng, Zeyu Dong, Zhulin An, Yongjun Xu, Yingli Tian, Hao Wu
Abstract:
Spatiotemporal forecasting tasks, such as traffic flow, combustion dynamics, and weather forecasting, often require complex models that suffer from low training efficiency and high memory consumption. This paper proposes a lightweight framework, Spectral Decoupled Knowledge Distillation (termed SDKD), which transfers the multi-scale spatiotemporal representations from a complex teacher model to a more efficient lightweight student network. The teacher model follows an encoder-latent evolution-decoder architecture, where its latent evolution module decouples high-frequency details and low-frequency trends using convolution and Transformer (global low-frequency modeler). However, the multi-layer convolution and deconvolution structures result in slow training and high memory usage. To address these issues, we propose a frequency-aligned knowledge distillation strategy, which extracts multi-scale spectral features from the teacher's latent space, including both high and low frequency components, to guide the lightweight student model in capturing both local fine-grained variations and global evolution patterns. Experimental results show that SDKD significantly improves performance, achieving reductions of up to 81.3% in MSE and in MAE 52.3% on the Navier-Stokes equation dataset. The framework effectively captures both high-frequency variations and long-term trends while reducing computational complexity. Our codes are available at https://github.com/itsnotacie/SDKD
中文摘要:本文提出了一种名为谱解耦知识蒸馏(SDKD)的轻量级框架,通过将复杂教师模型的多尺度时空知识迁移到高效学生网络中,在显著提升预测精度的同时降低了计算复杂度。
English Summary: This paper introduces a lightweight framework called Spectral Decoupled Knowledge Distillation (SDKD) that transfers multi-scale spatiotemporal knowledge from a complex teacher model to an efficient student network, significantly improving forecasting accuracy while reducing computational complexity.
Authors:Fardin Saad, Pradeep K. Murukannaiah, Munindar P. Singh
Abstract:
The Theory of Mind (ToM) refers to an agent's capacity to infer the mental states of other agents. ToM is essential for effective collaboration. To assess ToM in a dynamic, goal-oriented, and collaborative environment, we introduce a novel task, Instruction Inference, in which an agent assists a principal in reaching a goal by interpreting indirect or ambiguous instructions. We present Tomcat, an LLM-based agent, designed to exhibit ToM reasoning in interpreting and responding to the principal's instructions. We implement two variants of Tomcat. One, dubbed Fs-CoT, is based on a small number of examples (i.e., few-shot or Fs) demonstrating the requisite structured reasoning (i.e., chain-of-thought or CoT). One, dubbed CP, relies on commonsense knowledge and information about the problem (i.e., commonsense prompt or CP). We realized both variants of Tomcat on three leading large language models (LLMs), namely, GPT-4o, DeepSeek-R1, and Gemma-3-27B. To evaluate the effectiveness of Tomcat, we conducted a study with 52 human participants in which we provided participants with the same information as the CP variant of Tomcat. We computed intent accuracy, action optimality, and planning optimality to measure the ToM capabilities of Tomcat and our study participants. We found that Tomcat with Fs-CoT, particularly with GPT-4o and DeepSeek-R1, achieves performance comparable to the human participants, underscoring its ToM potential for human-AI collaboration.
中文摘要:本研究提出了基于大语言模型的Tomcat智能体,通过指令推理任务展示心理理论推理能力,其中Fs-CoT变体在协作场景中实现了与人类相当的表现水平。
English Summary: This study introduces Tomcat, an LLM-based agent designed to demonstrate Theory of Mind reasoning through instruction inference tasks, with its Fs-CoT variant achieving human-comparable performance in collaborative scenarios.
Authors:Jianping Zhao, Qiong Zhou, Tian Wang, Yusi Fan, Qian Yang, Li Jiao, Chang Liu, Zhehao Guo, Qi Lu, Fengfeng Zhou, Ruochi Zhang
Abstract:
MolProphecy is a human-in-the-loop (HITL) multi-modal framework designed to integrate chemists' domain knowledge into molecular property prediction models. While molecular pre-trained models have enabled significant gains in predictive accuracy, they often fail to capture the tacit, interpretive reasoning central to expert-driven molecular design. To address this, MolProphecy employs ChatGPT as a virtual chemist to simulate expert-level reasoning and decision-making. The generated chemist knowledge is embedded by the large language model (LLM) as a dedicated knowledge representation and then fused with graph-based molecular features through a gated cross-attention mechanism, enabling joint reasoning over human-derived and structural features. Evaluated on four benchmark datasets (FreeSolv, BACE, SIDER, and ClinTox), MolProphecy outperforms state-of-the-art (SOTA) models, achieving a 15.0 percent reduction in RMSE on FreeSolv and a 5.39 percent improvement in AUROC on BACE. Analysis reveals that chemist knowledge and structural features provide complementary contributions, improving both accuracy and interpretability. MolProphecy offers a practical and generalizable approach for collaborative drug discovery, with the flexibility to incorporate real chemist input in place of the current simulated proxy--without the need for model retraining. The implementation is publicly available at https://github.com/zhangruochi/MolProphecy.
中文: MolProphecy是一个结合模拟化学家知识与分子结构特征的人机协同框架,通过在基准数据集上的卓越表现,显著提升了分子属性预测的准确性和可解释性。
English: MolProphecy is a human-in-the-loop framework that integrates simulated chemist knowledge via ChatGPT with molecular structural features, achieving superior performance on benchmark datasets by enhancing both prediction accuracy and interpretability.
Authors:Mohammad Hashemi, Hossein Amiri, Andreas Zufle
Abstract:
With the rapid growth and continual updates of geospatial data from diverse sources, geospatial foundation model pre-training for urban representation learning has emerged as a key research direction for advancing data-driven urban planning. Spatial structure is fundamental to effective geospatial intelligence systems; however, existing foundation models often lack the flexibility to reason about places, context-rich regions spanning multiple spatial granularities that may consist of many spatially and semantically related points of interest. To address this gap, we propose PlaceFM, a geospatial foundation model that captures place representations through a training-free, clustering-based approach. PlaceFM summarizes the entire point of interest graph constructed from U.S. Foursquare data, producing general-purpose region embeddings while automatically identifying places of interest. These embeddings can be directly integrated into geolocation data pipelines to support a variety of urban downstream tasks. Without the need for costly pre-training, PlaceFM provides a scalable and efficient solution for multi-granular geospatial analysis. Extensive experiments on two real-world prediction tasks, ZIP code-level population density and housing prices, demonstrate that PlaceFM not only outperforms most state-of-the-art graph-based geospatial foundation models but also achieves up to a 100x speedup in generating region-level representations on large-scale POI graphs. The implementation is available at https://github.com/mohammadhashemii/PlaceFM.
中文: PlaceFM是一种无需预训练的地理空间基础模型,通过基于聚类的无训练方法从兴趣点数据中生成可扩展的区域嵌入,在多项城市预测任务中优于现有模型,并实现了高达百倍的效率提升。
English: PlaceFM is a geospatial foundation model that uses a training-free clustering approach to generate scalable region embeddings from point-of-interest data, outperforming existing models in urban prediction tasks while achieving significant efficiency gains.
Authors:Geonwoo Cho, Jaegyun Im, Doyoon Kim, Sundong Kim
Abstract:
Designing effective task sequences is crucial for curriculum reinforcement learning (CRL), where agents must gradually acquire skills by training on intermediate tasks. A key challenge in CRL is to identify tasks that promote exploration, yet are similar enough to support effective transfer. While recent approach suggests comparing tasks via their Structural Causal Models (SCMs), the method requires access to ground-truth causal structures, an unrealistic assumption in most RL settings. In this work, we propose Causal-Paced Deep Reinforcement Learning (CP-DRL), a curriculum learning framework aware of SCM differences between tasks based on interaction data approximation. This signal captures task novelty, which we combine with the agent's learnability, measured by reward gain, to form a unified objective. Empirically, CP-DRL outperforms existing curriculum methods on the Point Mass benchmark, achieving faster convergence and higher returns. CP-DRL demonstrates reduced variance with comparable final returns in the Bipedal Walker-Trivial setting, and achieves the highest average performance in the Infeasible variant. These results indicate that leveraging causal relationships between tasks can improve the structure-awareness and sample efficiency of curriculum reinforcement learning. We provide the full implementation of CP-DRL to facilitate the reproduction of our main results at https://github.com/Cho-Geonwoo/CP-DRL.
中文: 本文提出因果节奏深度强化学习(CP-DRL),该课程学习框架通过任务间近似因果差异增强探索与迁移能力,在强化学习基准测试中实现了更优的性能和样本效率。
English: This paper introduces Causal-Paced Deep Reinforcement Learning (CP-DRL), a curriculum learning framework that leverages approximated causal differences between tasks to enhance exploration and transfer, achieving superior performance and sample efficiency in reinforcement learning benchmarks.
Authors:Aoming Liu, Reuben Tan, Boqing Gong, Bryan A. Plummer
Abstract:
Prior Vision Language Model (VLM) token pruning reduces computation by eliminating attention and feed-forward operations for pruned tokens while maintaining all operations for critical tokens. However, this binary approach conflates token/operation redundancy - critical operations may be removed along with discarded tokens, while preserved tokens retain all potentially redundant operations. To surgically eliminate redundant operations while preserving critical ones, we propose Greedily Sorted Operation Pruning (GSOP), a data-driven method that directly prunes operations rather than tokens. GSOP first decomposes a VLM decoder's computations into atomic operations along three dimensions: token groups, layer positions, and computation modules. GSOP determines the pruning order of operations through greedy sorting: GSOP iteratively selects the redundant operation that incurs minimal performance drop considering previously pruned operations. Different computational budgets can be accommodated without re-searching by simply pruning operations according to this order until the desired budget is met. GSOP enhances sorting efficiency through: a) leveraging historical operation rankings to avoid redundant evaluations; b) excluding the ``free-to-prune" and ``danger-to-prune" operations from sorting. GSOP achieves compelling efficiency-performance tradeoffs, reducing computation by 70% with only 4% performance loss while maintaining up to 18% higher performance than state-of-the-art methods when transferred across diverse VLMs and tasks. Real GPU efficiency evaluations confirm its practical value. The code is in https://github.com/zxcvfd13502/GSOP.
Chinese: GSOP是一种数据驱动方法,通过贪婪排序对视觉语言模型中的冗余操作进行精准剪枝,在减少70%计算量的同时仅造成4%性能损失,并在跨模型和任务中保持比现有最优方法高18%的性能表现。
English: GSOP is a data-driven method that surgically prunes redundant operations in Vision Language Models by greedily sorting operations to minimize performance loss, achieving up to 70% computation reduction with only 4% performance degradation while outperforming state-of-the-art methods.
Authors:Vineet Kumar Rakesh, Soumya Mazumdar, Research Pratim Maity, Sarbajit Pal, Amitabha Das, Tapas Samanta
Abstract:
Talking Head Generation (THG) has emerged as a transformative technology in computer vision, enabling the synthesis of realistic human faces synchronized with image, audio, text, or video inputs. This paper provides a comprehensive review of methodologies and frameworks for talking head generation, categorizing approaches into 2D--based, 3D--based, Neural Radiance Fields (NeRF)--based, diffusion--based, parameter-driven techniques and many other techniques. It evaluates algorithms, datasets, and evaluation metrics while highlighting advancements in perceptual realism and technical efficiency critical for applications such as digital avatars, video dubbing, ultra-low bitrate video conferencing, and online education. The study identifies challenges such as reliance on pre--trained models, extreme pose handling, multilingual synthesis, and temporal consistency. Future directions include modular architectures, multilingual datasets, hybrid models blending pre--trained and task-specific layers, and innovative loss functions. By synthesizing existing research and exploring emerging trends, this paper aims to provide actionable insights for researchers and practitioners in the field of talking head generation. For the complete survey, code, and curated resource list, visit our GitHub repository: https://github.com/VineetKumarRakesh/thg.
中文: 本文全面综述了说话头生成技术,评估了其应用、挑战及未来方向,为研究人员和从业者提供了实用指导。
English: This paper offers a comprehensive review of talking head generation methods, evaluating their applications, challenges, and future directions to guide researchers and practitioners in the field.
Authors:Lindong Xie, Genghui Li, Zhenkun Wang, Edward Chung, Maoguo Gong
Abstract:
Surrogate-assisted evolutionary algorithms (SAEAs) are a key tool for addressing costly optimization tasks, with their efficiency being heavily dependent on the selection of surrogate models and infill sampling criteria. However, designing an effective dynamic selection strategy for SAEAs is labor-intensive and requires substantial domain knowledge. To address this challenge, this paper proposes LLM-SAEA, a novel approach that integrates large language models (LLMs) to configure both surrogate models and infill sampling criteria online. Specifically, LLM-SAEA develops a collaboration-of-experts framework, where one LLM serves as a scoring expert (LLM-SE), assigning scores to surrogate models and infill sampling criteria based on their optimization performance, while another LLM acts as a decision expert (LLM-DE), selecting the appropriate configurations by analyzing their scores along with the current optimization state. Experimental results demonstrate that LLM-SAEA outperforms several state-of-the-art algorithms across standard test cases. The source code is publicly available at https://github.com/ForrestXie9/LLM-SAEA.
中文: 本文提出LLM-SAEA,一种新颖的代理辅助进化算法,通过集成大语言模型在线动态选择代理模型和填充采样标准,实验证明其在标准测试案例中优于多种先进算法。
English: This paper introduces LLM-SAEA, a novel surrogate-assisted evolutionary algorithm that leverages large language models to dynamically select surrogate models and infill sampling criteria online, demonstrating superior performance over existing methods in experimental tests.
Authors:Chi Zhang, Yu Dong, Yang Wang, Yuetong Han, Guihua Shan, Bixia Tang
Abstract:
Circular genome visualizations are essential for exploring structural variants and gene regulation. However, existing tools often require complex scripting and manual configuration, making the process time-consuming, error-prone, and difficult to learn. To address these challenges, we introduce AuraGenome, an LLM-powered framework for rapid, reusable, and scalable generation of multi-layered circular genome visualizations. AuraGenome combines a semantic-driven multi-agent workflow with an interactive visual analytics system. The workflow employs seven specialized LLM-driven agents, each assigned distinct roles such as intent recognition, layout planning, and code generation, to transform raw genomic data into tailored visualizations. The system supports multiple coordinated views tailored for genomic data, offering ring, radial, and chord-based layouts to represent multi-layered circular genome visualizations. In addition to enabling interactions and configuration reuse, the system supports real-time refinement and high-quality report export. We validate its effectiveness through two case studies and a comprehensive user study. AuraGenome is available at: https://github.com/Darius18/AuraGenome.
中文摘要:AuraGenome是一个基于大语言模型的框架,通过多智能体工作流和交互式分析系统自动生成多层环形基因组可视化,无需复杂编程即可快速创建可定制的高质量基因组图谱。
English Summary: AuraGenome is an LLM-powered framework that automates circular genome visualization through a multi-agent workflow and interactive system, enabling rapid generation of customizable multi-layered genomic diagrams without complex scripting.
Authors:Yuqi Wu, Wenzhao Zheng, Jie Zhou, Jiwen Lu
Abstract:
Dense 3D scene reconstruction from an ordered sequence or unordered image collections is a critical step when bringing research in computer vision into practical scenarios. Following the paradigm introduced by DUSt3R, which unifies an image pair densely into a shared coordinate system, subsequent methods maintain an implicit memory to achieve dense 3D reconstruction from more images. However, such implicit memory is limited in capacity and may suffer from information loss of earlier frames. We propose Point3R, an online framework targeting dense streaming 3D reconstruction. To be specific, we maintain an explicit spatial pointer memory directly associated with the 3D structure of the current scene. Each pointer in this memory is assigned a specific 3D position and aggregates scene information nearby in the global coordinate system into a changing spatial feature. Information extracted from the latest frame interacts explicitly with this pointer memory, enabling dense integration of the current observation into the global coordinate system. We design a 3D hierarchical position embedding to promote this interaction and design a simple yet effective fusion mechanism to ensure that our pointer memory is uniform and efficient. Our method achieves competitive or state-of-the-art performance on various tasks with low training costs. Code is available at: https://github.com/YkiWu/Point3R.
中文摘要:Point3R提出了一种在线密集三维重建框架,通过显式空间指针内存直接关联三维场景结构,有效整合最新观测数据,在保持低训练成本的同时实现了优越的性能。
English Summary: Point3R introduces an online framework for dense 3D reconstruction using explicit spatial pointer memory that directly associates with 3D structures, enabling efficient integration of new observations while maintaining competitive performance with low training costs.
Authors:Xin Zhou, Dingkang Liang, Kaijin Chen, Tianrui Feng, Xiwu Chen, Hongkai Lin, Yikang Ding, Feiyang Tan, Hengshuang Zhao, Xiang Bai
Abstract:
Video generation models have demonstrated remarkable performance, yet their broader adoption remains constrained by slow inference speeds and substantial computational costs, primarily due to the iterative nature of the denoising process. Addressing this bottleneck is essential for democratizing advanced video synthesis technologies and enabling their integration into real-world applications. This work proposes EasyCache, a training-free acceleration framework for video diffusion models. EasyCache introduces a lightweight, runtime-adaptive caching mechanism that dynamically reuses previously computed transformation vectors, avoiding redundant computations during inference. Unlike prior approaches, EasyCache requires no offline profiling, pre-computation, or extensive parameter tuning. We conduct comprehensive studies on various large-scale video generation models, including OpenSora, Wan2.1, and HunyuanVideo. Our method achieves leading acceleration performance, reducing inference time by up to 2.1-3.3$\times$ compared to the original baselines while maintaining high visual fidelity with a significant up to 36% PSNR improvement compared to the previous SOTA method. This improvement makes our EasyCache a efficient and highly accessible solution for high-quality video generation in both research and practical applications. The code is available at https://github.com/H-EmbodVis/EasyCache.
中文: EasyCache是一种无需训练的加速框架,通过动态重用视频扩散模型中的变换向量,在保持高视觉保真度的同时,将推理速度提升最高3.3倍,并比现有最佳方法的PSNR指标显著提高36%。
English: EasyCache is a training-free acceleration framework that dynamically reuses transformation vectors in video diffusion models, achieving up to 3.3× faster inference while improving visual quality with a 36% PSNR gain over prior methods.
Authors:Nikhil Chandak, Shashwat Goel, Ameya Prabhu, Moritz Hardt, Jonas Geiping
Abstract:
Multiple choice benchmarks have long been the workhorse of language model evaluation because grading multiple choice is objective and easy to automate. However, we show multiple choice questions from popular benchmarks can often be answered without even seeing the question. These shortcuts arise from a fundamental limitation of discriminative evaluation not shared by evaluations of the model's free-form, generative answers. Until recently, there appeared to be no viable, scalable alternative to multiple choice--but, we show that this has changed. We consider generative evaluation via what we call answer matching: Give the candidate model the question without the options, have it generate a free-form response, then use a modern language model with the reference answer to determine if the response matches the reference. To compare the validity of different evaluation strategies, we annotate MMLU-Pro and GPQA-Diamond to obtain human grading data, and measure the agreement of each evaluation approach. We find answer matching using recent models--even small ones--achieves near-perfect agreement, in the range of inter-annotator agreement. In contrast, both multiple choice evaluation and using LLM-as-a-judge without reference answers aligns poorly with human grading. Improving evaluations via answer matching is not merely a conceptual concern: the rankings of several models change significantly when evaluating their free-form responses with answer matching. In light of these findings, we discuss how to move the evaluation ecosystem from multiple choice to answer matching.
中文: 尽管多选题基准便于评估,但常存在无需理解问题即可作答的捷径,而采用现代语言模型进行答案匹配的生成式评估不仅与人工评分高度一致,还显著改变了模型排名。
English: Multiple choice benchmarks, while convenient, often contain shortcuts that allow answers without understanding the question, but generative evaluation through answer matching using modern language models achieves near-perfect agreement with human grading and significantly alters model rankings.
Authors:Purbesh Mitra, Sennur Ulukus
Abstract:
Recent advancements in the reasoning capabilities of large language models (LLMs) show that employing group relative policy optimization (GRPO) algorithm for reinforcement learning (RL) training allows the models to use more thinking/reasoning tokens for generating better responses. However, LLMs can generate only a finite amount of tokens while maintaining attention to the previously generated tokens. This limit, also known as the context size of an LLM, is a bottleneck in LLM reasoning with arbitrarily large number of tokens. To think beyond the limit of context size, an LLM must employ a modular thinking strategy to reason over multiple rounds. In this work, we propose $\textbf{MOTIF: Modular Thinking via Reinforcement Finetuning}$ -- an RL training method for generating thinking tokens in multiple rounds, effectively allowing the model to think with additional context size. We trained the open-source model Qwen2.5-3B-Instruct on GSM8K dataset via parameter efficient fine-tuning and tested its accuracy on MATH500 and AIME2024 benchmarks. Our experiments show 3.8\% and 3.3\% improvements over vanilla GRPO based training in the respective benchmarks. Furthermore, this improvement was achieved with only 15\% of samples, thus demonstrating sample efficiency of MOTIF. Our code and models are available at https://github.com/purbeshmitra/MOTIF and https://huggingface.co/purbeshmitra/MOTIF, respectively.
中文摘要:提出的MOTIF方法通过强化学习实现模块化多轮思考,有效提升大语言模型的推理能力,在基准测试中相比传统方法以更高样本效率显著提高了准确率。
English Summary: The proposed MOTIF method enhances large language models' reasoning by enabling modular, multi-round thinking through reinforcement learning, significantly improving accuracy on benchmarks with greater sample efficiency than previous approaches.
Authors:Kunyu Zhang, Qiang Li, Shujian Yu
Abstract:
Recent evidence suggests that modeling higher-order interactions (HOIs) in functional magnetic resonance imaging (fMRI) data can enhance the diagnostic accuracy of machine learning systems. However, effectively extracting and utilizing HOIs remains a significant challenge. In this work, we propose MvHo-IB, a novel multi-view learning framework that integrates both pairwise interactions and HOIs for diagnostic decision-making, while automatically compressing task-irrelevant redundant information. MvHo-IB introduces several key innovations: (1) a principled method that combines O-information from information theory with a matrix-based Renyi alpha-order entropy estimator to quantify and extract HOIs, (2) a purpose-built Brain3DCNN encoder to effectively utilize these interactions, and (3) a new multi-view learning information bottleneck objective to enhance representation learning. Experiments on three benchmark fMRI datasets demonstrate that MvHo-IB achieves state-of-the-art performance, significantly outperforming previous methods, including recent hypergraph-based techniques. The implementation of MvHo-IB is available at https://github.com/zky04/MvHo-IB.
中文: 提出的MvHo-IB框架通过结合高阶交互作用的多视角学习方法与信息瓶颈优化,显著提升了fMRI诊断准确性,在三个基准数据集上实现了最优性能。
English: The proposed MvHo-IB framework enhances fMRI diagnostic accuracy by integrating higher-order interactions through a novel multi-view learning approach with information bottleneck optimization, achieving state-of-the-art performance across three benchmark datasets.
Authors:Ziqi Miao, Yi Ding, Lijun Li, Jing Shao
Abstract:
With the emergence of strong vision language capabilities, multimodal large language models (MLLMs) have demonstrated tremendous potential for real-world applications. However, the security vulnerabilities exhibited by the visual modality pose significant challenges to deploying such models in open-world environments. Recent studies have successfully induced harmful responses from target MLLMs by encoding harmful textual semantics directly into visual inputs. However, in these approaches, the visual modality primarily serves as a trigger for unsafe behavior, often exhibiting semantic ambiguity and lacking grounding in realistic scenarios. In this work, we define a novel setting: vision-centric jailbreak, where visual information serves as a necessary component in constructing a complete and realistic jailbreak context. Building on this setting, we propose the VisCo (Visual Contextual) Attack. VisCo fabricates contextual dialogue using four distinct vision-focused strategies, dynamically generating auxiliary images when necessary to construct a vision-centric jailbreak scenario. To maximize attack effectiveness, it incorporates automatic toxicity obfuscation and semantic refinement to produce a final attack prompt that reliably triggers harmful responses from the target black-box MLLMs. Specifically, VisCo achieves a toxicity score of 4.78 and an Attack Success Rate (ASR) of 85% on MM-SafetyBench against GPT-4o, significantly outperforming the baseline, which achieves a toxicity score of 2.48 and an ASR of 22.2%. Code: https://github.com/Dtc7w3PQ/Visco-Attack.
Chinese: 多模态大语言模型面临以视觉为中心的越狱安全风险,VisCo攻击通过情境对话和动态图像生成有效诱导有害响应,在GPT-4o等模型上实现了高毒性和成功率。
English: Multimodal large language models face security risks from vision-centric jailbreaks, where the VisCo Attack uses contextual dialogue and dynamic image generation to effectively trigger harmful responses, achieving high toxicity and success rates against models like GPT-4o.
Authors:Gent Serifi, Marcel C. Bühler
Abstract:
We introduce HyperGaussians, a novel extension of 3D Gaussian Splatting for high-quality animatable face avatars. Creating such detailed face avatars from videos is a challenging problem and has numerous applications in augmented and virtual reality. While tremendous successes have been achieved for static faces, animatable avatars from monocular videos still fall in the uncanny valley. The de facto standard, 3D Gaussian Splatting (3DGS), represents a face through a collection of 3D Gaussian primitives. 3DGS excels at rendering static faces, but the state-of-the-art still struggles with nonlinear deformations, complex lighting effects, and fine details. While most related works focus on predicting better Gaussian parameters from expression codes, we rethink the 3D Gaussian representation itself and how to make it more expressive. Our insights lead to a novel extension of 3D Gaussians to high-dimensional multivariate Gaussians, dubbed 'HyperGaussians'. The higher dimensionality increases expressivity through conditioning on a learnable local embedding. However, splatting HyperGaussians is computationally expensive because it requires inverting a high-dimensional covariance matrix. We solve this by reparameterizing the covariance matrix, dubbed the 'inverse covariance trick'. This trick boosts the efficiency so that HyperGaussians can be seamlessly integrated into existing models. To demonstrate this, we plug in HyperGaussians into the state-of-the-art in fast monocular face avatars: FlashAvatar. Our evaluation on 19 subjects from 4 face datasets shows that HyperGaussians outperform 3DGS numerically and visually, particularly for high-frequency details like eyeglass frames, teeth, complex facial movements, and specular reflections.
Chinese: HyperGaussians 通过引入带有可学习嵌入的高维多变量高斯分布,增强了3D高斯泼溅技术,显著提升了可动画人脸化身的表达能力,在渲染精细细节和复杂面部运动方面优于现有方法。
English: HyperGaussians enhance 3D Gaussian Splatting by introducing high-dimensional multivariate Gaussians with learnable embeddings, significantly improving expressivity for animatable face avatars and outperforming existing methods in rendering fine details and complex facial movements.
Authors:Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Sung-Feng Huang, Chih-Kai Yang, Chee-En Yu, Chun-Wei Chen, Wei-Chih Chen, Chien-yu Huang, Yi-Cheng Lin, Yu-Xiang Lin, Chi-An Fu, Chun-Yi Kuan, Wenze Ren, Xuanjun Chen, Wei-Ping Huang, En-Pei Hu, Tzu-Quan Lin, Yuan-Kuei Wu, Kuan-Po Huang, Hsiao-Ying Huang, Huang-Cheng Chou, Kai-Wei Chang, Cheng-Han Chiang, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee
Abstract:
We introduce DeSTA2.5-Audio, a general-purpose Large Audio Language Model (LALM) designed for robust auditory perception and instruction-following, without requiring task-specific audio instruction-tuning. Recent LALMs typically augment Large Language Models (LLMs) with auditory capabilities by training on large-scale, manually curated or LLM-synthesized audio-instruction datasets. However, these approaches have often suffered from the catastrophic forgetting of the LLM's original language abilities. To address this, we revisit the data construction pipeline and propose DeSTA, a self-generated cross-modal alignment strategy in which the backbone LLM generates its own training targets. This approach preserves the LLM's native language proficiency while establishing effective audio-text alignment, thereby enabling zero-shot generalization without task-specific tuning. Using DeSTA, we construct DeSTA-AQA5M, a large-scale, task-agnostic dataset containing 5 million training samples derived from 7,000 hours of audio spanning 50 diverse datasets, including speech, environmental sounds, and music. DeSTA2.5-Audio achieves state-of-the-art or competitive performance across a wide range of audio-language benchmarks, including Dynamic-SUPERB, MMAU, SAKURA, Speech-IFEval, and VoiceBench. Comprehensive comparative studies demonstrate that our self-generated strategy outperforms widely adopted data construction and training strategies in both auditory perception and instruction-following capabilities. Our findings underscore the importance of carefully designed data construction in LALM development and offer practical insights for building robust, general-purpose LALMs.
中文: DeSTA2.5-Audio是一种大型音频语言模型,采用自生成跨模态对齐策略,在无需任务特定调优的情况下保持骨干大语言模型的语言能力,同时实现强大的音频感知和指令跟随,在多项基准测试中展现出领先性能。
English: DeSTA2.5-Audio is a large audio language model that employs a self-generated cross-modal alignment strategy to maintain the backbone LLM's language abilities while achieving robust audio perception and instruction-following without task-specific tuning, demonstrating state-of-the-art performance across multiple benchmarks.
Authors:Mingxin Liu, Peiyuan Zhang, Yuan Liu, Wei Zhang, Yue Zhou, Ning Liao, Ziyang Gong, Junwei Luo, Zhirui Wang, Yi Yu, Xue Yang
Abstract:
The growing demand for oriented object detection (OOD) across various domains has driven significant research in this area. However, the high cost of dataset annotation remains a major concern. Current mainstream OOD algorithms can be mainly categorized into three types: (1) fully supervised methods using complete oriented bounding box (OBB) annotations, (2) semi-supervised methods using partial OBB annotations, and (3) weakly supervised methods using weak annotations such as horizontal boxes or points. However, these algorithms inevitably increase the cost of models in terms of annotation speed or annotation cost. To address this issue, we propose:(1) the first Partial Weakly-Supervised Oriented Object Detection (PWOOD) framework based on partially weak annotations (horizontal boxes or single points), which can efficiently leverage large amounts of unlabeled data, significantly outperforming weakly supervised algorithms trained with partially weak annotations, also offers a lower cost solution; (2) Orientation-and-Scale-aware Student (OS-Student) model capable of learning orientation and scale information with only a small amount of orientation-agnostic or scale-agnostic weak annotations; and (3) Class-Agnostic Pseudo-Label Filtering strategy (CPF) to reduce the model's sensitivity to static filtering thresholds. Comprehensive experiments on DOTA-v1.0/v1.5/v2.0 and DIOR datasets demonstrate that our PWOOD framework performs comparably to, or even surpasses, traditional semi-supervised algorithms.
中文: 本文提出了一种基于部分弱标注的部分弱监督定向目标检测(PWOOD)框架,通过降低标注成本,在性能上达到甚至超越了传统半监督算法。
English: This paper introduces a Partial Weakly-Supervised Oriented Object Detection (PWOOD) framework that utilizes partial weak annotations to reduce annotation costs while achieving performance comparable to or better than traditional semi-supervised methods.
Authors:Alex Colagrande, Paul Caillon, Eva Feillet, Alexandre Allauzen
Abstract:
Transformers have become the de facto standard for a wide range of tasks, from image classification to physics simulations. Despite their impressive performance, the quadratic complexity of standard Transformers in both memory and time with respect to the input length makes them impractical for processing high-resolution inputs. Therefore, several variants have been proposed, the most successful relying on patchification, downsampling, or coarsening techniques, often at the cost of losing the finest-scale details. In this work, we take a different approach. Inspired by state-of-the-art techniques in $n$-body numerical simulations, we cast attention as an interaction problem between grid points. We introduce the Multipole Attention Neural Operator (MANO), which computes attention in a distance-based multiscale fashion. MANO maintains, in each attention head, a global receptive field and achieves linear time and memory complexity with respect to the number of grid points. Empirical results on image classification and Darcy flows demonstrate that MANO rivals state-of-the-art models such as ViT and Swin Transformer, while reducing runtime and peak memory usage by orders of magnitude. We open source our code for reproducibility at https://github.com/AlexColagrande/MANO.
Chinese: 多极注意力神经算子(MANO)借鉴n体数值模拟技术,采用基于距离的多尺度注意力机制,在图像分类等任务中性能媲美ViT和Swin Transformer等先进模型,同时实现线性时空复杂度并大幅降低计算资源消耗。
English: The Multipole Attention Neural Operator (MANO) introduces a distance-based multiscale attention mechanism inspired by n-body simulations, achieving linear complexity in time and memory while maintaining competitive performance with models like ViT and Swin Transformer across tasks such as image classification.
Authors:Mélanie Gaillochet, Mehrdad Noori, Sahar Dastani, Christian Desrosiers, Hervé Lombaert
Abstract:
Pixel-wise annotations are notoriously labourious and costly to obtain in the medical domain. To mitigate this burden, weakly supervised approaches based on bounding box annotations-much easier to acquire-offer a practical alternative. Vision foundation models have recently shown noteworthy segmentation performance when provided with prompts such as points or bounding boxes. Prompt learning exploits these models by adapting them to downstream tasks and automating segmentation, thereby reducing user intervention. However, existing prompt learning approaches depend on fully annotated segmentation masks. This paper proposes a novel framework that combines the representational power of foundation models with the annotation efficiency of weakly supervised segmentation. More specifically, our approach automates prompt generation for foundation models using only bounding box annotations. Our proposed optimization scheme integrates multiple constraints derived from box annotations with pseudo-labels generated by the prompted foundation model. Extensive experiments across multimodal datasets reveal that our weakly supervised method achieves an average Dice score of 84.90% in a limited data setting, outperforming existing fully-supervised and weakly-supervised approaches. The code is available at https://github.com/Minimel/box-prompt-learning-VFM.git
中文: 本文提出了一种弱监督分割框架,仅利用边界框标注自动生成视觉基础模型的提示,在有限数据设置下以84.90%的平均Dice分数超越了现有全监督和弱监督方法。
English: This paper introduces a weakly supervised segmentation framework that automates prompt generation for vision foundation models using only bounding box annotations, achieving superior performance over existing methods with an average Dice score of 84.90%.
Authors:Jiaxing Wang, Yifeng Yu, Jiahan Song, Bin Cao, Jing Fan, Ji Zhang
Abstract:
Next activity prediction represents a fundamental challenge for optimizing business processes in service-oriented architectures such as microservices environments, distributed enterprise systems, and cloud-native platforms, which enables proactive resource allocation and dynamic service composition. Despite the prevalence of sequence-based methods, these approaches fail to capture non-sequential relationships that arise from parallel executions and conditional dependencies. Even though graph-based approaches address structural preservation, they suffer from homogeneous representations and static structures that apply uniform modeling strategies regardless of individual process complexity characteristics. To address these limitations, we introduce RLHGNN, a novel framework that transforms event logs into heterogeneous process graphs with three distinct edge types grounded in established process mining theory. Our approach creates four flexible graph structures by selectively combining these edges to accommodate different process complexities, and employs reinforcement learning formulated as a Markov Decision Process to automatically determine the optimal graph structure for each specific process instance. RLHGNN then applies heterogeneous graph convolution with relation-specific aggregation strategies to effectively predict the next activity. This adaptive methodology enables precise modeling of both sequential and non-sequential relationships in service interactions. Comprehensive evaluation on six real-world datasets demonstrates that RLHGNN consistently outperforms state-of-the-art approaches. Furthermore, it maintains an inference latency of approximately 1 ms per prediction, representing a highly practical solution suitable for real-time business process monitoring applications. The source code is available at https://github.com/Joker3993/RLHGNN.
中文: RLHGNN是一种创新框架,通过强化学习自动构建最优异构图结构来预测业务流程中的下一活动,能同时捕捉顺序与非顺序关系并保持实时性能,在六个真实数据集上均优于现有方法。
English: RLHGNN is a novel framework that uses reinforcement learning to automatically construct optimal heterogeneous graph structures from event logs, enabling precise prediction of next activities in business processes by capturing both sequential and non-sequential relationships while maintaining real-time performance.
Authors:Jiajie Jin, Xiaoxi Li, Guanting Dong, Yuyao Zhang, Yutao Zhu, Yang Zhao, Hongjin Qian, Zhicheng Dou
Abstract:
Complex information needs in real-world search scenarios demand deep reasoning and knowledge synthesis across diverse sources, which traditional retrieval-augmented generation (RAG) pipelines struggle to address effectively. Current reasoning-based approaches suffer from a fundamental limitation: they use a single model to handle both high-level planning and detailed execution, leading to inefficient reasoning and limited scalability. In this paper, we introduce HiRA, a hierarchical framework that separates strategic planning from specialized execution. Our approach decomposes complex search tasks into focused subtasks, assigns each subtask to domain-specific agents equipped with external tools and reasoning capabilities, and coordinates the results through a structured integration mechanism. This separation prevents execution details from disrupting high-level reasoning while enabling the system to leverage specialized expertise for different types of information processing. Experiments on four complex, cross-modal deep search benchmarks demonstrate that HiRA significantly outperforms state-of-the-art RAG and agent-based systems. Our results show improvements in both answer quality and system efficiency, highlighting the effectiveness of decoupled planning and execution for multi-step information seeking tasks. Our code is available at https://github.com/ignorejjj/HiRA.
Chinese: HiRA 提出了一种分层框架,将策略规划与专业执行分离,以优化复杂搜索任务,在答案质量和系统效率上均显著优于现有方法。
English: HiRA introduces a hierarchical framework that separates strategic planning from specialized execution to enhance complex search tasks, significantly outperforming existing systems in both answer quality and efficiency.
Authors:Xing Liu, Lizhuo Luo, Ming Tang, Chao Huang
Abstract:
Distributed inference serves as a promising approach to enabling the inference of large language models (LLMs) at the network edge. It distributes the inference process to multiple devices to ensure that the LLMs can fit into the device memory. Recent pipeline-based approaches have the potential to parallelize communication and computation, which helps reduce inference latency. However, the benefit diminishes when the inference request at the network edge is sparse, where pipeline is typically at low utilization. To enable efficient distributed LLM inference at the edge, we propose \textbf{FlowSpec}, a pipeline-parallel tree-based speculative decoding framework. FlowSpec incorporates three key mechanisms to improve decoding efficiency: 1) score-based step-wise verification prioritizes more important draft tokens to bring earlier accpeted tokens; 2) efficient draft management to prune invalid tokens while maintaining correct causal relationship during verification; 3) dynamic draft expansion strategies to supply high-quality speculative inputs. These techniques work in concert to enhance both pipeline utilization and speculative efficiency. We evaluate FlowSpec on a real-world testbed with other baselines. Experimental results demonstrate that our proposed framework significantly improves inference speed across diverse models and configurations, achieving speedup ratios 1.28$\times$-1.79$\times$ compared to baselines. Our code is publicly available at \href{https://github.com/Leosang-lx/FlowSpec#}{https://github.com/Leosang-lx/FlowSpec\#}
Chinese: FlowSpec是一种基于流水线并行和树形结构的推测解码框架,通过优先级验证、草稿管理和动态扩展机制提升边缘分布式大语言模型推理效率,实验显示其推理速度较基线方法提升1.28至1.79倍。
English: FlowSpec is a pipeline-parallel tree-based speculative decoding framework that enhances distributed LLM inference efficiency at the edge through prioritized verification, draft management, and dynamic expansion, achieving 1.28×-1.79× speedup over baselines.
Authors:Edan Toledo, Karen Hambardzumyan, Martin Josifoski, Rishi Hazra, Nicolas Baldwin, Alexis Audran-Reiss, Michael Kuchnik, Despoina Magka, Minqi Jiang, Alisia Maria Lupidi, Andrei Lupu, Roberta Raileanu, Kelvin Niu, Tatiana Shavrina, Jean-Christophe Gagnon-Audet, Michael Shvartsman, Shagun Sodhani, Alexander H. Miller, Abhishek Charnalia, Derek Dunfield, Carole-Jean Wu, Pontus Stenetorp, Nicola Cancedda, Jakob Nicolaus Foerster, Yoram Bachrach
Abstract:
AI research agents are demonstrating great potential to accelerate scientific progress by automating the design, implementation, and training of machine learning models. We focus on methods for improving agents' performance on MLE-bench, a challenging benchmark where agents compete in Kaggle competitions to solve real-world machine learning problems. We formalize AI research agents as search policies that navigate a space of candidate solutions, iteratively modifying them using operators. By designing and systematically varying different operator sets and search policies (Greedy, MCTS, Evolutionary), we show that their interplay is critical for achieving high performance. Our best pairing of search strategy and operator set achieves a state-of-the-art result on MLE-bench lite, increasing the success rate of achieving a Kaggle medal from 39.6% to 47.7%. Our investigation underscores the importance of jointly considering the search strategy, operator design, and evaluation methodology in advancing automated machine learning.
中文: 人工智能研究代理被形式化为在解空间中导航的搜索策略,通过优化搜索策略与操作集的协同作用,在MLE-bench基准测试中取得了突破性成果。
English: AI research agents are formalized as search policies that navigate solution spaces using operators, with the interplay between search strategies and operator sets proving critical for achieving state-of-the-art performance on the MLE-bench benchmark.
Authors:Abiam Remache González, Meriem Chagour, Timon Bijan Rüth, Raúl Trapiella Cañedo, Marina MartÃnez Soler, Ãlvaro Lorenzo Felipe, Hyun-Suk Shin, MarÃa-Jesús Zamorano Serrano, Ricardo Torres, Juan-Antonio Castillo Parra, Eduardo Reyes Abad, Miguel-Ãngel Ferrer Ballester, Juan-Manuel Afonso López, Francisco-Mario Hernández Tejera, Adrian Penate-Sanchez
Abstract:
This paper introduces IMASHRIMP, an adapted system for the automated morphological analysis of white shrimp (Penaeus vannamei}, aimed at optimizing genetic selection tasks in aquaculture. Existing deep learning and computer vision techniques were modified to address the specific challenges of shrimp morphology analysis from RGBD images. IMASHRIMP incorporates two discrimination modules, based on a modified ResNet-50 architecture, to classify images by the point of view and determine rostrum integrity. It is proposed a "two-factor authentication (human and IA)" system, it reduces human error in view classification from 0.97% to 0% and in rostrum detection from 12.46% to 3.64%. Additionally, a pose estimation module was adapted from VitPose to predict 23 key points on the shrimp's skeleton, with separate networks for lateral and dorsal views. A morphological regression module, using a Support Vector Machine (SVM) model, was integrated to convert pixel measurements to centimeter units. Experimental results show that the system effectively reduces human error, achieving a mean average precision (mAP) of 97.94% for pose estimation and a pixel-to-centimeter conversion error of 0.07 (+/- 0.1) cm. IMASHRIMP demonstrates the potential to automate and accelerate shrimp morphological analysis, enhancing the efficiency of genetic selection and contributing to more sustainable aquaculture practices.The code are available at https://github.com/AbiamRemacheGonzalez/ImaShrimp-public
中文: 本文介绍了IMASHRIMP系统,它采用改进的深度学习和计算机视觉技术,通过RGBD图像自动分析白虾形态特征,显著降低了人工误差,在姿态估计和尺寸测量转换中实现高精度,有效提升了水产养殖遗传选育效率。
English: This paper presents IMASHRIMP, an automated system using adapted deep learning and computer vision to analyze white shrimp morphology from RGBD images, significantly reducing human error and achieving high precision in pose estimation and measurement conversion for enhanced genetic selection in aquaculture.
Authors:Dimitrios Bouzoulas, Eerik Alamikkotervo, Risto Ojala
Abstract:
Pedestrian detection in RGB images is a key task in pedestrian safety, as the most common sensor in autonomous vehicles and advanced driver assistance systems is the RGB camera. A challenge in RGB pedestrian detection, that does not appear to have large public datasets, is low-light conditions. As a solution, in this research, we propose an automated infrared-RGB labeling pipeline. The proposed pipeline consists of 1) Infrared detection, where a fine-tuned model for infrared pedestrian detection is used 2) Label transfer process from the infrared detections to their RGB counterparts 3) Training object detection models using the generated labels for low-light RGB pedestrian detection. The research was performed using the KAIST dataset. For the evaluation, object detection models were trained on the generated autolabels and ground truth labels. When compared on a previously unseen image sequence, the results showed that the models trained on generated labels outperformed the ones trained on ground-truth labels in 6 out of 9 cases for the mAP@50 and mAP@50-95 metrics. The source code for this research is available at https://github.com/BouzoulasDimitrios/IR-RGB-Automated-LowLight-Pedestrian-Labeling
中文摘要:本研究提出了一种自动化的红外-RGB标注流程,通过将红外检测标签迁移至RGB图像来解决低光照行人检测难题,实验表明该方法在多数测试案例中优于基于人工标注的训练效果。
English Summary: This research introduces an automated infrared-RGB labeling pipeline to address low-light pedestrian detection challenges by transferring infrared detection labels to RGB images, demonstrating improved performance over ground-truth labels in most test cases.
Authors:Chenxu Wang, Yilin Lyu, Zicheng Sun, Liping Jing
Abstract:
Continual fine-tuning of Large Language Models (LLMs) is hampered by the trade-off between efficiency and expressiveness. Low-Rank Adaptation (LoRA) offers efficiency but constrains the model's ability to learn new tasks and transfer knowledge due to its low-rank nature and reliance on explicit parameter constraints. We propose GORP (Gradient LOw Rank Projection) for Continual Learning, a novel training strategy that overcomes these limitations by synergistically combining full and low-rank parameters and jointly updating within a unified low-rank gradient subspace. GORP expands the optimization space while preserving efficiency and mitigating catastrophic forgetting. Extensive experiments on continual learning benchmarks demonstrate GORP's superior performance compared to existing state-of-the-art approaches. Code is available at https://github.com/Wcxwcxw/GORP.
中文: GORP是一种新颖的持续学习策略,通过协同结合完整参数和低秩参数来扩展优化空间,同时保持效率并减少灾难性遗忘,在基准测试中优于现有方法。
English: GORP is a novel continual learning strategy that synergistically combines full and low-rank parameters to expand the optimization space while maintaining efficiency and reducing catastrophic forgetting, outperforming existing methods in benchmarks.
Authors:Luca Parolari, Andrea Cherubini, Lamberto Ballan, Carlo Biffi
Abstract:
Automated polyp counting in colonoscopy is a crucial step toward automated procedure reporting and quality control, aiming to enhance the cost-effectiveness of colonoscopy screening. Counting polyps in a procedure involves detecting and tracking polyps, and then clustering tracklets that belong to the same polyp entity. Existing methods for polyp counting rely on self-supervised learning and primarily leverage visual appearance, neglecting temporal relationships in both tracklet feature learning and clustering stages. In this work, we introduce a paradigm shift by proposing a supervised contrastive loss that incorporates temporally-aware soft targets. Our approach captures intra-polyp variability while preserving inter-polyp discriminability, leading to more robust clustering. Additionally, we improve tracklet clustering by integrating a temporal adjacency constraint, reducing false positive re-associations between visually similar but temporally distant tracklets. We train and validate our method on publicly available datasets and evaluate its performance with a leave-one-out cross-validation strategy. Results demonstrate a 2.2x reduction in fragmentation rate compared to prior approaches. Our results highlight the importance of temporal awareness in polyp counting, establishing a new state-of-the-art. Code is available at https://github.com/lparolari/temporally-aware-polyp-counting.
中文: 本研究提出了一种结合时序软目标和邻接约束的监督对比学习方法,显著提升了结肠镜息肉计数的准确性,通过降低碎片化程度增强了聚类鲁棒性,相比现有方法性能提高了2.2倍。
English: This study introduces a supervised contrastive learning method with temporal soft targets and adjacency constraints to enhance polyp counting in colonoscopy by reducing fragmentation and improving clustering robustness, achieving a 2.2x improvement over existing approaches.
Authors:Zunhui Xia, Hongxing Li, Libin Lan
Abstract:
Medical image recognition serves as a key way to aid in clinical diagnosis, enabling more accurate and timely identification of diseases and abnormalities. Vision transformer-based approaches have proven effective in handling various medical recognition tasks. However, these methods encounter two primary challenges. First, they are often task-specific and architecture-tailored, limiting their general applicability. Second, they usually either adopt full attention to model long-range dependencies, resulting in high computational costs, or rely on handcrafted sparse attention, potentially leading to suboptimal performance. To tackle these issues, we present MedFormer, an efficient medical vision transformer with two key ideas. First, it employs a pyramid scaling structure as a versatile backbone for various medical image recognition tasks, including image classification and dense prediction tasks such as semantic segmentation and lesion detection. This structure facilitates hierarchical feature representation while reducing the computation load of feature maps, highly beneficial for boosting performance. Second, it introduces a novel Dual Sparse Selection Attention (DSSA) with content awareness to improve computational efficiency and robustness against noise while maintaining high performance. As the core building technique of MedFormer, DSSA is designed to explicitly attend to the most relevant content. Theoretical analysis demonstrates that MedFormer outperforms existing medical vision transformers in terms of generality and efficiency. Extensive experiments across various imaging modality datasets show that MedFormer consistently enhances performance in all three medical image recognition tasks mentioned above. MedFormer provides an efficient and versatile solution for medical image recognition, with strong potential for clinical application.
中文摘要:MedFormer是一种高效通用的医学视觉变换器,通过金字塔缩放结构和新型双稀疏选择注意力机制解决现有方法的局限性,在多种医学图像识别任务中展现出卓越性能。
English Summary: MedFormer is an efficient and versatile medical vision transformer that uses a pyramid scaling structure and a novel Dual Sparse Selection Attention to overcome limitations of existing methods, demonstrating superior performance across various medical image recognition tasks.
Authors:Teng Fu, Yuwen Chen, Zhuofan Chen, Mengyang Zhao, Bin Li, Xiangyang Xue
Abstract:
Multi-object tracking is a classic field in computer vision. Among them, pedestrian tracking has extremely high application value and has become the most popular research category. Existing methods mainly use motion or appearance information for tracking, which is often difficult in complex scenarios. For the motion information, mutual occlusions between objects often prevent updating of the motion state; for the appearance information, non-robust results are often obtained due to reasons such as only partial visibility of the object or blurred images. Although learning how to perform tracking in these situations from the annotated data is the simplest solution, the existing MOT dataset fails to satisfy this solution. Existing methods mainly have two drawbacks: relatively simple scene composition and non-realistic scenarios. Although some of the video sequences in existing dataset do not have the above-mentioned drawbacks, the number is far from adequate for research purposes. To this end, we propose a difficult large-scale dataset for multi-pedestrian tracking, shot mainly from the first-person view and all from real-life complex scenarios. We name it ``CrowdTrack'' because there are numerous objects in most of the sequences. Our dataset consists of 33 videos, containing a total of 5,185 trajectories. Each object is annotated with a complete bounding box and a unique object ID. The dataset will provide a platform to facilitate the development of algorithms that remain effective in complex situations. We analyzed the dataset comprehensively and tested multiple SOTA models on our dataset. Besides, we analyzed the performance of the foundation models on our dataset. The dataset and project code is released at: https://github.com/loseevaya/CrowdTrack .
中文: 作者提出了“CrowdTrack”,这是一个从第一人称视角在复杂现实场景中拍摄的、具有挑战性的大规模多行人跟踪数据集,包含33个视频和5,185条完整标注轨迹,旨在克服现有数据集的局限性并推动鲁棒跟踪算法的发展。
English: The authors introduce "CrowdTrack," a challenging large-scale dataset for multi-pedestrian tracking captured from first-person views in complex real-world scenarios, designed to overcome limitations of existing datasets by providing 33 videos with 5,185 fully annotated trajectories to advance robust tracking algorithms.
Authors:Wei Li, Jingyang Zhang, Lihao Liu, Guoan Wang, Junjun He, Yang Chen, Lixu Gu
Abstract:
Test-Time Adaptation (TTA) has emerged as a promising solution for adapting a source model to unseen medical sites using unlabeled test data, due to the high cost of data annotation. Existing TTA methods consider scenarios where data from one or multiple domains arrives in complete domain units. However, in clinical practice, data usually arrives in domain fragments of arbitrary lengths and in random arrival orders, due to resource constraints and patient variability. This paper investigates a practical Free-Form Test-Time Adaptation (F$^{2}$TTA) task, where a source model is adapted to such free-form domain fragments, with shifts occurring between fragments unpredictably. In this setting, these shifts could distort the adaptation process. To address this problem, we propose a novel Image-level Disentangled Prompt Tuning (I-DiPT) framework. I-DiPT employs an image-invariant prompt to explore domain-invariant representations for mitigating the unpredictable shifts, and an image-specific prompt to adapt the source model to each test image from the incoming fragments. The prompts may suffer from insufficient knowledge representation since only one image is available for training. To overcome this limitation, we first introduce Uncertainty-oriented Masking (UoM), which encourages the prompts to extract sufficient information from the incoming image via masked consistency learning driven by the uncertainty of the source model representations. Then, we further propose a Parallel Graph Distillation (PGD) method that reuses knowledge from historical image-specific and image-invariant prompts through parallel graph networks. Experiments on breast cancer and glaucoma classification demonstrate the superiority of our method over existing TTA approaches in F$^{2}$TTA. Code is available at https://github.com/mar-cry/F2TTA.
中文摘要:测试时适应(TTA)旨在利用未标记测试数据将源模型适配到未见医疗站点,但现有方法假设数据以完整域单元到达,而临床数据实际以任意长度和随机顺序的域片段形式出现,为此提出图像级解耦提示调优(I-DiPT)框架,通过不确定性导向掩码和平行图蒸馏技术有效应对片段间不可预测的分布偏移问题。
English Summary: Test-Time Adaptation (TTA) addresses adapting source models to unlabeled medical data from new sites, but existing methods assume complete domain units, whereas real clinical data arrives in unpredictable fragments with shifts, prompting the proposed Image-level Disentangled Prompt Tuning (I-DiPT) framework that uses uncertainty-driven masking and parallel graph distillation to handle these challenges effectively.
Authors:Zihan Tan, Suyuan Huang, Guancheng Wan, Wenke Huang, He Li, Mang Ye
Abstract:
Federated Graph Learning (FGL) combines the privacy-preserving capabilities of federated learning (FL) with the strong graph modeling capability of Graph Neural Networks (GNNs). Current research addresses subgraph-FL from the structural perspective, neglecting the propagation of graph signals on spatial and spectral domains of the structure. From a spatial perspective, subgraph-FL introduces edge disconnections between clients, leading to disruptions in label signals and a degradation in the semantic knowledge of the global GNN. From a spectral perspective, spectral heterogeneity causes inconsistencies in signal frequencies across subgraphs, which makes local GNNs overfit the local signal propagation schemes. As a result, spectral client drift occurs, undermining global generalizability. To tackle the challenges, we propose a global knowledge repository to mitigate the challenge of poor semantic knowledge caused by label signal disruption. Furthermore, we design a frequency alignment to address spectral client drift. The combination of Spatial and Spectral strategies forms our framework S2FGL. Extensive experiments on multiple datasets demonstrate the superiority of S2FGL. The code is available at https://github.com/Wonder7racer/S2FGL.git.
Chinese: 联邦图学习(FGL)融合了联邦学习的隐私保护与图神经网络强大的建模能力,提出的S2FGL框架通过全局知识库和频率对齐策略解决结构和频谱层面的挑战,从而提升模型性能。
English: Federated Graph Learning (FGL) integrates federated learning's privacy protection with Graph Neural Networks' modeling power, and the proposed S2FGL framework addresses structural and spectral challenges through a global knowledge repository and frequency alignment to enhance performance.
Authors:Mufhumudzi Muthivhi, Terence L. van Zyl
Abstract:
Wildlife re-identification aims to match individuals of the same species across different observations. Current state-of-the-art (SOTA) models rely on class labels to train supervised models for individual classification. This dependence on annotated data has driven the curation of numerous large-scale wildlife datasets. This study investigates self-supervised learning Self-Supervised Learning (SSL) for wildlife re-identification. We automatically extract two distinct views of an individual using temporal image pairs from camera trap data without supervision. The image pairs train a self-supervised model from a potentially endless stream of video data. We evaluate the learnt representations against supervised features on open-world scenarios and transfer learning in various wildlife downstream tasks. The analysis of the experimental results shows that self-supervised models are more robust even with limited data. Moreover, self-supervised features outperform supervision across all downstream tasks. The code is available here https://github.com/pxpana/SSLWildlife.
中文摘要:本研究通过从相机陷阱数据中自动提取无监督的时间图像对,探索了自监督学习在野生动物重识别中的应用,证明自监督模型即使在数据有限的情况下,其鲁棒性和各项下游任务性能均优于监督学习方法。
English Summary: This study explores self-supervised learning for wildlife re-identification by automatically extracting temporal image pairs from camera trap data, demonstrating that self-supervised models outperform supervised approaches in robustness and performance across various downstream tasks even with limited data.
Authors:Taehoon Kim, Jongwook Choi, Yonghyun Jeong, Haeun Noh, Jaejun Yoo, Seungryul Baek, Jongwon Choi
Abstract:
We introduce a deepfake video detection approach that exploits pixel-wise temporal inconsistencies, which traditional spatial frequency-based detectors often overlook. Traditional detectors represent temporal information merely by stacking spatial frequency spectra across frames, resulting in the failure to detect temporal artifacts in the pixel plane. Our approach performs a 1D Fourier transform on the time axis for each pixel, extracting features highly sensitive to temporal inconsistencies, especially in areas prone to unnatural movements. To precisely locate regions containing the temporal artifacts, we introduce an attention proposal module trained in an end-to-end manner. Additionally, our joint transformer module effectively integrates pixel-wise temporal frequency features with spatio-temporal context features, expanding the range of detectable forgery artifacts. Our framework represents a significant advancement in deepfake video detection, providing robust performance across diverse and challenging detection scenarios.
Chinese: 我们的方法通过逐像素时间轴傅里叶变换检测深度伪造视频中的时序异常,并结合注意力机制与Transformer模块融合时空特征,大幅提升了检测系统的鲁棒性。
English: Our method detects deepfake videos by analyzing pixel-wise temporal inconsistencies through a 1D Fourier transform and integrates these features with spatio-temporal contexts using a transformer module, significantly improving detection robustness.
Authors:Fangru Zhou, Jun Zhao, Guoxin Wang
Abstract:
JoyTTS is an end-to-end spoken chatbot that combines large language models (LLM) with text-to-speech (TTS) technology, featuring voice cloning capabilities. This project is built upon the open-source MiniCPM-o and CosyVoice2 models and trained on 2000 hours of conversational data. We have also provided the complete training code to facilitate further development and optimization by the community. On the testing machine seed-tts-zh, it achieves a SS (speaker similarity) score of 0.73 and a WER (Word Error Rate) of 5.09. The code and models, along with training and inference scripts, are available at https://github.com/jdh-algo/JoyTTS.git.
中文:JoyTTS是一款融合大语言模型与语音合成技术的端到端语音聊天机器人,具备声音克隆功能,在测试中实现了0.73的说话人相似度和5.09%的词错率,相关代码已在GitHub开源。
English: JoyTTS is an end-to-end spoken chatbot integrating LLM and TTS with voice cloning, achieving a 0.73 speaker similarity score and 5.09% word error rate, with open-source code available on GitHub.
Authors:Anlin Zheng, Haochen Wang, Yucheng Zhao, Weipeng Deng, Tiancai Wang, Xiangyu Zhang, Xiaojuan Qi
Abstract:
Vanilla autoregressive image generation models generate visual tokens step-by-step, limiting their ability to capture holistic relationships among token sequences. Moreover, because most visual tokenizers map local image patches into latent tokens, global information is limited. To address this, we introduce \textit{Hita}, a novel image tokenizer for autoregressive (AR) image generation. It introduces a holistic-to-local tokenization scheme with learnable holistic queries and local patch tokens. Hita incorporates two key strategies to better align with the AR generation process: 1) {arranging} a sequential structure with holistic tokens at the beginning, followed by patch-level tokens, and using causal attention to maintain awareness of previous tokens; and 2) adopting a lightweight fusion module before feeding the de-quantized tokens into the decoder to control information flow and prioritize holistic tokens. Extensive experiments show that Hita accelerates the training speed of AR generators and outperforms those trained with vanilla tokenizers, achieving \textbf{2.59 FID} and \textbf{281.9 IS} on the ImageNet benchmark. Detailed analysis of the holistic representation highlights its ability to capture global image properties, such as textures, materials, and shapes. Additionally, Hita also demonstrates effectiveness in zero-shot style transfer and image in-painting. The code is available at \href{https://github.com/CVMI-Lab/Hita}{https://github.com/CVMI-Lab/Hita}.
中文: Hita分词器采用整体到局部的方案,通过顺序排列标记和融合模块来增强自回归图像生成,捕捉全局属性,在ImageNet等基准测试中表现优异。
English: The Hita tokenizer introduces a holistic-to-local scheme with sequential token arrangement and fusion modules to enhance autoregressive image generation by capturing global properties, achieving superior performance on benchmarks like ImageNet.
Authors:Changhun Kim, Yechan Mun, Sangchul Hahn, Eunho Yang
Abstract:
This study proposes DeltaSHAP, a novel explainable artificial intelligence (XAI) algorithm specifically designed for online patient monitoring systems. In clinical environments, discovering the causes driving patient risk evolution is critical for timely intervention, yet existing XAI methods fail to address the unique requirements of clinical time series explanation tasks. To this end, DeltaSHAP addresses three key clinical needs: explaining the changes in the consecutive predictions rather than isolated prediction scores, providing both magnitude and direction of feature attributions, and delivering these insights in real time. By adapting Shapley values to temporal settings, our approach accurately captures feature coalition effects. It further attributes prediction changes using only the actually observed feature combinations, making it efficient and practical for time-sensitive clinical applications. We also introduce new evaluation metrics to evaluate the faithfulness of the attributions for online time series, and demonstrate through experiments on online patient monitoring tasks that DeltaSHAP outperforms state-of-the-art XAI methods in both explanation quality as 62% and computational efficiency as 33% time reduction on the MIMIC-III decompensation benchmark. We release our code at https://github.com/AITRICS/DeltaSHAP.
中文: DeltaSHAP是一种专为在线患者监测设计的新型可解释人工智能算法,通过实时解释预测变化满足临床需求,在解释质量和计算效率上均优于现有方法。
English: DeltaSHAP is a novel explainable AI algorithm tailored for online patient monitoring, addressing clinical needs by explaining prediction changes with real-time efficiency and outperforming existing methods in both explanation quality and computational speed.
Authors:Nina Konovalova, Maxim Nikolaev, Andrey Kuznetsov, Aibek Alanov
Abstract:
Despite significant progress in text-to-image diffusion models, achieving precise spatial control over generated outputs remains challenging. ControlNet addresses this by introducing an auxiliary conditioning module, while ControlNet++ further refines alignment through a cycle consistency loss applied only to the final denoising steps. However, this approach neglects intermediate generation stages, limiting its effectiveness. We propose InnerControl, a training strategy that enforces spatial consistency across all diffusion steps. Our method trains lightweight convolutional probes to reconstruct input control signals (e.g., edges, depth) from intermediate UNet features at every denoising step. These probes efficiently extract signals even from highly noisy latents, enabling pseudo ground truth controls for training. By minimizing the discrepancy between predicted and target conditions throughout the entire diffusion process, our alignment loss improves both control fidelity and generation quality. Combined with established techniques like ControlNet++, InnerControl achieves state-of-the-art performance across diverse conditioning methods (e.g., edges, depth).
Chinese: InnerControl通过在所有去噪步骤中使用轻量级卷积探针从中间特征重建控制信号,强制实现空间一致性,从而在与ControlNet++等技术结合时,显著提升了控制精度和生成质量。
English: InnerControl enhances text-to-image diffusion models by enforcing spatial consistency across all denoising steps through lightweight convolutional probes that reconstruct control signals from intermediate features, achieving superior alignment and generation quality when combined with existing methods like ControlNet++.
Authors:JaeHyuck Choi, MinJun Kim, JeHyeong Hong
Abstract:
Few-shot anomaly generation is emerging as a practical solution for augmenting the scarce anomaly data in industrial quality control settings. An ideal generator would meet three demands at once, namely (i) keep the normal background intact, (ii) inpaint anomalous regions to tightly overlap with the corresponding anomaly masks, and (iii) generate anomalous regions in a semantically valid location, while still producing realistic, diverse appearances from only a handful of real examples. Existing diffusion-based methods usually satisfy at most two of these requirements: global anomaly generators corrupt the background, whereas mask-guided ones often falter when the mask is imprecise or misplaced. We propose MAGIC--Mask-guided inpainting with multi-level perturbations and Context-aware alignment--to resolve all three issues. At its core, MAGIC fine-tunes a Stable Diffusion inpainting backbone that preserves normal regions and ensures strict adherence of the synthesized anomaly to the supplied mask, directly addressing background corruption and misalignment. To offset the diversity loss that fine-tuning can cause, MAGIC adds two complementary perturbation strategies: (i) Gaussian prompt-level perturbation applied during fine-tuning and inference that broadens the global appearance of anomalies while avoiding low-fidelity textual appearances, and (ii) mask-guided spatial noise injection that enriches local texture variations. Additionally, the context-aware mask alignment module forms semantic correspondences and relocates masks so that every anomaly remains plausibly contained within the host object, eliminating out-of-boundary artifacts. Under a consistent identical evaluation protocol on the MVTec-AD dataset, MAGIC outperforms previous state-of-the-arts in downstream anomaly tasks.
中文摘要:MAGIC是一种新颖的小样本异常生成方法,通过多级扰动和上下文感知对齐技术,在保留正常背景的同时精确贴合异常掩码并确保语义合理性,在MVTec-AD数据集上超越了现有最优方法。
English Summary: MAGIC is a novel few-shot anomaly generation method that preserves normal backgrounds, aligns anomalies precisely with masks, and ensures semantic plausibility through multi-level perturbations and context-aware alignment, outperforming existing techniques on the MVTec-AD dataset.
Authors:Dohoon Kim, Donghun Kang, Taesup Moon
Abstract:
Domain-Adaptive Pre-training (DAP) has recently gained attention for its effectiveness in fine-tuning pre-trained models. Building on this, continual DAP has been explored to develop pre-trained models capable of incrementally incorporating different domain datasets. However, existing continual DAP methods face several limitations: (1) high computational cost and GPU memory usage during training; (2) sensitivity to incremental data order; and (3) providing a single, generalized model for all end tasks, which contradicts the essence of DAP. In this paper, we propose DoMIX, a novel approach that addresses these challenges by leveraging LoRA modules, a representative parameter-efficient fine-tuning (PEFT) method. Our approach enables efficient and parallel domain-adaptive pre-training that is robust to domain order and effectively utilizes accumulated knowledge to provide tailored pre-trained models for specific tasks. We also demonstrate that our method can be extended beyond the DAP setting to standard LLM fine-tuning scenarios. Code is available at https://github.com/dohoonkim-ai/DoMIX.
Chinese: DoMIX提出了一种利用LoRA模块的高效并行领域自适应预训练方法,解决了持续DAP中的计算成本高、领域顺序敏感和缺乏任务专用模型的问题,并可扩展至标准大语言模型微调场景。
English: DoMIX introduces an efficient and parallel domain-adaptive pre-training method using LoRA modules to overcome computational costs, domain order sensitivity, and lack of task-specific models in continual DAP, extending its applicability to standard LLM fine-tuning.
Authors:Fanghai Yi, Zehong Zheng, Zexiao Liang, Yihang Dong, Xiyang Fang, Wangyu Wu, Xuhang Chen
Abstract:
Enhancing underwater images is crucial for exploration. These images face visibility and color issues due to light changes, water turbidity, and bubbles. Traditional prior-based methods and pixel-based methods often fail, while deep learning lacks sufficient high-quality datasets. We introduce the Multi-Axis Conditional Lookup (MAC-Lookup) model, which enhances visual quality by improving color accuracy, sharpness, and contrast. It includes Conditional 3D Lookup Table Color Correction (CLTCC) for preliminary color and quality correction and Multi-Axis Adaptive Enhancement (MAAE) for detail refinement. This model prevents over-enhancement and saturation while handling underwater challenges. Extensive experiments show that MAC-Lookup excels in enhancing underwater images by restoring details and colors better than existing methods. The code is https://github.com/onlycatdoraemon/MAC-Lookup.
中文: MAC-Lookup模型通过颜色校正和自适应增强技术,有效提升水下图像质量,恢复细节与色彩,实验证明其性能优于现有方法。
English: The MAC-Lookup model effectively enhances underwater images by using color correction and adaptive enhancement to improve visual quality and overcome common underwater challenges, outperforming existing methods in experiments.
Authors:Yuxiang Zhang, Wei Li, Wen Jia, Mengmeng Zhang, Ran Tao, Shunlin Liang
Abstract:
Utilizing hyperspectral remote sensing technology enables the extraction of fine-grained land cover classes. Typically, satellite or airborne images used for training and testing are acquired from different regions or times, where the same class has significant spectral shifts in different scenes. In this paper, we propose a Bi-directional Domain Adaptation (BiDA) framework for cross-domain hyperspectral image (HSI) classification, which focuses on extracting both domain-invariant features and domain-specific information in the independent adaptive space, thereby enhancing the adaptability and separability to the target scene. In the proposed BiDA, a triple-branch transformer architecture (the source branch, target branch, and coupled branch) with semantic tokenizer is designed as the backbone. Specifically, the source branch and target branch independently learn the adaptive space of source and target domains, a Coupled Multi-head Cross-attention (CMCA) mechanism is developed in coupled branch for feature interaction and inter-domain correlation mining. Furthermore, a bi-directional distillation loss is designed to guide adaptive space learning using inter-domain correlation. Finally, we propose an Adaptive Reinforcement Strategy (ARS) to encourage the model to focus on specific generalized feature extraction within both source and target scenes in noise condition. Experimental results on cross-temporal/scene airborne and satellite datasets demonstrate that the proposed BiDA performs significantly better than some state-of-the-art domain adaptation approaches. In the cross-temporal tree species classification task, the proposed BiDA is more than 3\%$\sim$5\% higher than the most advanced method. The codes will be available from the website: https://github.com/YuxiangZhang-BIT/IEEE_TCSVT_BiDA.
中文: 本文提出了一种双向域适应(BiDA)框架,通过三支路变换器提取域不变和域特定特征,显著提升了跨域高光谱图像分类的性能,优于现有先进方法。
English: This paper introduces a Bi-directional Domain Adaptation (BiDA) framework that uses a triple-branch transformer to enhance cross-domain hyperspectral image classification by extracting domain-invariant and domain-specific features, achieving superior performance over existing methods.
Authors:Zihao Li, Chao Yang, Tong Zhang, Yakun Chen, Xianzhi Wang, Guandong Xu, Daoyi Dong
Abstract:
Preference alignment has achieved greater success on Large Language Models (LLMs) and drawn broad interest in recommendation research. Existing preference alignment methods for recommendation either require explicit reward modeling or only support pairwise preference comparison. The former directly increases substantial computational costs, while the latter hinders training efficiency on negative samples. Moreover, no existing effort has explored preference alignment solutions for tail-item recommendation. To bridge the above gaps, we propose LPO4Rec, which extends the Bradley-Terry model from pairwise comparison to listwise comparison, to improve the efficiency of model training. Specifically, we derive a closed form optimal policy to enable more efficient and effective training without explicit reward modeling. We also present an adaptive negative sampling and reweighting strategy to prioritize tail items during optimization and enhance performance in tail-item recommendations. Besides, we theoretically prove that optimizing the listwise preference optimization (LPO) loss is equivalent to maximizing the upper bound of the optimal reward. Our experiments on three public datasets show that our method outperforms 10 baselines by a large margin, achieving up to 50% performance improvement while reducing 17.9% GPU memory usage when compared with direct preference optimization (DPO) in tail-item recommendation. Our code is available at https://github.com/Yuhanleeee/LPO4Rec.
中文摘要:提出的LPO4Rec方法通过将成对比较扩展为列表比较,在无需显式奖励建模的情况下提升推荐系统训练效率,并采用自适应负采样策略显著改善长尾物品推荐效果。
English Summary: The proposed LPO4Rec method introduces listwise preference optimization for recommendation systems, eliminating the need for explicit reward modeling while improving training efficiency and tail-item recommendation performance through adaptive sampling strategies.
Authors:Minghao Ning, Yufeng Yang, Keqi Shu, Shucheng Huang, Jiaming Zhong, Maryam Salehi, Mahdi Rahmani, Yukun Lu, Chen Sun, Aladdin Saleh, Ehsan Hashemi, Amir Khajepour
Abstract:
We present CoInfra, a large-scale cooperative infrastructure perception system and dataset designed to advance robust multi-agent perception under real-world and adverse weather conditions. The CoInfra system includes 14 fully synchronized sensor nodes, each equipped with dual RGB cameras and a LiDAR, deployed across a shared region and operating continuously to capture all traffic participants in real-time. A robust, delay-aware synchronization protocol and a scalable system architecture that supports real-time data fusion, OTA management, and remote monitoring are provided in this paper. On the other hand, the dataset was collected in different weather scenarios, including sunny, rainy, freezing rain, and heavy snow and includes 195k LiDAR frames and 390k camera images from 8 infrastructure nodes that are globally time-aligned and spatially calibrated. Furthermore, comprehensive 3D bounding box annotations for five object classes (i.e., car, bus, truck, person, and bicycle) are provided in both global and individual node frames, along with high-definition maps for contextual understanding. Baseline experiments demonstrate the trade-offs between early and late fusion strategies, the significant benefits of HD map integration are discussed. By openly releasing our dataset, codebase, and system documentation at https://github.com/NingMingHao/CoInfra, we aim to enable reproducible research and drive progress in infrastructure-supported autonomous driving, particularly in challenging, real-world settings.
中文: CoInfra提出了一个大规模协同基础设施感知系统与数据集,通过部署同步多传感器节点并采集多天气场景数据,旨在推动恶劣天气下多智能体感知研究,促进自动驾驶技术发展。
English: CoInfra introduces a large-scale cooperative infrastructure perception system and dataset for robust multi-agent perception in adverse weather, featuring synchronized multi-sensor nodes and comprehensive annotations to advance autonomous driving research.
Authors:Steven Song, Anirudh Subramanyam, Zhenyu Zhang, Aarti Venkat, Robert L. Grossman
Abstract:
The Genomic Data Commons (GDC) provides access to high quality, harmonized cancer genomics data through a unified curation and analysis platform centered around patient cohorts. While GDC users can interactively create complex cohorts through the graphical Cohort Builder, users (especially new ones) may struggle to find specific cohort descriptors across hundreds of possible fields and properties. However, users may be better able to describe their desired cohort in free-text natural language. We introduce GDC Cohort Copilot, an open-source copilot tool for curating cohorts from the GDC. GDC Cohort Copilot automatically generates the GDC cohort filter corresponding to a user-input natural language description of their desired cohort, before exporting the cohort back to the GDC for further analysis. An interactive user interface allows users to further refine the generated cohort. We develop and evaluate multiple large language models (LLMs) for GDC Cohort Copilot and demonstrate that our locally-served, open-source GDC Cohort LLM achieves better results than GPT-4o prompting in generating GDC cohorts. We implement and share GDC Cohort Copilot as a containerized Gradio app on HuggingFace Spaces, available at https://huggingface.co/spaces/uc-ctds/GDC-Cohort-Copilot. GDC Cohort LLM weights are available at https://huggingface.co/uc-ctds. All source code is available at https://github.com/uc-cdis/gdc-cohort-copilot.
中文: GDC Cohort Copilot 是一款开源工具,通过本地部署的大型语言模型将用户自然语言描述自动转化为精准的基因组数据共享平台队列筛选条件,其性能优于GPT-4o,并配备交互式界面供用户进一步优化队列结果。
English: The GDC Cohort Copilot is an open-source tool that uses locally-served large language models to convert natural language descriptions into precise cohort filters for the Genomic Data Commons, outperforming GPT-4o in accuracy and enabling interactive refinement through a user-friendly interface.
Authors:Xiao Wang, Jingtao Jiang, Qiang Chen, Lan Chen, Lin Zhu, Yaowei Wang, Yonghong Tian, Jin Tang
Abstract:
Event stream based scene text recognition is a newly arising research topic in recent years which performs better than the widely used RGB cameras in extremely challenging scenarios, especially the low illumination, fast motion. Existing works either adopt end-to-end encoder-decoder framework or large language models for enhanced recognition, however, they are still limited by the challenges of insufficient interpretability and weak contextual logical reasoning. In this work, we propose a novel chain-of-thought reasoning based event stream scene text recognition framework, termed ESTR-CoT. Specifically, we first adopt the vision encoder EVA-CLIP (ViT-G/14) to transform the input event stream into tokens and utilize a Llama tokenizer to encode the given generation prompt. A Q-former is used to align the vision token to the pre-trained large language model Vicuna-7B and output both the answer and chain-of-thought (CoT) reasoning process simultaneously. Our framework can be optimized using supervised fine-tuning in an end-to-end manner. In addition, we also propose a large-scale CoT dataset to train our framework via a three stage processing (i.e., generation, polish, and expert verification). This dataset provides a solid data foundation for the development of subsequent reasoning-based large models. Extensive experiments on three event stream STR benchmark datasets (i.e., EventSTR, WordArt*, IC15*) fully validated the effectiveness and interpretability of our proposed framework. The source code and pre-trained models will be released on https://github.com/Event-AHU/ESTR-CoT.
中文: 本文提出了一种新颖的事件流场景文本识别框架ESTR-CoT,通过引入思维链推理机制提升了解释能力和上下文逻辑,在多个基准数据集上的实验充分验证了其有效性。
English: This paper introduces ESTR-CoT, a novel event stream scene text recognition framework that integrates chain-of-thought reasoning to enhance interpretability and contextual logic, validated by extensive experiments on benchmark datasets.
Authors:Wenquan Lu, Yuechuan Yang, Kyle Lee, Yanshu Li, Enqi Liu
Abstract:
Chain-of-thought (CoT) reasoning has enabled transformer-based language models to excel at complex mathematics and multi-step planning. However, in standard decoder-only architectures, these reasoning steps are externalized in natural language, improving interpretability at the cost of efficiency. To capture reasoning that is not easily represented in words, many works have explored recurrent architectures that aim to internalize reasoning in latent space, potentially supporting latent CoT. In this paper, we investigate whether such reasoning structures emerge in Huginn-3.5B, a depth-recurrent Transformer that reuses layers at inference time without increasing parameter count. We examine the model's internal behavior on arithmetic tasks using a suite of probing techniques including the Logit Lens and Coda Lens. Our findings reveal limited evidence of interpretable latent CoT by tracking rank trajectories of final and intermediate result tokens. Furthermore, we uncover significant probing inconsistencies across recurrent blocks, where the interpretability of hidden states depends heavily on both the layer index and the decoding method. Finally, we empirically show that increasing recurrence depth yields only marginal gains and falls well short of models that explicitly externalize reasoning steps. The code is available at https://github.com/wenquanlu/huginn-latent-cot.
中文: 本研究探究深度循环Transformer模型Huginn-3.5B在算术任务中是否形成潜在思维链推理,发现可解释证据有限且增加循环深度仅带来微弱性能提升,远不及显式推理模型。
English: This study investigates whether Huginn-3.5B, a depth-recurrent Transformer, develops latent chain-of-thought reasoning during arithmetic tasks, finding limited interpretable evidence and only marginal performance gains from increased recurrence depth compared to explicit reasoning models.
Authors:Wenquan Lu, Yuechuan Yang, Kyle Lee, Yanshu Li, Enqi Liu
Abstract:
Chain-of-thought (CoT) reasoning has enabled transformer-based language models to excel at complex mathematics and multi-step planning. However, in standard decoder-only architectures, these reasoning steps are externalized in natural language, improving interpretability at the cost of efficiency. To capture reasoning that is not easily represented in words, many works have explored recurrent architectures that aim to internalize reasoning in latent space, potentially supporting latent CoT. In this paper, we investigate whether such reasoning structures emerge in Huginn-3.5B, a depth-recurrent Transformer that reuses layers at inference time without increasing parameter count. We examine the model's internal behavior on arithmetic tasks using a suite of probing techniques including the Logit Lens and Coda Lens. Our findings reveal limited evidence of interpretable latent CoT by tracking rank trajectories of final and intermediate result tokens. Furthermore, we uncover significant probing inconsistencies across recurrent blocks, where the interpretability of hidden states depends heavily on both the layer index and the decoding method. Finally, we empirically show that increasing recurrence depth yields only marginal gains and falls well short of models that explicitly externalize reasoning steps. The code is available at https://github.com/wenquanlu/huginn-latent-cot.
中文: 本研究探究深度循环Transformer模型Huginn-3.5B在算术任务中是否形成潜在思维链推理,发现可解释证据有限且增加循环深度仅带来微弱性能提升,远不及显式推理模型。
English: This study investigates whether Huginn-3.5B, a depth-recurrent Transformer, develops latent chain-of-thought reasoning during arithmetic tasks, finding limited interpretable evidence and only marginal performance gains from increased recurrence depth compared to explicit reasoning models.
Authors:Tuo Wang, Jian Kang, Yujun Yan, Adithya Kulkarni, Dawei Zhou
Abstract:
Conformal prediction for graph neural networks (GNNs) offers a promising framework for quantifying uncertainty, enhancing GNN reliability in high-stakes applications. However, existing methods predominantly focus on static graphs, neglecting the evolving nature of real-world graphs. Temporal dependencies in graph structure, node attributes, and ground truth labels violate the fundamental exchangeability assumption of standard conformal prediction methods, limiting their applicability. To address these challenges, in this paper, we introduce NCPNET, a novel end-to-end conformal prediction framework tailored for temporal graphs. Our approach extends conformal prediction to dynamic settings, mitigating statistical coverage violations induced by temporal dependencies. To achieve this, we propose a diffusion-based non-conformity score that captures both topological and temporal uncertainties within evolving networks. Additionally, we develop an efficiency-aware optimization algorithm that improves the conformal prediction process, enhancing computational efficiency and reducing coverage violations. Extensive experiments on diverse real-world temporal graphs, including WIKI, REDDIT, DBLP, and IBM Anti-Money Laundering dataset, demonstrate NCPNET's capability to ensure guaranteed coverage in temporal graphs, achieving up to a 31% reduction in prediction set size on the WIKI dataset, significantly improving efficiency compared to state-of-the-art methods. Our data and code are available at https://github.com/ODYSSEYWT/NCPNET.
中文: NCPNET提出了一种面向时序图的全流程保形预测框架,通过基于扩散的非共形分数和效率优化解决数据可交换性失效问题,在保证覆盖度的同时将预测集尺寸缩减最高31%。
English: NCPNET introduces an end-to-end conformal prediction framework for temporal graphs, addressing exchangeability violations through diffusion-based non-conformity scores and efficiency optimization, achieving up to 31% smaller prediction sets with guaranteed coverage.
Authors:Shikai Qiu, Lechao Xiao, Andrew Gordon Wilson, Jeffrey Pennington, Atish Agarwala
Abstract:
What scaling limits govern neural network training dynamics when model size and training time grow in tandem? We show that despite the complex interactions between architecture, training algorithms, and data, compute-optimally trained models exhibit a remarkably precise universality. Specifically, loss curves from models of varying sizes collapse onto a single universal curve when training compute and loss are normalized to unity at the end of training. With learning rate decay, the collapse becomes so tight that differences in the normalized curves across models fall below the noise floor of individual loss curves across random seeds, a phenomenon we term supercollapse. We observe supercollapse across learning rate schedules, datasets, and architectures, including transformers trained on next-token prediction, and find it breaks down when hyperparameters are scaled suboptimally, providing a precise and practical indicator of good scaling. We explain these phenomena by connecting collapse to the power-law structure in typical neural scaling laws, and analyzing a simple yet surprisingly effective model of SGD noise dynamics that accurately predicts loss curves across various learning rate schedules and quantitatively explains the origin of supercollapse.
中文:当模型规模与训练时间按最优方式扩展时,神经网络的损失曲线会收敛到统一的标准化轨迹;在使用学习率衰减时会出现"超级收敛"现象,此时不同模型间的差异会低于随机噪声水平。
English: When model size and training time are scaled optimally, neural networks exhibit universal loss curves that collapse onto a single normalized trajectory, with supercollapse occurring under learning rate decay where differences fall below random noise levels.
Authors:Dan Vanderkam
Abstract:
Finding all the words on a Boggle board is a classic computer programming problem. With a fast Boggle solver, local optimization techniques such as hillclimbing and simulated annealing can be used to find particularly high-scoring boards. The sheer number of possible Boggle boards has historically prevented an exhaustive search for the global optimum board. We apply Branch and Bound and a decision diagram-like data structure to perform the first such search. We find that the highest-scoring boards found via hillclimbing are, in fact, the global optima.
中文: 本研究首次运用分支定界法和决策图结构对Boggle棋盘进行穷举搜索,证实了通过爬山法找到的高分棋盘即为全局最优解。
English: This study pioneers the use of Branch and Bound with a decision diagram structure to exhaustively search for the highest-scoring Boggle board, confirming that hillclimbing-identified boards are indeed global optima.
Authors:Kehinde Ajayi, Yi He, Jian Wu
Abstract:
Table structure recognition (TSR) and optical character recognition (OCR) play crucial roles in extracting structured data from tables in scientific documents. However, existing extraction frameworks built on top of TSR and OCR methods often fail to quantify the uncertainties of extracted results. To obtain highly accurate data for scientific domains, all extracted data must be manually verified, which can be time-consuming and labor-intensive. We propose a framework that performs uncertainty-aware data extraction for complex scientific tables, built on conformal prediction, a model-agnostic method for uncertainty quantification (UQ). We explored various uncertainty scoring methods to aggregate the uncertainties introduced by TSR and OCR. We rigorously evaluated the framework using a standard benchmark and an in-house dataset consisting of complex scientific tables in six scientific domains. The results demonstrate the effectiveness of using UQ for extraction error detection, and by manually verifying only 47% of extraction results, the data quality can be improved by 30%. Our work quantitatively demonstrates the role of UQ with the potential of improving the efficiency in the human-machine cooperation process to obtain scientifically usable data from complex tables in scientific documents. All code and data are available on GitHub at https://github.com/lamps-lab/TSR-OCR-UQ/tree/main.
中文: 该框架采用保形预测方法量化表格结构识别和光学字符识别的不确定性,仅需人工验证47%的提取结果即可提升30%的数据质量,有效优化了科学表格数据提取中的人机协作效率。
English: The proposed framework uses conformal prediction to quantify uncertainties in table structure recognition and optical character recognition, enabling efficient error detection and improving data quality by 30% with manual verification of only 47% of extraction results.
Authors:Ziyi Yang, Guangyu Hu, Mingkai Miao, Changyuan Yu, Hongce Zhang
Abstract:
SAT sweeping has long been a cornerstone technique in logic simplification and equivalence checking at the bit level, leveraging structural hashing, simulation and SAT solving to prune redundant logic. However, with the growing adoption of word-level constructs in hardware verification, such as bit-vector operations, arithmetics and arrays, there lacks a counterpart of SAT sweeping at the word level. In this paper, we introduce SMT-Sweep, a novel extension of SAT sweeping into the word level, grounded in Satisfiability Modulo Theories (SMT). SMT-Sweep takes advantage of simulation and equivalence detection to handle SMT terms with rich bit-vector operations and array semantics. Our framework incorporates both randomized and constraint-driven word-level simulation tailored to symbolic expressions and operator semantics beyond pure Boolean logic. Experimental results show that SMT-Sweep achieves significant speed-up compared to state-of-the-art bit-level SAT sweeping and word-level monolithic SMT solving (averaging around 44x and 69x, respectively).To the best of our knowledge, this is the first work that brings sweeping techniques to SMT-based hardware verification. The implementation is open-sourced at: https://github.com/yangziyiiii/SMT-Sweep.
中文: 本文提出SMT-Sweep,作为SAT扫描技术的创新性字级扩展,通过可满足性模理论处理复杂的硬件验证结构(如位向量操作和数组),相比现有方法实现了显著的加速效果。
English: This paper introduces SMT-Sweep, a novel word-level extension of SAT sweeping that utilizes Satisfiability Modulo Theories to efficiently handle complex hardware verification constructs like bit-vector operations and arrays, achieving significant speed improvements over existing methods.
Authors:Yongsen Zheng, Zongxuan Xie, Guohua Wang, Ziyao Liu, Liang Lin, Kwok-Yan Lam
Abstract:
Unfairness is a well-known challenge in Recommender Systems (RSs), often resulting in biased outcomes that disadvantage users or items based on attributes such as gender, race, age, or popularity. Although some approaches have started to improve fairness recommendation in offline or static contexts, the issue of unfairness often exacerbates over time, leading to significant problems like the Matthew effect, filter bubbles, and echo chambers. To address these challenges, we proposed a novel framework, Hypergraph Contrastive Multi-Interest Learning for Fair Conversational Recommender System (HyFairCRS), aiming to promote multi-interest diversity fairness in dynamic and interactive Conversational Recommender Systems (CRSs). HyFairCRS first captures a wide range of user interests by establishing diverse hypergraphs through contrastive learning. These interests are then utilized in conversations to generate informative responses and ensure fair item predictions within the dynamic user-system feedback loop. Experiments on two CRS-based datasets show that HyFairCRS achieves a new state-of-the-art performance while effectively alleviating unfairness. Our code is available at https://github.com/zysensmile/HyFairCRS.
Chinese: HyFairCRS框架通过超图对比学习捕捉多样化用户兴趣,有效缓解动态交互中推荐系统的不公平问题,并在实验中实现了最优性能表现。
English: The HyFairCRS framework addresses unfairness in conversational recommender systems by using hypergraph contrastive learning to capture diverse user interests, achieving state-of-the-art performance while mitigating bias in dynamic interactions.
Authors:Zixiao Wang, Yuxin Wang, Xiaorui Wang, Mengting Xing, Jie Gao, Jianjun Xu, Guangcan Liu, Chenhui Jin, Zhuo Wang, Shengzhuo Zhang, Hongtao Xie
Abstract:
We introduce our first reflective generative model MetaStone-S1, which obtains OpenAI o3-mini's performance via the new Reflective Generative Form. The new form focuses on high-quality reasoning trajectory selection and contains two novelties: 1) A unified interface for policy and process reward model: we share the backbone network and use task-specific heads for reasoning trajectory predicting and scoring respectively, introducing only 53M extra parameters for trajectory scoring. 2) Eliminating the reliance on process-level annotation: we provide a self-supervised process reward model, which can directly learn the high-quality reasoning trajectory selection from the outcome reward. Equipped with the reflective generative form, MetaStone-S1 is naturally suitable for test-time scaling, and we provide three reasoning effort modes (low, medium, and high) based on the controllable thinking length. Experiments demonstrate that our MetaStone-S1 achieves comparable performance to OpenAI o3-mini's series with only 32B parameter size. To support the research community, we have open-sourced MetaStone-S1 at https://github.com/MetaStone-AI/MetaStone-S1.
中文: 我们推出MetaStone-S1反射生成模型,通过专注高质量推理轨迹选择的新形式,以仅32B参数量实现与OpenAI o3-mini相当的性能,无需过程级标注且额外参数极少,并已开源供研究社区使用。
English: We introduce MetaStone-S1, a reflective generative model that matches OpenAI o3-mini's performance through a novel form emphasizing high-quality reasoning trajectory selection with minimal added parameters and no process-level annotation dependency, achieving comparable results with only 32B parameters.
Authors:Kwai Keye Team, Biao Yang, Bin Wen, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, Fan Yang, Guorui Zhou, Hao Peng, Haojie Ding, Jiaming Huang, Jiangxia Cao, Jiankang Chen, Jingyun Hua, Jin Ouyang, Kaibing Chen, Kaiyu Jiang, Kaiyu Tang, Kun Gai, Shengnan Zhang, Siyang Mao, Sui Huang, Tianke Zhang, Tingting Gao, Wei Chen, Wei Yuan, Xiangyu Wu, Xiao Hu, Xingyu Lu, Yang Zhou, Yi-Fan Zhang, Yiping Yang, Yulong Chen, Zhenhua Wu, Zhenyu Li, Zhixin Ling, Ziming Li, Dehua Ma, Di Xu, Haixuan Gao, Hang Li, Jiawei Guo, Jing Wang, Lejian Ren, Muhao Wei, Qianqian Wang, Qigen Hu, Shiyao Wang, Tao Yu, Xinchen Luo, Yan Li, Yiming Liang, Yuhang Hu, Zeyi Lu, Zhuoran Yang, Zixing Zhang
Abstract:
While Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities on static images, they often fall short in comprehending dynamic, information-dense short-form videos, a dominant medium in today's digital landscape. To bridge this gap, we introduce \textbf{Kwai Keye-VL}, an 8-billion-parameter multimodal foundation model engineered for leading-edge performance in short-video understanding while maintaining robust general-purpose vision-language abilities. The development of Keye-VL rests on two core pillars: a massive, high-quality dataset exceeding 600 billion tokens with a strong emphasis on video, and an innovative training recipe. This recipe features a four-stage pre-training process for solid vision-language alignment, followed by a meticulous two-phase post-training process. The first post-training stage enhances foundational capabilities like instruction following, while the second phase focuses on stimulating advanced reasoning. In this second phase, a key innovation is our five-mode ``cold-start'' data mixture, which includes ``thinking'', ``non-thinking'', ``auto-think'', ``think with image'', and high-quality video data. This mixture teaches the model to decide when and how to reason. Subsequent reinforcement learning (RL) and alignment steps further enhance these reasoning capabilities and correct abnormal model behaviors, such as repetitive outputs. To validate our approach, we conduct extensive evaluations, showing that Keye-VL achieves state-of-the-art results on public video benchmarks and remains highly competitive on general image-based tasks (Figure 1). Furthermore, we develop and release the \textbf{KC-MMBench}, a new benchmark tailored for real-world short-video scenarios, where Keye-VL shows a significant advantage.
Chinese: Kwai Keye-VL 是一个拥有80亿参数的多模态基础模型,通过四阶段预训练和两阶段后训练的创新方法,在短视频理解方面实现领先性能,同时在通用视觉语言任务中保持强大竞争力。
English: Kwai Keye-VL is an 8-billion-parameter multimodal model designed to excel in understanding dynamic short videos through a four-stage pre-training and two-phase post-training process, achieving state-of-the-art results on video benchmarks while maintaining strong general vision-language capabilities.
Authors:Yiming Ju, Jijin Hu, Zhengxiong Luo, Haoge Deng, hanyu Zhao, Li Du, Chengwei Wu, Donglin Hao, Xinlong Wang, Tengfei Pan
Abstract:
Text-to-video (T2V) generation has recently attracted considerable attention, resulting in the development of numerous high-quality datasets that have propelled progress in this area. However, existing public datasets are primarily composed of isolated text-video (T-V) pairs and thus fail to support the modeling of coherent multi-clip video sequences. To address this limitation, we introduce CI-VID, a dataset that moves beyond isolated text-to-video (T2V) generation toward text-and-video-to-video (TV2V) generation, enabling models to produce coherent, multi-scene video sequences. CI-VID contains over 340,000 samples, each featuring a coherent sequence of video clips with text captions that capture both the individual content of each clip and the transitions between them, enabling visually and textually grounded generation. To further validate the effectiveness of CI-VID, we design a comprehensive, multi-dimensional benchmark incorporating human evaluation, VLM-based assessment, and similarity-based metrics. Experimental results demonstrate that models trained on CI-VID exhibit significant improvements in both accuracy and content consistency when generating video sequences. This facilitates the creation of story-driven content with smooth visual transitions and strong temporal coherence, underscoring the quality and practical utility of the CI-VID dataset We release the CI-VID dataset and the accompanying code for data construction and evaluation at: https://github.com/ymju-BAAI/CI-VID
Chinese: CI-VID数据集通过提供34万多个带详细描述的多片段连贯视频序列,推动了文本到视频生成的发展,使模型能够生成具有更高准确性和时序一致性的故事驱动视频内容。
English: The CI-VID dataset advances text-to-video generation by providing over 340,000 coherent multi-clip sequences with detailed captions, enabling models to produce story-driven videos with improved accuracy and temporal consistency.
Authors:Zhentan Zheng
Abstract:
Deep neural networks have achieved remarkable results in computer vision tasks. In the early days, Convolutional Neural Networks (CNNs) were the mainstream architecture. In recent years, Vision Transformers (ViTs) have become increasingly popular. In addition, exploring applications of multi-layer perceptrons (MLPs) has provided new perspectives for research into vision model architectures. In this paper, we present evMLP accompanied by a simple event-driven local update mechanism. The proposed evMLP can independently process patches on images or feature maps via MLPs. We define changes between consecutive frames as "events". Under the event-driven local update mechanism, evMLP selectively processes patches where events occur. For sequential image data (e.g., video processing), this approach improves computational performance by avoiding redundant computations. Through ImageNet image classification experiments, evMLP attains accuracy competitive with state-of-the-art models. More significantly, experimental results on multiple video datasets demonstrate that evMLP reduces computational cost via its event-driven local update mechanism while maintaining output consistency with its non-event-driven baseline. The code and trained models are available at https://github.com/i-evi/evMLP.
中文: 本文提出evMLP模型,采用事件驱动的多层感知机架构,通过选择性处理帧间变化图像块,在ImageNet分类中达到先进精度,并在视频任务中通过局部更新机制显著降低计算开销。
English: The paper introduces evMLP, an event-driven multi-layer perceptron model that selectively processes image patches with changes between frames, achieving competitive accuracy on ImageNet while reducing computational costs in video processing tasks.
Authors:Qiguang Chen, Mingda Yang, Libo Qin, Jinhao Liu, Zheng Yan, Jiannan Guan, Dengyun Peng, Yiyan Ji, Hanjing Li, Mengkang Hu, Yimeng Zhang, Yihao Liang, Yuhang Zhou, Jiaqi Wang, Zhi Chen, Wanxiang Che
Abstract:
Recent advancements in artificial intelligence (AI), particularly in large language models (LLMs) such as OpenAI-o1 and DeepSeek-R1, have demonstrated remarkable capabilities in complex domains such as logical reasoning and experimental coding. Motivated by these advancements, numerous studies have explored the application of AI in the innovation process, particularly in the context of scientific research. These AI technologies primarily aim to develop systems that can autonomously conduct research processes across a wide range of scientific disciplines. Despite these significant strides, a comprehensive survey on AI for Research (AI4Research) remains absent, which hampers our understanding and impedes further development in this field. To address this gap, we present a comprehensive survey and offer a unified perspective on AI4Research. Specifically, the main contributions of our work are as follows: (1) Systematic taxonomy: We first introduce a systematic taxonomy to classify five mainstream tasks in AI4Research. (2) New frontiers: Then, we identify key research gaps and highlight promising future directions, focusing on the rigor and scalability of automated experiments, as well as the societal impact. (3) Abundant applications and resources: Finally, we compile a wealth of resources, including relevant multidisciplinary applications, data corpora, and tools. We hope our work will provide the research community with quick access to these resources and stimulate innovative breakthroughs in AI4Research.
人工智能在大型语言模型等领域的最新进展已能支持自主科研,但缺乏全面综述阻碍了发展;本研究通过提出系统分类法、指明新方向并整合丰富资源,填补了这一空白,旨在推动科研人工智能的创新突破。
Recent advances in AI, especially in large language models, have enabled autonomous scientific research, but the lack of a comprehensive survey hinders progress; this work fills that gap by providing a systematic taxonomy, identifying new frontiers, and compiling abundant resources to foster innovation in AI for Research.
Authors:Kunlun Xu, Fan Zhuo, Jiangmeng Li, Xu Zou, Jiahuan Zhou
Abstract:
Current lifelong person re-identification (LReID) methods predominantly rely on fully labeled data streams. However, in real-world scenarios where annotation resources are limited, a vast amount of unlabeled data coexists with scarce labeled samples, leading to the Semi-Supervised LReID (Semi-LReID) problem where LReID methods suffer severe performance degradation. Existing LReID methods, even when combined with semi-supervised strategies, suffer from limited long-term adaptation performance due to struggling with the noisy knowledge occurring during unlabeled data utilization. In this paper, we pioneer the investigation of Semi-LReID, introducing a novel Self-Reinforcing Prototype Evolution with Dual-Knowledge Cooperation framework (SPRED). Our key innovation lies in establishing a self-reinforcing cycle between dynamic prototype-guided pseudo-label generation and new-old knowledge collaborative purification to enhance the utilization of unlabeled data. Specifically, learnable identity prototypes are introduced to dynamically capture the identity distributions and generate high-quality pseudo-labels. Then, the dual-knowledge cooperation scheme integrates current model specialization and historical model generalization, refining noisy pseudo-labels. Through this cyclic design, reliable pseudo-labels are progressively mined to improve current-stage learning and ensure positive knowledge propagation over long-term learning. Experiments on the established Semi-LReID benchmarks show that our SPRED achieves state-of-the-art performance. Our source code is available at https://github.com/zhoujiahuan1991/ICCV2025-SPRED
中文: 本文针对半监督终身行人重识别问题,提出SPRED框架,通过动态原型引导的伪标签生成与新旧知识协同净化的自我增强循环,在基准测试中取得了最优性能。
English: This paper introduces a novel framework called SPRED to address the Semi-Supervised Lifelong Person Re-Identification problem by establishing a self-reinforcing cycle between dynamic prototype-guided pseudo-label generation and dual-knowledge purification, achieving state-of-the-art performance on benchmarks.
Authors:Samridhi Raj Sinha, Rajvee Sheth, Abhishek Upperwal, Mayank Singh
Abstract:
The rapid advancement of Large Language Models (LLMs) has intensified the need for evaluation frameworks that address the requirements of linguistically diverse regions, such as India, and go beyond English-centric benchmarks. We introduce EKA-EVAL, a unified evaluation framework that integrates over 35+ benchmarks (including 10 Indic benchmarks) across nine major evaluation categories. The framework provides broader coverage than existing Indian language evaluation tools, offering 11 core capabilities through a modular architecture, seamless integration with Hugging Face and proprietary models, and plug-and-play usability. As the first end-to-end suite for scalable, multilingual LLM benchmarking, the framework combines extensive benchmarks, modular workflows, and dedicated support for low-resource Indian languages to enable inclusive assessment of LLM capabilities across diverse domains. We conducted extensive comparisons against five existing baselines, demonstrating that EKA-EVAL achieves the highest participant ratings in four out of five categories. The framework is open-source and publicly available at: https://github.com/lingo-iitgn/eka-eval.
中文:EKA-EVAL是一个统一的开源评估框架,整合了35个以上基准(含10个印度语言基准),通过模块化架构提供多语言大模型的可扩展评估,在多数类别中优于现有基线。
English: EKA-EVAL is a comprehensive, open-source evaluation framework that integrates over 35 benchmarks, including 10 Indic ones, to provide scalable multilingual assessment of Large Language Models, outperforming existing baselines in most categories.
Authors:Hailong Yan, Ao Li, Xiangtao Zhang, Zhe Liu, Zenglin Shi, Ce Zhu, Le Zhang
Abstract:
Recent advancements in deep neural networks have driven significant progress in image enhancement (IE). However, deploying deep learning models on resource-constrained platforms, such as mobile devices, remains challenging due to high computation and memory demands. To address these challenges and facilitate real-time IE on mobile, we introduce an extremely lightweight Convolutional Neural Network (CNN) framework with around 4K parameters. Our approach integrates reparameterization with an Incremental Weight Optimization strategy to ensure efficiency. Additionally, we enhance performance with a Feature Self-Transform module and a Hierarchical Dual-Path Attention mechanism, optimized with a Local Variance-Weighted loss. With this efficient framework, we are the first to achieve real-time IE inference at up to 1,100 frames per second (FPS) while delivering competitive image quality, achieving the best trade-off between speed and performance across multiple IE tasks. The code will be available at https://github.com/AVC2-UESTC/MobileIE.git.
中文摘要:本文提出了一种仅含约4K参数的极轻量CNN框架,通过创新的优化技术,在移动设备上实现了高达1100 FPS的实时图像增强,同时保持了有竞争力的图像质量。
English Summary: This paper introduces an extremely lightweight CNN framework with only around 4K parameters, achieving real-time image enhancement at up to 1,100 FPS on mobile devices while maintaining competitive image quality through innovative optimization techniques.
Authors:Tyler Ward, Meredith K. Owen, O'Kira Coleman, Brian Noehren, Abdullah-Al-Zubaer Imran
Abstract:
Medical image segmentation is a key task in the imaging workflow, influencing many image-based decisions. Traditional, fully-supervised segmentation models rely on large amounts of labeled training data, typically obtained through manual annotation, which can be an expensive, time-consuming, and error-prone process. This signals a need for accurate, automatic, and annotation-efficient methods of training these models. We propose ADA-SAM (automated, domain-specific, and adaptive segment anything model), a novel multitask learning framework for medical image segmentation that leverages class activation maps from an auxiliary classifier to guide the predictions of the semi-supervised segmentation branch, which is based on the Segment Anything (SAM) framework. Additionally, our ADA-SAM model employs a novel gradient feedback mechanism to create a learnable connection between the segmentation and classification branches by using the segmentation gradients to guide and improve the classification predictions. We validate ADA-SAM on real-world clinical data collected during rehabilitation trials, and demonstrate that our proposed method outperforms both fully-supervised and semi-supervised baselines by double digits in limited label settings. Our code is available at: https://github.com/tbwa233/ADA-SAM.
Chinese: ADA-SAM是一种自动化医学图像分割框架,通过整合类别激活图和梯度反馈机制,在有限标注条件下显著提升了分割精度,性能远超现有基准方法。
English: ADA-SAM is an automated medical image segmentation framework that integrates class activation maps and a gradient feedback mechanism to enhance accuracy in limited-label settings, significantly outperforming existing methods.
Authors:Dmytro Kuzmenko, Nadiya Shvai
Abstract:
We present a novel approach to knowledge transfer in model-based reinforcement learning, addressing the critical challenge of deploying large world models in resource-constrained environments. Our method efficiently distills a high-capacity multi-task agent (317M parameters) into a compact model (1M parameters) on the MT30 benchmark, significantly improving performance across diverse tasks. Our distilled model achieves a state-of-the-art normalized score of 28.45, surpassing the original 1M parameter model score of 18.93. This improvement demonstrates the ability of our distillation technique to capture and consolidate complex multi-task knowledge. We further optimize the distilled model through FP16 post-training quantization, reducing its size by $\sim$50\%. Our approach addresses practical deployment limitations and offers insights into knowledge representation in large world models, paving the way for more efficient and accessible multi-task reinforcement learning systems in robotics and other resource-constrained applications. Code available at https://github.com/dmytro-kuzmenko/td-mpc-opt.
中文: 我们提出了一种新颖的知识蒸馏方法,将大型多任务智能体高效压缩为紧凑模型,在资源受限环境中实现了最先进的性能,并将模型大小减少了50%。
English: We introduce a novel knowledge distillation method that effectively compresses a large multi-task agent into a compact model, achieving state-of-the-art performance and 50% size reduction for deployment in resource-limited settings.
Authors:Shengli Zhou, Jianuo Zhu, Qilin Huang, Fangjing Wang, Yanfu Zhang, Feng Zheng
Abstract:
3D Visual Question-Answering (3D VQA) is pivotal for models to perceive the physical world and perform spatial reasoning. Answer-centric supervision is a commonly used training method for 3D VQA models. Many models that utilize this strategy have achieved promising results in 3D VQA tasks. However, the answer-centric approach only supervises the final output of models and allows models to develop reasoning pathways freely. The absence of supervision on the reasoning pathway enables the potential for developing superficial shortcuts through common patterns in question-answer pairs. Moreover, although slow-thinking methods advance large language models, they suffer from underthinking. To address these issues, we propose \textbf{HCNQA}, a 3D VQA model leveraging a hierarchical concentration narrowing supervision method. By mimicking the human process of gradually focusing from a broad area to specific objects while searching for answers, our method guides the model to perform three phases of concentration narrowing through hierarchical supervision. By supervising key checkpoints on a general reasoning pathway, our method can ensure the development of a rational and effective reasoning pathway. Extensive experimental results demonstrate that our method can effectively ensure that the model develops a rational reasoning pathway and performs better. The code is available at https://github.com/JianuoZhu/HCNQA.
中文: HCNQA模型采用分层聚焦监督方法,通过模拟人类逐步聚焦的推理过程来引导3D视觉问答模型建立合理推理路径,从而提升性能表现。
English: The proposed HCNQA model introduces hierarchical concentration narrowing supervision to guide 3D VQA models through a structured reasoning pathway, effectively preventing superficial shortcuts and enhancing performance.
Authors:Tianze Hua, Tian Yun, Ellie Pavlick
Abstract:
AI models are increasingly required to be multimodal, integrating disparate input streams into a coherent state representation on which subsequent behaviors and actions can be based. This paper seeks to understand how such models behave when input streams present conflicting information. Focusing specifically on vision-language models, we provide inconsistent inputs (e.g., an image of a dog paired with the caption "A photo of a cat") and ask the model to report the information present in one of the specific modalities (e.g., "What does the caption say / What is in the image?"). We find that models often favor one modality over the other, e.g., reporting the image regardless of what the caption says, but that different models differ in which modality they favor. We find evidence that the behaviorally preferred modality is evident in the internal representational structure of the model, and that specific attention heads can restructure the representations to favor one modality over the other. Moreover, we find modality-agnostic "router heads" which appear to promote answers about the modality requested in the instruction, and which can be manipulated or transferred in order to improve performance across datasets and modalities. Together, the work provides essential steps towards identifying and controlling if and how models detect and resolve conflicting signals within complex multimodal environments.
中文: 本研究探讨了视觉语言模型如何处理图像与文本模态间的冲突输入,发现模型常偏向某一模态,特定注意力头可调控这种偏好,并实现跨模态性能优化。
English: This study investigates how vision-language models handle conflicting inputs between image and text modalities, revealing that models often favor one modality over the other, with specific attention heads influencing this preference and enabling control over modality prioritization.
Authors:Eric Vin, Kyle A. Miller, Daniel J. Fremont
Abstract:
We propose LeanLTL, a unifying framework for linear temporal logics in Lean 4. LeanLTL supports reasoning about traces that represent either infinite or finite linear time. The library allows traditional LTL syntax to be combined with arbitrary Lean expressions, making it straightforward to define properties involving numerical or other types. We prove that standard flavors of LTL can be embedded in our framework. The library also provides automation for reasoning about LeanLTL formulas in a way that facilitates using Lean's existing tactics. Finally, we provide examples illustrating the utility of the library in reasoning about systems that come from applications.
中文: LeanLTL 是 Lean 4 中的一个统一框架,支持对无限和有限线性时序逻辑进行推理,将传统 LTL 语法与 Lean 表达式相结合,并提供公式推理的自动化工具。
English: LeanLTL is a unifying framework in Lean 4 that supports reasoning about both infinite and finite linear temporal logics, combining traditional LTL syntax with Lean expressions and providing automation for formula reasoning.
Authors:Ming Dai, Wenxuan Cheng, Jiang-jiang Liu, Sen Yang, Wenxiao Cai, Yanpeng Sun, Wankou Yang
Abstract:
Referring Image Segmentation (RIS) is a challenging task that aims to segment objects in an image based on natural language expressions. While prior studies have predominantly concentrated on improving vision-language interactions and achieving fine-grained localization, a systematic analysis of the fundamental bottlenecks in existing RIS frameworks remains underexplored. To bridge this gap, we propose DeRIS, a novel framework that decomposes RIS into two key components: perception and cognition. This modular decomposition facilitates a systematic analysis of the primary bottlenecks impeding RIS performance. Our findings reveal that the predominant limitation lies not in perceptual deficiencies, but in the insufficient multi-modal cognitive capacity of current models. To mitigate this, we propose a Loopback Synergy mechanism, which enhances the synergy between the perception and cognition modules, thereby enabling precise segmentation while simultaneously improving robust image-text comprehension. Additionally, we analyze and introduce a simple non-referent sample conversion data augmentation to address the long-tail distribution issue related to target existence judgement in general scenarios. Notably, DeRIS demonstrates inherent adaptability to both non- and multi-referents scenarios without requiring specialized architectural modifications, enhancing its general applicability. The codes and models are available at https://github.com/Dmmm1997/DeRIS.
中文摘要:DeRIS框架通过分解指称图像分割为感知与认知模块,发现当前模型主要瓶颈在于多模态认知能力不足,并提出循环协同机制增强模块间协作,同时引入数据增强方法提升通用性。
English Summary: The DeRIS framework addresses the primary bottleneck in Referring Image Segmentation by identifying insufficient multimodal cognitive capacity as the key limitation and introducing a Loopback Synergy mechanism to enhance perception-cognition integration.
Authors:Xupeng Zhu, Fan Wang, Robin Walters, Jane Shi
Abstract:
Diffusion Policies are effective at learning closed-loop manipulation policies from human demonstrations but generalize poorly to novel arrangements of objects in 3D space, hurting real-world performance. To address this issue, we propose Spherical Diffusion Policy (SDP), an SE(3) equivariant diffusion policy that adapts trajectories according to 3D transformations of the scene. Such equivariance is achieved by embedding the states, actions, and the denoising process in spherical Fourier space. Additionally, we employ novel spherical FiLM layers to condition the action denoising process equivariantly on the scene embeddings. Lastly, we propose a spherical denoising temporal U-net that achieves spatiotemporal equivariance with computational efficiency. In the end, SDP is end-to-end SE(3) equivariant, allowing robust generalization across transformed 3D scenes. SDP demonstrates a large performance improvement over strong baselines in 20 simulation tasks and 5 physical robot tasks including single-arm and bi-manual embodiments. Code is available at https://github.com/amazon-science/Spherical_Diffusion_Policy.
Chinese: 球面扩散策略(SDP)提出了一种SE(3)等变扩散策略,通过在球面傅里叶空间中嵌入状态和动作,有效提升了三维场景变换下的泛化能力,在仿真与实体机器人任务中均实现了显著性能提升。
English: The Spherical Diffusion Policy (SDP) introduces an SE(3) equivariant diffusion policy that enhances generalization to novel 3D object arrangements by embedding states and actions in spherical Fourier space, achieving significant performance gains in both simulation and physical robot tasks.
Authors:Zixin Chen, Hongzhan Lin, Kaixin Li, Ziyang Luo, Zhen Ye, Guang Chen, Zhiyong Huang, Jing Ma
Abstract:
The proliferation of multimodal memes in the social media era demands that multimodal Large Language Models (mLLMs) effectively understand meme harmfulness. Existing benchmarks for assessing mLLMs on harmful meme understanding rely on accuracy-based, model-agnostic evaluations using static datasets. These benchmarks are limited in their ability to provide up-to-date and thorough assessments, as online memes evolve dynamically. To address this, we propose AdamMeme, a flexible, agent-based evaluation framework that adaptively probes the reasoning capabilities of mLLMs in deciphering meme harmfulness. Through multi-agent collaboration, AdamMeme provides comprehensive evaluations by iteratively updating the meme data with challenging samples, thereby exposing specific limitations in how mLLMs interpret harmfulness. Extensive experiments show that our framework systematically reveals the varying performance of different target mLLMs, offering in-depth, fine-grained analyses of model-specific weaknesses. Our code is available at https://github.com/Lbotirx/AdamMeme.
中文: AdamMeme框架作为一种基于智能体的自适应评估工具,通过多智能体协作和迭代更新数据,动态评估多模态大语言模型在识别有害表情包方面的推理能力。
English: The AdamMeme framework is introduced as an adaptive, agent-based evaluation tool that dynamically assesses multimodal large language models' reasoning abilities in detecting harmful memes through iterative updates and multi-agent collaboration.
Authors:Martine Hjelkrem-Tan, Marius Aasan, Gabriel Y. Arteaga, AdÃn RamÃrez Rivera
Abstract:
Vision Transformers naturally accommodate sparsity, yet standard tokenization methods confine features to discrete patch grids. This constraint prevents models from fully exploiting sparse regimes, forcing awkward compromises. We propose Subpixel Placement of Tokens (SPoT), a novel tokenization strategy that positions tokens continuously within images, effectively sidestepping grid-based limitations. With our proposed oracle-guided search, we uncover substantial performance gains achievable with ideal subpixel token positioning, drastically reducing the number of tokens necessary for accurate predictions during inference. SPoT provides a new direction for flexible, efficient, and interpretable ViT architectures, redefining sparsity as a strategic advantage rather than an imposed limitation.
中文总结:提出的子像素标记定位(SPoT)方法通过连续标记定位突破网格限制,利用策略性稀疏优势显著减少标记数量并提升模型性能。
English summary: The proposed Subpixel Placement of Tokens (SPoT) method enables continuous token positioning within images to overcome grid limitations, significantly reducing token requirements while improving performance through strategic sparsity utilization.
Authors:Ghasem Alipoor, Karl Skretting
Abstract:
We propose an efficient online dictionary learning algorithm for kernel-based sparse representations. In this framework, input signals are nonlinearly mapped to a high-dimensional feature space and represented sparsely using a virtual dictionary. At each step, the dictionary is updated recursively using a novel algorithm based on the recursive least squares (RLS) method. This update mechanism works with single samples or mini-batches and maintains low computational complexity. Experiments on four datasets across different domains show that our method not only outperforms existing online kernel dictionary learning approaches but also achieves classification accuracy close to that of batch-trained models, while remaining significantly more efficient.
Chinese: 我们提出了一种基于核稀疏表示的高效在线字典学习算法,采用递归最小二乘法更新字典,在多个数据集上实现了接近批量训练的精度,同时保持显著更高的计算效率。
English: We introduce an efficient online dictionary learning algorithm for kernel-based sparse representations that uses recursive least squares for dictionary updates, achieving near-batch model accuracy with superior computational efficiency across multiple datasets.
Authors:Boyuan Sun, Modi Jin, Bowen Yin, Qibin Hou
Abstract:
We present Depth Anything at Any Condition (DepthAnything-AC), a foundation monocular depth estimation (MDE) model capable of handling diverse environmental conditions. Previous foundation MDE models achieve impressive performance across general scenes but not perform well in complex open-world environments that involve challenging conditions, such as illumination variations, adverse weather, and sensor-induced distortions. To overcome the challenges of data scarcity and the inability of generating high-quality pseudo-labels from corrupted images, we propose an unsupervised consistency regularization finetuning paradigm that requires only a relatively small amount of unlabeled data. Furthermore, we propose the Spatial Distance Constraint to explicitly enforce the model to learn patch-level relative relationships, resulting in clearer semantic boundaries and more accurate details. Experimental results demonstrate the zero-shot capabilities of DepthAnything-AC across diverse benchmarks, including real-world adverse weather benchmarks, synthetic corruption benchmarks, and general benchmarks.
Project Page: https://ghost233lism.github.io/depthanything-AC-page
Code: https://github.com/HVision-NKU/DepthAnythingAC
中文总结:DepthAnything-AC是一种基础单目深度估计模型,通过无监督一致性正则化和空间距离约束处理多样化环境条件,在各类基准测试中展现出卓越的零样本能力。
English Summary: DepthAnything-AC is a foundation monocular depth estimation model that handles diverse environmental conditions through unsupervised consistency regularization and a Spatial Distance Constraint, demonstrating strong zero-shot performance across various benchmarks.
Authors:Georgii Levtsov, Dmitry Ustalov
Abstract:
With the advent of highly capable instruction-tuned neural language models, benchmarking in natural language processing (NLP) is increasingly shifting towards pairwise comparison leaderboards, such as LMSYS Arena, from traditional global pointwise scores (e.g., GLUE, BIG-bench, SWE-bench). This paper empirically investigates the strengths and weaknesses of both global scores and pairwise comparisons to aid decision-making in selecting appropriate model evaluation strategies. Through computational experiments on synthetic and real-world datasets using standard global metrics and the popular Bradley-Terry model for pairwise comparisons, we found that while global scores provide more reliable overall rankings, they can underestimate strong models with rare, significant errors or low confidence. Conversely, pairwise comparisons are particularly effective for identifying strong contenders among models with lower global scores, especially where quality metrics are hard to define (e.g., text generation), though they require more comparisons to converge if ties are frequent. Our code and data are available at https://github.com/HSPyroblast/srw-ranking under a permissive license.
中文:本研究比较了自然语言处理模型评估中的全局评分与成对比较方法,发现全局评分能提供可靠的总体排序但可能低估存在罕见错误的优秀模型,而成对比较在文本生成等复杂场景中更擅长识别顶尖模型但需要更多数据来解决平局问题。
English: This study compares global scoring and pairwise comparisons for evaluating NLP models, finding that global scores offer reliable overall rankings but may overlook strong models with rare errors, while pairwise comparisons better identify top performers in complex scenarios like text generation but require more data to resolve ties.
Authors:Camille Billouard, Dawa Derksen, Alexandre Constantin, Bruno Vallet
Abstract:
Neural Radiance Fields (NeRF) have recently emerged as a paradigm for 3D reconstruction from multiview satellite imagery. However, state-of-the-art NeRF methods are typically constrained to small scenes due to the memory footprint during training, which we study in this paper. Previous work on large-scale NeRFs palliate this by dividing the scene into NeRFs. This paper introduces Snake-NeRF, a framework that scales to large scenes. Our out-of-core method eliminates the need to load all images and networks simultaneously, and operates on a single device. We achieve this by dividing the region of interest into NeRFs that 3D tile without overlap. Importantly, we crop the images with overlap to ensure each NeRFs is trained with all the necessary pixels. We introduce a novel $2\times 2$ 3D tile progression strategy and segmented sampler, which together prevent 3D reconstruction errors along the tile edges. Our experiments conclude that large satellite images can effectively be processed with linear time complexity, on a single GPU, and without compromise in quality.
中文: Snake-NeRF是一种可扩展的框架,通过将场景划分为无重叠的3D区块并在单GPU上高效处理,实现了卫星图像的大规模三维重建且不损失质量。
English: Snake-NeRF is a scalable framework that enables large-scale 3D reconstruction from satellite imagery by dividing scenes into non-overlapping 3D tiles and processing them efficiently on a single GPU without quality loss.
Authors:Yuxiao Wang, Yu Lei, Zhenao Wei, Weiying Xue, Xinyu Jiang, Nan Zhuang, Qi Liu
Abstract:
The task of Human-Object conTact (HOT) detection involves identifying the specific areas of the human body that are touching objects. Nevertheless, current models are restricted to just one type of image, often leading to too much segmentation in areas with little interaction, and struggling to maintain category consistency within specific regions. To tackle this issue, a HOT framework, termed \textbf{P3HOT}, is proposed, which blends \textbf{P}rompt guidance and human \textbf{P}roximal \textbf{P}erception. To begin with, we utilize a semantic-driven prompt mechanism to direct the network's attention towards the relevant regions based on the correlation between image and text. Then a human proximal perception mechanism is employed to dynamically perceive key depth range around the human, using learnable parameters to effectively eliminate regions where interactions are not expected. Calculating depth resolves the uncertainty of the overlap between humans and objects in a 2D perspective, providing a quasi-3D viewpoint. Moreover, a Regional Joint Loss (RJLoss) has been created as a new loss to inhibit abnormal categories in the same area. A new evaluation metric called ``AD-Acc.'' is introduced to address the shortcomings of existing methods in addressing negative samples. Comprehensive experimental results demonstrate that our approach achieves state-of-the-art performance in four metrics across two benchmark datasets. Specifically, our model achieves an improvement of \textbf{0.7}$\uparrow$, \textbf{2.0}$\uparrow$, \textbf{1.6}$\uparrow$, and \textbf{11.0}$\uparrow$ in SC-Acc., mIoU, wIoU, and AD-Acc. metrics, respectively, on the HOT-Annotated dataset. The sources code are available at https://github.com/YuxiaoWang-AI/P3HOT.
中文:提出的P3HOT框架通过结合提示引导和人体邻近感知技术,提升了人-物接触检测的区域关注度和交互准确性,在基准数据集上实现了多项指标的领先性能。
English: The proposed P3HOT framework enhances human-object contact detection by integrating prompt guidance and human proximal perception to improve regional focus and interaction accuracy, achieving state-of-the-art results across multiple metrics on benchmark datasets.
Authors:Xu Zhang, Ming Lu, Yan Chen, Zhan Ma
Abstract:
In recent years, compressed domain semantic inference has primarily relied on learned image coding models optimized for mean squared error (MSE). However, MSE-oriented optimization tends to yield latent spaces with limited semantic richness, which hinders effective semantic inference in downstream tasks. Moreover, achieving high performance with these models often requires fine-tuning the entire vision model, which is computationally intensive, especially for large models. To address these problems, we introduce Perception-Oriented Latent Coding (POLC), an approach that enriches the semantic content of latent features for high-performance compressed domain semantic inference. With the semantically rich latent space, POLC requires only a plug-and-play adapter for fine-tuning, significantly reducing the parameter count compared to previous MSE-oriented methods. Experimental results demonstrate that POLC achieves rate-perception performance comparable to state-of-the-art generative image coding methods while markedly enhancing performance in vision tasks, with minimal fine-tuning overhead. Code is available at https://github.com/NJUVISION/POLC.
中文: 本文提出感知导向的潜在编码(POLC),通过增强压缩图像的语义丰富性,仅需即插即用的适配器微调即可显著提升视觉任务性能,大幅减少计算开销。
English: This paper introduces Perception-Oriented Latent Coding (POLC) to enhance semantic richness in compressed images, enabling superior vision task performance with minimal fine-tuning through a plug-and-play adapter.
Authors:Youngjin Oh, Junhyeong Kwon, Keuntek Lee, Nam Ik Cho
Abstract:
Recent deep learning-based image denoising methods have shown impressive performance; however, many lack the flexibility to adjust the denoising strength based on the noise levels, camera settings, and user preferences. In this paper, we introduce a new controllable denoising framework that adaptively removes noise from images by utilizing information from camera parameters. Specifically, we focus on ISO, shutter speed, and F-number, which are closely related to noise levels. We convert these selected parameters into a vector to control and enhance the performance of the denoising network. Experimental results show that our method seamlessly adds controllability to standard denoising neural networks and improves their performance. Code is available at https://github.com/OBAKSA/CPADNet.
中文: 本文提出了一种可控图像去噪框架,通过利用ISO、快门速度和光圈值等相机参数自适应去除噪声,不仅增强了标准去噪网络的灵活性,还提升了其性能。
English: This paper presents a controllable image denoising framework that adaptively removes noise by incorporating camera parameters like ISO, shutter speed, and F-number, enhancing both flexibility and performance of standard denoising networks.
Authors:Hao Wang, Keyan Hu, Xin Guo, Haifeng Li, Chao Tao
Abstract:
Remote sensing semantic segmentation must address both what the ground objects are within an image and where they are located. Consequently, segmentation models must ensure not only the semantic correctness of large-scale patches (low-frequency information) but also the precise localization of boundaries between patches (high-frequency information). However, most existing approaches rely heavily on discriminative learning, which excels at capturing low-frequency features, while overlooking its inherent limitations in learning high-frequency features for semantic segmentation. Recent studies have revealed that diffusion generative models excel at generating high-frequency details. Our theoretical analysis confirms that the diffusion denoising process significantly enhances the model's ability to learn high-frequency features; however, we also observe that these models exhibit insufficient semantic inference for low-frequency features when guided solely by the original image. Therefore, we integrate the strengths of both discriminative and generative learning, proposing the Integration of Discriminative and diffusion-based Generative learning for Boundary Refinement (IDGBR) framework. The framework first generates a coarse segmentation map using a discriminative backbone model. This map and the original image are fed into a conditioning guidance network to jointly learn a guidance representation subsequently leveraged by an iterative denoising diffusion process refining the coarse segmentation. Extensive experiments across five remote sensing semantic segmentation datasets (binary and multi-class segmentation) confirm our framework's capability of consistent boundary refinement for coarse results from diverse discriminative architectures. The source code will be available at https://github.com/KeyanHu-git/IDGBR.
Chinese Summary: IDGBR框架融合判别式与生成式学习,通过扩散去噪过程优化粗分割结果,有效提升了遥感语义分割中边界定位的精确度。
English Summary: The IDGBR framework combines discriminative and generative learning to enhance boundary precision in remote sensing semantic segmentation by refining coarse segmentation maps through a diffusion denoising process.
Authors:Benjamin Feuer, Lennart Purucker, Oussama Elachqar, Chinmay Hegde
Abstract:
Scientific applications of machine learning often rely on small, specialized models tuned to particular domains. Such models often achieve excellent performance, but lack flexibility. Foundation models offer versatility, but typically underperform specialized approaches, especially on non-traditional modalities and long-tail domains. We propose MARVIS (Modality Adaptive Reasoning over VISualizations), a training-free method that enables even small vision-language models to predict any data modality with high accuracy. MARVIS transforms latent embedding spaces into visual representations and then leverages the spatial and fine-grained reasoning skills of VLMs to successfully interpret and utilize them. MARVIS achieves competitive performance on vision, audio, biological, and tabular domains using a single 3B parameter model, achieving results that beat Gemini by 16\% on average and approach specialized methods, without exposing personally identifiable information (P.I.I.) or requiring any domain-specific training. We open source our code and datasets at https://github.com/penfever/marvis
中文摘要:MARVIS是一种无需训练的方法,通过将嵌入转换为视觉表征,使小型视觉语言模型能准确解读多种数据模态,在多个领域实现优异性能,且无需暴露个人身份信息或进行领域特定训练。
English Summary: MARVIS is a training-free method that enables small vision-language models to accurately interpret various data modalities by transforming embeddings into visual representations, achieving competitive performance across multiple domains without exposing PII or requiring domain-specific training.
Authors:Quang Minh Nguyen, Taegyoon Kim
Abstract:
In the stance detection task, a text is classified as either favorable, opposing, or neutral towards a target. Prior work suggests that the use of external information, e.g., excerpts from Wikipedia, improves stance detection performance. However, whether or not such information can benefit large language models (LLMs) remains an unanswered question, despite their wide adoption in many reasoning tasks. In this study, we conduct a systematic evaluation on how Wikipedia and web search external information can affect stance detection across eight LLMs and in three datasets with 12 targets. Surprisingly, we find that such information degrades performance in most cases, with macro F1 scores dropping by up to 27.9\%. We explain this through experiments showing LLMs' tendency to align their predictions with the stance and sentiment of the provided information rather than the ground truth stance of the given text. We also find that performance degradation persists with chain-of-thought prompting, while fine-tuning mitigates but does not fully eliminate it. Our findings, in contrast to previous literature on BERT-based systems which suggests that external information enhances performance, highlight the risks of information biases in LLM-based stance classifiers. Code is available at https://github.com/ngqm/acl2025-stance-detection.
中文: 本研究发现,与以往基于BERT系统的研究相反,外部信息如维基百科和网络搜索通常会降低大型语言模型在立场检测中的性能,因为它们倾向于与所提供信息的立场对齐,而非文本的真实立场。
English: This study finds that, contrary to previous research on BERT-based systems, external information like Wikipedia and web search often degrades stance detection performance in large language models (LLMs) by causing them to align with the provided information's bias rather than the text's actual stance.
Authors:Robert Aufschläger, Youssef Shoeb, Azarm Nowzad, Michael Heigl, Fabian Bally, Martin Schramm
Abstract:
The collection and release of street-level recordings as Open Data play a vital role in advancing autonomous driving systems and AI research. However, these datasets pose significant privacy risks, particularly for pedestrians, due to the presence of Personally Identifiable Information (PII) that extends beyond biometric traits such as faces. In this paper, we present cRID, a novel cross-modal framework combining Large Vision-Language Models, Graph Attention Networks, and representation learning to detect textual describable clues of PII and enhance person re-identification (Re-ID). Our approach focuses on identifying and leveraging interpretable features, enabling the detection of semantically meaningful PII beyond low-level appearance cues. We conduct a systematic evaluation of PII presence in person image datasets. Our experiments show improved performance in practical cross-dataset Re-ID scenarios, notably from Market-1501 to CUHK03-np (detected), highlighting the framework's practical utility. Code is available at https://github.com/RAufschlaeger/cRID.
中文摘要:本文提出cRID跨模态框架,通过检测可文本描述的个人身份信息并利用可解释特征分析,在街景图像中增强隐私保护并提升行人重识别性能。
English Summary: The paper introduces cRID, a cross-modal framework that enhances privacy protection in street-level imagery by detecting textual describable PII and improving person re-identification through interpretable feature analysis.
Authors:Chentao Shen, Ding Pan, Mingyu Mei, Zaixing He, Xinyue Zhao
Abstract:
Visual pose tracking is playing an increasingly vital role in industrial contexts in recent years. However, the pose tracking for industrial metal objects remains a challenging task especially in the real world-environments, due to the reflection characteristic of metal objects. To address this issue, we propose a novel 6DoF pose tracking method based on active control points. The method uses image control points to generate edge feature for optimization actively instead of 6DoF pose-based rendering, and serve them as optimization variables. We also introduce an optimal control point regression method to improve robustness. The proposed tracking method performs effectively in both dataset evaluation and real world tasks, providing a viable solution for real-time tracking of industrial metal objects. Our source code is made publicly available at: https://github.com/tomatoma00/ACPTracking.
中文: 本文提出了一种基于主动控制点的新型6DoF姿态跟踪方法,有效解决了工业金属物体因反光特性导致的跟踪难题,在数据集评估和实际应用中均表现优异。
English: This paper introduces a novel 6DoF pose tracking method using active control points to overcome the challenges of tracking reflective industrial metal objects, demonstrating effectiveness in both datasets and real-world applications.
Authors:Jonáš Herec, VÃt RůžiÄka, Rado PitoÅák
Abstract:
Methane is a potent greenhouse gas, and detecting its leaks early via hyperspectral satellite imagery can help mitigate climate change. Meanwhile, many existing missions operate in manual tasking regimes only, thus missing potential events of interest. To overcome slow downlink rates cost-effectively, onboard detection is a viable solution. However, traditional methane enhancement methods are too computationally demanding for resource-limited onboard hardware. This work accelerates methane detection by focusing on efficient, low-power algorithms. We test fast target detection methods (ACE, CEM) that have not been previously used for methane detection and propose a Mag1c-SAS - a significantly faster variant of the current state-of-the-art algorithm for methane detection: Mag1c. To explore their true detection potential, we integrate them with a machine learning model (U-Net, LinkNet). Our results identify two promising candidates (Mag1c-SAS and CEM), both acceptably accurate for the detection of strong plumes and computationally efficient enough for onboard deployment: one optimized more for accuracy, the other more for speed, achieving up to ~100x and ~230x faster computation than original Mag1c on resource-limited hardware. Additionally, we propose and evaluate three band selection strategies. One of them can outperform the method traditionally used in the field while using fewer channels, leading to even faster processing without compromising accuracy. This research lays the foundation for future advancements in onboard methane detection with minimal hardware requirements, improving timely data delivery. The produced code, data, and models are open-sourced and can be accessed from https://github.com/zaitra/methane-filters-benchmark.
中文: 本研究针对星载高光谱图像甲烷泄漏检测,提出了低功耗高效算法Mag1c-SAS和CEM,在保持强羽流识别精度的同时,运算速度较现有标准提升最高达230倍,为资源受限的硬件平台提供了可行的实时解决方案。
English: This study introduces efficient, low-power algorithms for onboard methane leak detection from hyperspectral imagery, proposing Mag1c-SAS and CEM methods that achieve up to 230x faster computation than current standards while maintaining accuracy for strong plume identification.
Authors:Ge Wu, Shen Zhang, Ruijing Shi, Shanghua Gao, Zhenyuan Chen, Lei Wang, Zhaowei Chen, Hongcheng Gao, Yao Tang, Jian Yang, Ming-Ming Cheng, Xiang Li
Abstract:
REPA and its variants effectively mitigate training challenges in diffusion models by incorporating external visual representations from pretrained models, through alignment between the noisy hidden projections of denoising networks and foundational clean image representations. We argue that the external alignment, which is absent during the entire denoising inference process, falls short of fully harnessing the potential of discriminative representations. In this work, we propose a straightforward method called Representation Entanglement for Generation (REG), which entangles low-level image latents with a single high-level class token from pretrained foundation models for denoising. REG acquires the capability to produce coherent image-class pairs directly from pure noise, substantially improving both generation quality and training efficiency. This is accomplished with negligible additional inference overhead, requiring only one single additional token for denoising (<0.5\% increase in FLOPs and latency). The inference process concurrently reconstructs both image latents and their corresponding global semantics, where the acquired semantic knowledge actively guides and enhances the image generation process. On ImageNet 256$\times$256, SiT-XL/2 + REG demonstrates remarkable convergence acceleration, achieving $\textbf{63}\times$ and $\textbf{23}\times$ faster training than SiT-XL/2 and SiT-XL/2 + REPA, respectively. More impressively, SiT-L/2 + REG trained for merely 400K iterations outperforms SiT-XL/2 + REPA trained for 4M iterations ($\textbf{10}\times$ longer). Code is available at: https://github.com/Martinser/REG.
中文: 本研究提出的REG方法通过将低层级图像潜在变量与高层级类别标记相融合,有效提升了扩散模型的性能,能够从纯噪声中高效生成协调的图像-类别对,在显著加速训练过程并提高生成质量的同时,仅需极小的额外计算开销。
English: The proposed REG method enhances diffusion models by entangling low-level image latents with a high-level class token, enabling efficient generation of coherent image-class pairs from noise with minimal computational overhead and significant improvements in training speed and quality.
Authors:Ge Wu, Shen Zhang, Ruijing Shi, Shanghua Gao, Zhenyuan Chen, Lei Wang, Zhaowei Chen, Hongcheng Gao, Yao Tang, Jian Yang, Ming-Ming Cheng, Xiang Li
Abstract:
REPA and its variants effectively mitigate training challenges in diffusion models by incorporating external visual representations from pretrained models, through alignment between the noisy hidden projections of denoising networks and foundational clean image representations. We argue that the external alignment, which is absent during the entire denoising inference process, falls short of fully harnessing the potential of discriminative representations. In this work, we propose a straightforward method called Representation Entanglement for Generation (REG), which entangles low-level image latents with a single high-level class token from pretrained foundation models for denoising. REG acquires the capability to produce coherent image-class pairs directly from pure noise, substantially improving both generation quality and training efficiency. This is accomplished with negligible additional inference overhead, requiring only one single additional token for denoising (<0.5\% increase in FLOPs and latency). The inference process concurrently reconstructs both image latents and their corresponding global semantics, where the acquired semantic knowledge actively guides and enhances the image generation process. On ImageNet 256$\times$256, SiT-XL/2 + REG demonstrates remarkable convergence acceleration, achieving $\textbf{63}\times$ and $\textbf{23}\times$ faster training than SiT-XL/2 and SiT-XL/2 + REPA, respectively. More impressively, SiT-L/2 + REG trained for merely 400K iterations outperforms SiT-XL/2 + REPA trained for 4M iterations ($\textbf{10}\times$ longer). Code is available at: https://github.com/Martinser/REG.
中文: 本研究提出的REG方法通过将低层级图像潜在变量与高层级类别标记相融合,有效提升了扩散模型的性能,能够从纯噪声中高效生成协调的图像-类别对,在显著加速训练过程并提高生成质量的同时,仅需极小的额外计算开销。
English: The proposed REG method enhances diffusion models by entangling low-level image latents with a high-level class token, enabling efficient generation of coherent image-class pairs from noise with minimal computational overhead and significant improvements in training speed and quality.
Authors:Tianyu Liu, Qitan Lv, Hao Li, Xing Gao, Xiao Sun
Abstract:
Speculative decoding (SD), where a small draft model is employed to propose draft tokens in advance and then the target model validates them in parallel, has emerged as a promising technique for LLM inference acceleration. Many endeavors to improve SD are to eliminate the need for a draft model and generate draft tokens in a retrieval-based manner in order to further alleviate the drafting overhead and significantly reduce the difficulty in deployment and applications. However, retrieval-based SD relies on a matching paradigm to retrieval the most relevant reference as the draft tokens, where these methods often fail to find matched and accurate draft tokens. To address this challenge, we propose LogitSpec to effectively expand the retrieval range and find the most relevant reference as drafts. Our LogitSpec is motivated by the observation that the logit of the last token can not only predict the next token, but also speculate the next next token. Specifically, LogitSpec generates draft tokens in two steps: (1) utilizing the last logit to speculate the next next token; (2) retrieving relevant reference for both the next token and the next next token. LogitSpec is training-free and plug-and-play, which can be easily integrated into existing LLM inference frameworks. Extensive experiments on a wide range of text generation benchmarks demonstrate that LogitSpec can achieve up to 2.61 $\times$ speedup and 3.28 mean accepted tokens per decoding step. Our code is available at https://github.com/smart-lty/LogitSpec.
中文: LogitSpec 是一种无需训练、即插即用的方法,通过利用最后一个令牌的 logit 推测并检索草稿令牌来加速大语言模型推理,实现了高达 2.61 倍的加速和更高的令牌接受率。
English: LogitSpec is a training-free, plug-and-play method that accelerates LLM inference by using the last token's logit to speculate and retrieve draft tokens, achieving up to 2.61× speedup and higher token acceptance rates.
Authors:Shaocheng Yan, Pengcheng Shi, Zhenjun Zhao, Kaixin Wang, Kuang Cao, Ji Wu, Jiayuan Li
Abstract:
Robust estimation is essential in correspondence-based Point Cloud Registration (PCR). Existing methods using maximal clique search in compatibility graphs achieve high recall but suffer from exponential time complexity, limiting their use in time-sensitive applications. To address this challenge, we propose a fast and robust estimator, TurboReg, built upon a novel lightweight clique, TurboClique, and a highly parallelizable Pivot-Guided Search (PGS) algorithm. First, we define the TurboClique as a 3-clique within a highly-constrained compatibility graph. The lightweight nature of the 3-clique allows for efficient parallel searching, and the highly-constrained compatibility graph ensures robust spatial consistency for stable transformation estimation. Next, PGS selects matching pairs with high SC$^2$ scores as pivots, effectively guiding the search toward TurboCliques with higher inlier ratios. Moreover, the PGS algorithm has linear time complexity and is significantly more efficient than the maximal clique search with exponential time complexity. Extensive experiments show that TurboReg achieves state-of-the-art performance across multiple real-world datasets, with substantial speed improvements. For example, on the 3DMatch+FCGF dataset, TurboReg (1K) operates $208.22\times$ faster than 3DMAC while also achieving higher recall. Our code is accessible at \href{https://github.com/Laka-3DV/TurboReg}{\texttt{TurboReg}}.
中文: TurboReg通过轻量级TurboCliques和并行化Pivot-Guided Search算法,提出了一种快速鲁棒的点云配准估计器,在保持线性时间复杂度的同时实现了最优性能与显著加速。
English: TurboReg introduces a fast and robust PCR estimator using lightweight TurboCliques and a parallel Pivot-Guided Search, achieving state-of-the-art performance with linear time complexity and significant speed gains.
Authors:Chen Sun, Haiyang Sun, Zhiqing Guo, Yunfeng Diao, Liejun Wang, Dan Ma, Gaobo Yang, Keqin Li
Abstract:
Deepfakes pose significant security and privacy threats through malicious facial manipulations. While robust watermarking can aid in authenticity verification and source tracking, existing methods often lack the sufficient robustness against Deepfake manipulations. Diffusion models have demonstrated remarkable performance in image generation, enabling the seamless fusion of watermark with image during generation. In this study, we propose a novel robust watermarking framework based on diffusion model, called DiffMark. By modifying the training and sampling scheme, we take the facial image and watermark as conditions to guide the diffusion model to progressively denoise and generate corresponding watermarked image. In the construction of facial condition, we weight the facial image by a timestep-dependent factor that gradually reduces the guidance intensity with the decrease of noise, thus better adapting to the sampling process of diffusion model. To achieve the fusion of watermark condition, we introduce a cross information fusion (CIF) module that leverages a learnable embedding table to adaptively extract watermark features and integrates them with image features via cross-attention. To enhance the robustness of the watermark against Deepfake manipulations, we integrate a frozen autoencoder during training phase to simulate Deepfake manipulations. Additionally, we introduce Deepfake-resistant guidance that employs specific Deepfake model to adversarially guide the diffusion sampling process to generate more robust watermarked images. Experimental results demonstrate the effectiveness of the proposed DiffMark on typical Deepfakes. Our code will be available at https://github.com/vpsg-research/DiffMark.
中文摘要:本研究提出基于扩散模型的鲁棒水印框架DiffMark,通过面部条件加权和交叉信息融合模块自适应嵌入水印,并采用对抗性训练策略提升对深度伪造攻击的防御能力。
English Summary: This study introduces DiffMark, a robust watermarking framework based on diffusion models that integrates facial images and watermarks through adaptive conditioning and cross-attention fusion to enhance resistance against Deepfake manipulations.
Authors:Kuniaki Saito, Donghyun Kim, Kwanyong Park, Atsushi Hashimoto, Yoshitaka Ushiku
Abstract:
An image captioning model flexibly switching its language pattern, e.g., descriptiveness and length, should be useful since it can be applied to diverse applications. However, despite the dramatic improvement in generative vision-language models, fine-grained control over the properties of generated captions is not easy due to two reasons: (i) existing models are not given the properties as a condition during training and (ii) existing models cannot smoothly transition its language pattern from one state to the other. Given this challenge, we propose a new approach, CaptionSmiths, to acquire a single captioning model that can handle diverse language patterns. First, our approach quantifies three properties of each caption, length, descriptiveness, and uniqueness of a word, as continuous scalar values, without human annotation. Given the values, we represent the conditioning via interpolation between two endpoint vectors corresponding to the extreme states, e.g., one for a very short caption and one for a very long caption. Empirical results demonstrate that the resulting model can smoothly change the properties of the output captions and show higher lexical alignment than baselines. For instance, CaptionSmiths reduces the error in controlling caption length by 506\% despite better lexical alignment. Code will be available on https://github.com/omron-sinicx/captionsmiths.
Chinese: 提出的CaptionSmiths模型通过将标题属性量化为连续值并采用向量插值方法,实现了对标题长度和描述性等特征的精细控制,相比现有方法能更流畅地调整语言模式并提升词汇对齐效果。
English: The proposed CaptionSmiths model enables fine-grained control over caption properties like length and descriptiveness by quantifying them as continuous values and using vector interpolation, achieving smoother transitions and better lexical alignment than existing methods.
Authors:Huanwen Liang, Jingxian Xu, Yuanji Zhang, Yuhao Huang, Yuhan Zhang, Xin Yang, Ran Li, Xuedong Deng, Yanjun Liu, Guowei Tao, Yun Wu, Sheng Zhao, Xinru Gao, Dong Ni
Abstract:
Fetal abdominal malformations are serious congenital anomalies that require accurate diagnosis to guide pregnancy management and reduce mortality. Although AI has demonstrated significant potential in medical diagnosis, its application to prenatal abdominal anomalies remains limited. Most existing studies focus on image-level classification and rely on standard plane localization, placing less emphasis on case-level diagnosis. In this paper, we develop a case-level multiple instance learning (MIL)-based method, free of standard plane localization, for classifying fetal abdominal anomalies in prenatal ultrasound. Our contribution is three-fold. First, we adopt a mixture-of-attention-experts module (MoAE) to weight different attention heads for various planes. Secondly, we propose a medical-knowledge-driven feature selection module (MFS) to align image features with medical knowledge, performing self-supervised image token selection at the case-level. Finally, we propose a prompt-based prototype learning (PPL) to enhance the MFS. Extensively validated on a large prenatal abdominal ultrasound dataset containing 2,419 cases, with a total of 24,748 images and 6 categories, our proposed method outperforms the state-of-the-art competitors. Codes are available at:https://github.com/LL-AC/AAcls.
中文: 本研究提出了一种新颖的病例级多示例学习方法,用于产前超声中胎儿腹部畸形的分类,无需标准切面定位,通过整合注意力机制、医学知识和原型学习,在大型数据集上实现了卓越性能。
English: This study introduces a novel case-level multiple instance learning method for classifying fetal abdominal anomalies in prenatal ultrasound, which eliminates the need for standard plane localization and integrates attention mechanisms, medical knowledge, and prototype learning to achieve superior performance on a large dataset.
Authors:Langyu Wang, Bingke Zhu, Yingying Chen, Yiyuan Zhang, Ming Tang, Jinqiao Wang
Abstract:
The weakly-supervised audio-visual video parsing (AVVP) aims to predict all modality-specific events and locate their temporal boundaries. Despite significant progress, due to the limitations of the weakly-supervised and the deficiencies of the model architecture, existing methods are lacking in simultaneously improving both the segment-level prediction and the event-level prediction. In this work, we propose a audio-visual Mamba network with pseudo labeling aUGmentation (MUG) for emphasising the uniqueness of each segment and excluding the noise interference from the alternate modalities. Specifically, we annotate some of the pseudo-labels based on previous work. Using unimodal pseudo-labels, we perform cross-modal random combinations to generate new data, which can enhance the model's ability to parse various segment-level event combinations. For feature processing and interaction, we employ a audio-visual mamba network. The AV-Mamba enhances the ability to perceive different segments and excludes additional modal noise while sharing similar modal information. Our extensive experiments demonstrate that MUG improves state-of-the-art results on LLP dataset in all metrics (e.g,, gains of 2.1% and 1.2% in terms of visual Segment-level and audio Segment-level metrics). Our code is available at https://github.com/WangLY136/MUG.
Chinese Summary: 本文提出了一种带有伪标签增强的视听Mamba网络(MUG),通过生成跨模态训练数据和减少噪声干扰来提升片段级事件解析能力,在LLP数据集的所有指标上均实现了最先进的性能提升。
English Summary: This paper introduces a novel audio-visual Mamba network with pseudo-label augmentation (MUG) that enhances segment-level event parsing by generating cross-modal training data and reducing noise interference, achieving state-of-the-art performance improvements on the LLP dataset.
Authors:Tianrui Lou, Xiaojun Jia, Siyuan Liang, Jiawei Liang, Ming Zhang, Yanjun Xiao, Xiaochun Cao
Abstract:
Physical adversarial attack methods expose the vulnerabilities of deep neural networks and pose a significant threat to safety-critical scenarios such as autonomous driving. Camouflage-based physical attack is a more promising approach compared to the patch-based attack, offering stronger adversarial effectiveness in complex physical environments. However, most prior work relies on mesh priors of the target object and virtual environments constructed by simulators, which are time-consuming to obtain and inevitably differ from the real world. Moreover, due to the limitations of the backgrounds in training images, previous methods often fail to produce multi-view robust adversarial camouflage and tend to fall into sub-optimal solutions. Due to these reasons, prior work lacks adversarial effectiveness and robustness across diverse viewpoints and physical environments. We propose a physical attack framework based on 3D Gaussian Splatting (3DGS), named PGA, which provides rapid and precise reconstruction with few images, along with photo-realistic rendering capabilities. Our framework further enhances cross-view robustness and adversarial effectiveness by preventing mutual and self-occlusion among Gaussians and employing a min-max optimization approach that adjusts the imaging background of each viewpoint, helping the algorithm filter out non-robust adversarial features. Extensive experiments validate the effectiveness and superiority of PGA. Our code is available at:https://github.com/TRLou/PGA.
中文: 提出的PGA框架利用3D高斯溅射技术,通过少量图像快速重建并逼真渲染场景,借助防止遮挡和针对各视角背景调整的极小极大优化方法,显著提升了对抗迷彩在多视角下的有效性和鲁棒性。
English: The proposed PGA framework leverages 3D Gaussian Splatting to rapidly reconstruct and render realistic scenes from few images, enhancing adversarial camouflage effectiveness and cross-view robustness through occlusion prevention and min-max optimization that adapts backgrounds across viewpoints.
Authors:Cuong Le, Huy-Phuong Le, Duc Le, Minh-Thien Duong, Van-Binh Nguyen, My-Ha Le
Abstract:
Body dynamics are crucial information for the analysis of human motions in important research fields, ranging from biomechanics, sports science to computer vision and graphics. Modern approaches collect the body dynamics, external reactive force specifically, via force plates, synchronizing with human motion capture data, and learn to estimate the dynamics from a black-box deep learning model. Being specialized devices, force plates can only be installed in laboratory setups, imposing a significant limitation on the learning of human dynamics. To this end, we propose a novel method for estimating human ground reaction dynamics directly from the more reliable motion capture data with physics laws and computational simulation as constrains. We introduce a highly accurate and robust method for computing ground reaction forces from motion capture data using Euler's integration scheme and PD algorithm. The physics-based reactive forces are used to inform the learning model about the physics-informed motion dynamics thus improving the estimation accuracy. The proposed approach was tested on the GroundLink dataset, outperforming the baseline model on: 1) the ground reaction force estimation accuracy compared to the force plates measurement; and 2) our simulated root trajectory precision. The implementation code is available at https://github.com/cuongle1206/Phys-GRD
中文: 本文提出了一种基于物理的方法,利用运动捕捉数据结合欧拉积分和PD控制来精确估计地面反作用力,在力估计精度和轨迹精度方面均优于基线模型。
English: This paper introduces a physics-based method that uses motion capture data with Euler's integration and PD control to accurately estimate ground reaction forces, outperforming baseline models in both force estimation and trajectory precision.
Authors:Dong Liang, Xingyu Qiu, Yuzhen Li, Wei Wang, Kuanquan Wang, Suyu Dong, Gongning Luo
Abstract:
MR imaging techniques are of great benefit to disease diagnosis. However, due to the limitation of MR devices, significant intensity inhomogeneity often exists in imaging results, which impedes both qualitative and quantitative medical analysis. Recently, several unsupervised deep learning-based models have been proposed for MR image improvement. However, these models merely concentrate on global appearance learning, and neglect constraints from image structures and smoothness of bias field, leading to distorted corrected results. In this paper, novel structure and smoothness constrained dual networks, named S2DNets, are proposed aiming to self-supervised bias field correction. S2DNets introduce piece-wise structural constraints and smoothness of bias field for network training to effectively remove non-uniform intensity and retain much more structural details. Extensive experiments executed on both clinical and simulated MR datasets show that the proposed model outperforms other conventional and deep learning-based models. In addition to comparison on visual metrics, downstream MR image segmentation tasks are also used to evaluate the impact of the proposed model. The source code is available at: https://github.com/LeongDong/S2DNets}{https://github.com/LeongDong/S2DNets.
中文: 本文提出S2DNets,一种新型自监督双网络,通过引入分段结构约束和偏置场平滑性来有效校正MR图像强度不均匀性并保留结构细节,实验表明其性能优于现有方法。
English: This paper introduces S2DNets, a novel self-supervised dual network that incorporates structural constraints and bias field smoothness to effectively correct intensity inhomogeneity in MR images while preserving structural details, outperforming existing methods in experiments.
Authors:Ahmad Chaddad, Jihao Peng, Yihang Wu
Abstract:
The use of deep learning (DL) in medical image analysis has significantly improved the ability to predict lung cancer. In this study, we introduce a novel deep convolutional neural network (CNN) model, named ResNet+, which is based on the established ResNet framework. This model is specifically designed to improve the prediction of lung cancer and diseases using the images. To address the challenge of missing feature information that occurs during the downsampling process in CNNs, we integrate the ResNet-D module, a variant designed to enhance feature extraction capabilities by modifying the downsampling layers, into the traditional ResNet model. Furthermore, a convolutional attention module was incorporated into the bottleneck layers to enhance model generalization by allowing the network to focus on relevant regions of the input images. We evaluated the proposed model using five public datasets, comprising lung cancer (LC2500 $n$=3183, IQ-OTH/NCCD $n$=1336, and LCC $n$=25000 images) and lung disease (ChestXray $n$=5856, and COVIDx-CT $n$=425024 images). To address class imbalance, we used data augmentation techniques to artificially increase the representation of underrepresented classes in the training dataset. The experimental results show that ResNet+ model demonstrated remarkable accuracy/F1, reaching 98.14/98.14\% on the LC25000 dataset and 99.25/99.13\% on the IQ-OTH/NCCD dataset. Furthermore, the ResNet+ model saved computational cost compared to the original ResNet series in predicting lung cancer images. The proposed model outperformed the baseline models on publicly available datasets, achieving better performance metrics. Our codes are publicly available at https://github.com/AIPMLab/Graduation-2024/tree/main/Peng.
中文: 本研究提出ResNet+增强型深度卷积神经网络,通过整合ResNet-D模块和注意力机制来提升医学影像中肺癌及肺部疾病的预测能力,在多个公共数据集上实现了高精度和计算效率。
English: This study introduces ResNet+, an enhanced deep convolutional neural network that integrates a ResNet-D module and attention mechanism to improve lung cancer and disease prediction from medical images, achieving high accuracy and computational efficiency across multiple public datasets.
Authors:Zhuo Su, Li Liu, Matthias Müller, Jiehua Zhang, Diana Wofk, Ming-Ming Cheng, Matti Pietikäinen
Abstract:
This paper addresses the challenge of deploying salient object detection (SOD) on resource-constrained devices with real-time performance. While recent advances in deep neural networks have improved SOD, existing top-leading models are computationally expensive. We propose an efficient network design that combines traditional wisdom on SOD and the representation power of modern CNNs. Like biologically-inspired classical SOD methods relying on computing contrast cues to determine saliency of image regions, our model leverages Pixel Difference Convolutions (PDCs) to encode the feature contrasts. Differently, PDCs are incorporated in a CNN architecture so that the valuable contrast cues are extracted from rich feature maps. For efficiency, we introduce a difference convolution reparameterization (DCR) strategy that embeds PDCs into standard convolutions, eliminating computation and parameters at inference. Additionally, we introduce SpatioTemporal Difference Convolution (STDC) for video SOD, enhancing the standard 3D convolution with spatiotemporal contrast capture. Our models, SDNet for image SOD and STDNet for video SOD, achieve significant improvements in efficiency-accuracy trade-offs. On a Jetson Orin device, our models with $<$ 1M parameters operate at 46 FPS and 150 FPS on streamed images and videos, surpassing the second-best lightweight models in our experiments by more than $2\times$ and $3\times$ in speed with superior accuracy. Code will be available at https://github.com/hellozhuo/stdnet.git.
中文: 本文提出了一种结合传统对比度方法与现代卷积神经网络的高效显著目标检测模型,通过像素差分卷积在资源受限设备上实现了更优的速度与精度平衡。
English: This paper introduces efficient models for salient object detection that combine traditional contrast-based methods with modern CNNs using Pixel Difference Convolutions, achieving superior speed-accuracy trade-offs on resource-constrained devices.
Authors:Simon Börjesson, Erik Ersmark, Pierre Nugues
Abstract:
The \textit{Nordisk familjebok} is a Swedish encyclopedia from the 19th and 20th centuries. It was written by a team of experts and aimed to be an intellectual reference, stressing precision and accuracy. This encyclopedia had four main editions remarkable by their size, ranging from 20 to 38 volumes. As a consequence, the \textit{Nordisk familjebok} had a considerable influence in universities, schools, the media, and society overall. As new editions were released, the selection of entries and their content evolved, reflecting intellectual changes in Sweden.
In this paper, we used digitized versions from \textit{Project Runeberg}. We first resegmented the raw text into entries and matched pairs of entries between the first and second editions using semantic sentence embeddings. We then extracted the geographical entries from both editions using a transformer-based classifier and linked them to Wikidata. This enabled us to identify geographic trends and possible shifts between the first and second editions, written between 1876-1899 and 1904-1926, respectively.
Interpreting the results, we observe a small but significant shift in geographic focus away from Europe and towards North America, Africa, Asia, Australia, and northern Scandinavia from the first to the second edition, confirming the influence of the First World War and the rise of new powers. The code and data are available on GitHub at https://github.com/sibbo/nordisk-familjebok.
中文: 本研究通过分析瑞典百科全书《Nordisk familjebok》的数字化版本,发现其第一版到第二版间地理关注点从欧洲显著转向北美、非洲、亚洲、澳大利亚和北欧地区,反映了第一次世界大战等全球性变革的影响。
English: This study analyzes digitized versions of the Swedish encyclopedia *Nordisk familjebok* to identify a notable shift in geographic focus from Europe toward North America, Africa, Asia, Australia, and northern Scandinavia between its first and second editions, reflecting global changes such as World War I.
Authors:Liangyu Wang, Junxiao Wang, Jie Ren, Zihang Xiang, David E. Keyes, Di Wang
Abstract:
As large language models (LLMs) increasingly underpin technological advancements, the privacy of their training data emerges as a critical concern. Differential Privacy (DP) serves as a rigorous mechanism to protect this data, yet its integration via Differentially Private Stochastic Gradient Descent (DP-SGD) introduces substantial challenges, primarily due to the complexities of per-sample gradient clipping. Current explicit methods, such as Opacus, necessitate extensive storage for per-sample gradients, significantly inflating memory requirements. Conversely, implicit methods like GhostClip reduce storage needs by recalculating gradients multiple times, which leads to inefficiencies due to redundant computations. This paper introduces FlashDP, an innovative cache-friendly per-layer DP-SGD that consolidates necessary operations into a single task, calculating gradients only once in a fused manner. This approach not only diminishes memory movement by up to \textbf{50\%} but also cuts down redundant computations by \textbf{20\%}, compared to previous methods. Consequently, FlashDP does not increase memory demands and achieves a \textbf{90\%} throughput compared to the Non-DP method on a four-A100 system during the pre-training of the Llama-13B model, while maintaining parity with standard per-layer clipped DP-SGD in terms of accuracy. These advancements establish FlashDP as a pivotal development for efficient and privacy-preserving training of LLMs. FlashDP's code has been open-sourced in https://github.com/kaustpradalab/flashdp.
中文: FlashDP提出了一种缓存友好的逐层差分隐私随机梯度下降方法,将内存移动减少50%,冗余计算降低20%,在保持精度的同时达到非差分隐私方法90%的吞吐量,实现了高效保护隐私的大语言模型训练。
English: FlashDP introduces a cache-friendly per-layer DP-SGD method that reduces memory movement by 50% and redundant computations by 20%, achieving 90% throughput of non-DP methods while maintaining accuracy for efficient privacy-preserving LLM training.
Authors:Brenda Nogueira, Gabe Gomes, Meng Jiang, Nitesh V. Chawla, Nuno Moniz
Abstract:
Graph-structured data is ubiquitous in scientific domains, where models often face imbalanced learning settings. In imbalanced regression, domain preferences focus on specific target value ranges that represent the most scientifically valuable cases; however, we observe a significant lack of research regarding this challenge. In this paper, we present Spectral Manifold Harmonization (SMH), a novel approach to address imbalanced regression challenges on graph-structured data by generating synthetic graph samples that preserve topological properties while focusing on the most relevant target distribution regions. Conventional methods fail in this context because they either ignore graph topology in case generation or do not target specific domain ranges, resulting in models biased toward average target values. Experimental results demonstrate the potential of SMH on chemistry and drug discovery benchmark datasets, showing consistent improvements in predictive performance for target domain ranges. Code is available at https://github.com/brendacnogueira/smh-graph-imbalance.git.
中文: 本文提出谱流形协调方法,通过生成保留拓扑结构并聚焦关键目标区域的合成图样本,有效解决图结构数据中的不平衡回归问题,在化学和药物发现基准测试中展现出优越的预测性能。
English: This paper introduces Spectral Manifold Harmonization (SMH), a novel method that generates synthetic graph samples to address imbalanced regression in graph-structured data by preserving topology and focusing on key target ranges, demonstrating improved predictive performance in chemistry and drug discovery benchmarks.
Authors:Jing Yu, Yibo Zhao, Jiapeng Zhu, Wenming Shao, Bo Pang, Zhao Zhang, Xiang Li
Abstract:
The widespread dissemination of toxic content on social media poses a serious threat to both online environments and public discourse, highlighting the urgent need for detoxification methods that effectively remove toxicity while preserving the original semantics. However, existing approaches often struggle to simultaneously achieve strong detoxification performance, semantic preservation, and robustness to out-of-distribution data. Moreover, they typically rely on costly, manually annotated parallel corpora while showing poor data efficiency. To address these challenges, we propose a two-stage training framework that jointly optimizes for data efficiency, semantic preservation, and model generalization. We first perform supervised fine-tuning on a small set of high-quality, filtered parallel data to establish a strong initialization. Then, we leverage unlabeled toxic inputs and a custom-designed reward model to train the LLM using Group Relative Policy Optimization. Experimental results demonstrate that our method effectively mitigates the trade-offs faced by previous work, achieving state-of-the-art performance with improved generalization and significantly reduced dependence on annotated data. Our code is available at: https://github.com/allacnobug/Detoxification-of-Text.
中文摘要:本研究提出了一种新颖的两阶段训练框架,通过监督微调和强化学习的结合,有效解决了现有文本去毒方法在毒性消除、语义保留和数据效率方面的局限,实现了更优的综合性能。
English Summary: This study introduces a novel two-stage training framework that effectively addresses the limitations of existing text detoxification methods by achieving superior toxicity removal, semantic preservation, and data efficiency through supervised fine-tuning and reinforcement learning.
Authors:Tianxiang Xia, Max Neuwinger, Lin Xiao
Abstract:
Clifford Neural Layers improve PDE modeling by introducing Clifford Algebra into neural networks. In this project we focus on optimizing the inference of 2/3D Clifford convolutional layers and multivector activation layers for one core CPU performance.
Overall, by testing on a real network block involving Clifford convolutional layers and multivector activation layers, we observe that our implementation is 30% faster than standard PyTorch implementation in relatively large data + network size (>L2 cache).
We open source our code base at https://github.com/egretwAlker/c-opt-clifford-layers
Chinese: Clifford神经层通过将克利福德代数引入神经网络改进了偏微分方程建模,在大规模数据和网络场景下比标准PyTorch实现快30%。
English: Clifford Neural Layers enhance PDE modeling by integrating Clifford Algebra into neural networks, achieving a 30% speed improvement over standard PyTorch in large-scale data scenarios.
Authors:Fanchen Bu, Kijung Shin
Abstract:
Geometric learning has emerged as a powerful paradigm for modeling non-Euclidean data, especially graph-structured ones, with applications spanning social networks, molecular structures, knowledge graphs, and recommender systems. While Nvidia's CUDA-enabled graphics processing units (GPUs) largely dominate the hardware landscape, emerging accelerators such as Intel's Gaudi Habana Processing Units (HPUs) offer competitive performance and energy efficiency. However, the usage of such non-CUDA processing units requires significant engineering effort and novel software adaptations. In this work, we present our experiences porting PyTorch-based geometric learning frameworks to Gaudi-v2 HPUs. We introduce a collection of core utilities that restore essential operations (e.g., scatter, sparse indexing, k-nearest neighbors) on Gaudi-v2 HPUs, and we consolidate sixteen guided tutorials and eleven real-world examples with diagnostic analyses of encountered failures and detailed workarounds. We collect all our experiences into a publicly accessible GitHub repository. Our contributions lower the barrier for researchers to experiment with geometric-learning algorithms and models on non-CUDA hardware, providing a foundation for further optimization and cross-platform portability.
中文摘要:本研究实现了基于PyTorch的几何学习框架在英特尔Gaudi-v2 HPU上的移植,通过提供核心工具和丰富实践资源,显著降低了在非CUDA硬件上进行几何学习研究的门槛。
English Summary: This work presents the adaptation of PyTorch-based geometric learning frameworks to Intel Gaudi-v2 HPUs, offering core utilities and practical resources to facilitate research on non-CUDA hardware.
Authors:Zhe Kong, Le Li, Yong Zhang, Feng Gao, Shaoshu Yang, Tao Wang, Kaihao Zhang, Zhuoliang Kang, Xiaoming Wei, Guanying Chen, Wenhan Luo
Abstract:
Real-world video super-resolution (VSR) presents significant challenges due to complex and unpredictable degradations. Although some recent methods utilize image diffusion models for VSR and have shown improved detail generation capabilities, they still struggle to produce temporally consistent frames. We attempt to use Stable Video Diffusion (SVD) combined with ControlNet to address this issue. However, due to the intrinsic image-animation characteristics of SVD, it is challenging to generate fine details using only low-quality videos. To tackle this problem, we propose DAM-VSR, an appearance and motion disentanglement framework for VSR. This framework disentangles VSR into appearance enhancement and motion control problems. Specifically, appearance enhancement is achieved through reference image super-resolution, while motion control is achieved through video ControlNet. This disentanglement fully leverages the generative prior of video diffusion models and the detail generation capabilities of image super-resolution models. Furthermore, equipped with the proposed motion-aligned bidirectional sampling strategy, DAM-VSR can conduct VSR on longer input videos. DAM-VSR achieves state-of-the-art performance on real-world data and AIGC data, demonstrating its powerful detail generation capabilities.
中文: 真实世界视频超分辨率面临时间一致性和细节生成的挑战,提出的DAM-VSR框架通过解耦外观增强与运动控制,利用扩散模型和双向采样策略实现了卓越性能。
English: Real-world video super-resolution faces challenges with temporal consistency and detail generation, which the proposed DAM-VSR framework addresses by disentangling appearance enhancement and motion control using diffusion models and a bidirectional sampling strategy for superior performance.
Authors:V Team, Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, Shuaiqi Duan, Weihan Wang, Yan Wang, Yean Cheng, Zehai He, Zhe Su, Zhen Yang, Ziyang Pan, Aohan Zeng, Baoxu Wang, Bin Chen, Boyan Shi, Changyu Pang, Chenhui Zhang, Da Yin, Fan Yang, Guoqing Chen, Jiazheng Xu, Jiale Zhu, Jiali Chen, Jing Chen, Jinhao Chen, Jinghao Lin, Jinjiang Wang, Junjie Chen, Leqi Lei, Letian Gong, Leyi Pan, Mingdao Liu, Mingde Xu, Mingzhi Zhang, Qinkai Zheng, Sheng Yang, Shi Zhong, Shiyu Huang, Shuyuan Zhao, Siyan Xue, Shangqin Tu, Shengbiao Meng, Tianshu Zhang, Tianwei Luo, Tianxiang Hao, Tianyu Tong, Wenkai Li, Wei Jia, Xiao Liu, Xiaohan Zhang, Xin Lyu, Xinyue Fan, Xuancheng Huang, Yanling Wang, Yadong Xue, Yanfeng Wang, Yanzi Wang, Yifan An, Yifan Du, Yiming Shi, Yiheng Huang, Yilin Niu, Yuan Wang, Yuanchang Yue, Yuchen Li, Yutao Zhang, Yuting Wang, Yu Wang, Yuxuan Zhang, Zhao Xue, Zhenyu Hou, Zhengxiao Du, Zihan Wang, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Minlie Huang, Yuxiao Dong, Jie Tang
Abstract:
We present GLM-4.1V-Thinking and GLM-4.5V, a family of vision-language models (VLMs) designed to advance general-purpose multimodal understanding and reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the final performance. We then propose Reinforcement Learning with Curriculum Sampling (RLCS) to unlock the full potential of the model, leading to comprehensive capability enhancement across a diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document interpretation. In a comprehensive evaluation across 42 public benchmarks, GLM-4.5V achieves state-of-the-art performance on nearly all tasks among open-source models of similar size, and demonstrates competitive or even superior results compared to closed-source models such as Gemini-2.5-Flash on challenging tasks including Coding and GUI Agents. Meanwhile, the smaller GLM-4.1V-9B-Thinking remains highly competitive-achieving superior results to the much larger Qwen2.5-VL-72B on 29 benchmarks. We open-source both GLM-4.1V-9B-Thinking and GLM-4.5V. Code, models and more information are released at https://github.com/zai-org/GLM-V.
中文: GLM-4.1V-Thinking和GLM-4.5V视觉语言模型通过以推理为核心的训练框架和课程采样强化学习,在42个基准测试中取得顶尖性能,超越同类开源模型并与主流闭源模型相媲美。
English: The GLM-4.1V-Thinking and GLM-4.5V vision-language models achieve state-of-the-art performance across 42 benchmarks through a reasoning-centric training framework and Reinforcement Learning with Curriculum Sampling, surpassing similar-sized open-source models and competing with leading closed-source models.
Authors:Jack Nugent, Siyang Wu, Zeyu Ma, Beining Han, Meenal Parakh, Abhishek Joshi, Lingjie Mei, Alexander Raistrick, Xinyuan Li, Jia Deng
Abstract:
Recent years have witnessed substantial progress on monocular depth estimation, particularly as measured by the success of large models on standard benchmarks. However, performance on standard benchmarks does not offer a complete assessment, because most evaluate accuracy but not robustness. In this work, we introduce PDE (Procedural Depth Evaluation), a new benchmark which enables systematic robustness evaluation. PDE uses procedural generation to create 3D scenes that test robustness to various controlled perturbations, including object, camera, material and lighting changes. Our analysis yields interesting findings on what perturbations are challenging for state-of-the-art depth models, which we hope will inform further research. Code and data are available at https://github.com/princeton-vl/proc-depth-eval.
中文: 本文提出PDE基准测试,通过程序化生成三维场景来系统评估单目深度估计模型对各种受控干扰的鲁棒性,揭示了当前先进方法面临的挑战性场景。
English: This paper introduces PDE, a procedurally generated benchmark for systematically evaluating the robustness of monocular depth estimation models to various controlled perturbations, revealing challenging scenarios for state-of-the-art methods.
Authors:Yuheng Du, Sheng Yang, Lingxuan Wang, Zhenghua Hou, Chengying Cai, Zhitao Tan, Mingxia Chen, Shi-Sheng Huang, Qiang Li
Abstract:
While recent online HD mapping methods relieve burdened offline pipelines and solve map freshness, they remain limited by perceptual inaccuracies, occlusion in dense traffic, and an inability to fuse multi-agent observations. We propose RTMap to enhance these single-traversal methods by persistently crowdsourcing a multi-traversal HD map as a self-evolutional memory. On onboard agents, RTMap simultaneously addresses three core challenges in an end-to-end fashion: (1) Uncertainty-aware positional modeling for HD map elements, (2) probabilistic-aware localization w.r.t. the crowdsourced prior-map, and (3) real-time detection for possible road structural changes. Experiments on several public autonomous driving datasets demonstrate our solid performance on both the prior-aided map quality and the localization accuracy, demonstrating our effectiveness of robustly serving downstream prediction and planning modules while gradually improving the accuracy and freshness of the crowdsourced prior-map asynchronously. Our source-code will be made publicly available at https://github.com/CN-ADLab/RTMap.
中文: RTMap通过众包多轨迹高清地图作为自进化记忆,以不确定性建模、概率定位和实时变化检测增强单次轨迹建图,有效解决感知误差、遮挡及多智能体融合问题,显著提升地图质量与定位精度。
English: RTMap enhances single-traversal HD mapping by crowdsourcing a multi-traversal map as a self-evolving memory, addressing perceptual inaccuracies, occlusion, and multi-agent fusion through uncertainty-aware modeling, probabilistic localization, and real-time change detection to improve map quality and localization accuracy.
Authors:Dongyoon Hahm, Woogyeol Jin, June Suk Choi, Sungsoo Ahn, Kimin Lee
Abstract:
As autonomous agents powered by large language models (LLMs) continue to demonstrate potential across various assistive tasks, ensuring their safe and reliable behavior is crucial for preventing unintended consequences. In this work, we introduce CIP, a novel technique that leverages causal influence diagrams (CIDs) to identify and mitigate risks arising from agent decision-making. CIDs provide a structured representation of cause-and-effect relationships, enabling agents to anticipate harmful outcomes and make safer decisions. Our approach consists of three key steps: (1) initializing a CID based on task specifications to outline the decision-making process, (2) guiding agent interactions with the environment using the CID, and (3) iteratively refining the CID based on observed behaviors and outcomes. Experimental results demonstrate that our method effectively enhances safety in both code execution and mobile device control tasks.
中文: 本文提出CIP技术,通过因果影响图识别和减轻自主智能体决策风险,在代码执行和移动设备控制任务中有效提升了安全性。
English: This paper introduces CIP, a novel technique using causal influence diagrams to enhance the safety of autonomous agents by identifying and mitigating risks in decision-making, with experimental validation in code execution and mobile control tasks.
Authors:Ke Liu, Shuaike Shen, Hao Chen
Abstract:
The paradigm of large language models in natural language processing (NLP) has also shown promise in modeling biological languages, including proteins, RNA, and DNA. Both the auto-regressive generation paradigm and evaluation metrics have been transferred from NLP to biological sequence modeling. However, the intrinsic structural correlations in natural and biological languages differ fundamentally. Therefore, we revisit the notion of language in biological systems to better understand how NLP successes can be effectively translated to biological domains. By treating the 3D structure of biomolecules as the semantic content of a sentence and accounting for the strong correlations between residues or bases, we highlight the importance of structural evaluation and demonstrate the applicability of the auto-regressive paradigm in biological language modeling. Code can be found at \href{https://github.com/zjuKeLiu/RiFold}{github.com/zjuKeLiu/RiFold}
中文: 自然语言处理中的大语言模型在生物序列建模中展现出潜力,但需通过将生物分子三维结构视为语义内容并强调结构评估,来适应其内在结构差异。
English: Large language models from NLP show potential in biological sequence modeling, but require adaptation to account for intrinsic structural differences by treating 3D biomolecular structures as semantic content and emphasizing structural evaluation.
Authors:Xiaoxiao Long, Qingrui Zhao, Kaiwen Zhang, Zihao Zhang, Dingrui Wang, Yumeng Liu, Zhengjie Shu, Yi Lu, Shouzheng Wang, Xinzhe Wei, Wei Li, Wei Yin, Yao Yao, Jia Pan, Qiu Shen, Ruigang Yang, Xun Cao, Qionghai Dai
Abstract:
The pursuit of artificial general intelligence (AGI) has placed embodied intelligence at the forefront of robotics research. Embodied intelligence focuses on agents capable of perceiving, reasoning, and acting within the physical world. Achieving robust embodied intelligence requires not only advanced perception and control, but also the ability to ground abstract cognition in real-world interactions. Two foundational technologies, physical simulators and world models, have emerged as critical enablers in this quest. Physical simulators provide controlled, high-fidelity environments for training and evaluating robotic agents, allowing safe and efficient development of complex behaviors. In contrast, world models empower robots with internal representations of their surroundings, enabling predictive planning and adaptive decision-making beyond direct sensory input. This survey systematically reviews recent advances in learning embodied AI through the integration of physical simulators and world models. We analyze their complementary roles in enhancing autonomy, adaptability, and generalization in intelligent robots, and discuss the interplay between external simulation and internal modeling in bridging the gap between simulated training and real-world deployment. By synthesizing current progress and identifying open challenges, this survey aims to provide a comprehensive perspective on the path toward more capable and generalizable embodied AI systems. We also maintain an active repository that contains up-to-date literature and open-source projects at https://github.com/NJU3DV-LoongGroup/Embodied-World-Models-Survey.
中文: 本综述探讨物理仿真器与世界模型如何通过虚拟环境安全训练和内部表征增强现实决策,共同推动具身人工智能的发展。
English: This survey explores how physical simulators and world models jointly advance embodied AI by enabling safe training in virtual environments and enhancing real-world decision-making through internal representations.
Authors:Jindong Han, Yansong Ning, Zirui Yuan, Hang Ni, Fan Liu, Tengfei Lyu, Hao Liu
Abstract:
The long-standing vision of intelligent cities is to create efficient, livable, and sustainable urban environments using big data and artificial intelligence technologies. Recently, the advent of Large Language Models (LLMs) has opened new ways toward realizing this vision. With powerful semantic understanding and reasoning capabilities, LLMs can be deployed as intelligent agents capable of autonomously solving complex problems across domains. In this article, we focus on Urban LLM Agents, which are LLM-powered agents that are semi-embodied within the hybrid cyber-physical-social space of cities and used for system-level urban decision-making. First, we introduce the concept of urban LLM agents, discussing their unique capabilities and features. Second, we survey the current research landscape from the perspective of agent workflows, encompassing urban sensing, memory management, reasoning, execution, and learning. Third, we categorize the application domains of urban LLM agents into five groups: urban planning, transportation, environment, public safety, and urban society, presenting representative works in each group. Finally, we discuss trustworthiness and evaluation issues that are critical for real-world deployment, and identify several open problems for future research. This survey aims to establish a foundation for the emerging field of urban LLM agents and to provide a roadmap for advancing the intersection of LLMs and urban intelligence. A curated list of relevant papers and open-source resources is maintained and continuously updated at https://github.com/usail-hkust/Awesome-Urban-LLM-Agents.
中文摘要:大型语言模型作为智能代理正应用于城市系统级决策,通过其在语义理解和推理方面的强大能力,推动城市规划、交通、环境等领域的智能化发展。
English Summary: Large Language Models (LLMs) are emerging as intelligent agents for system-level urban decision-making, enabling advancements in urban intelligence through their semantic understanding and reasoning capabilities across various city domains.
Authors:Zifu Wan, Ce Zhang, Silong Yong, Martin Q. Ma, Simon Stepputtis, Louis-Philippe Morency, Deva Ramanan, Katia Sycara, Yaqi Xie
Abstract:
Recent Large Vision-Language Models (LVLMs) have introduced a new paradigm for understanding and reasoning about image input through textual responses. Although they have achieved remarkable performance across a range of multi-modal tasks, they face the persistent challenge of hallucination, which introduces practical weaknesses and raises concerns about their reliable deployment in real-world applications. Existing work has explored contrastive decoding approaches to mitigate this issue, where the output of the original LVLM is compared and contrasted with that of a perturbed version. However, these methods require two or more queries that slow down LVLM response generation, making them less suitable for real-time applications. To overcome this limitation, we propose ONLY, a training-free decoding approach that requires only a single query and a one-layer intervention during decoding, enabling efficient real-time deployment. Specifically, we enhance textual outputs by selectively amplifying crucial textual information using a text-to-visual entropy ratio for each token. Extensive experimental results demonstrate that our proposed ONLY consistently outperforms state-of-the-art methods across various benchmarks while requiring minimal implementation effort and computational cost. Code is available at https://github.com/zifuwan/ONLY.
中文摘要:本文提出的ONLY方法是一种无需训练的解码方案,通过文本-视觉熵比选择性增强关键文本信息,有效减少大型视觉语言模型的幻觉问题,能以最小计算成本实现实时部署。
English Summary: The proposed ONLY method is a training-free decoding approach that efficiently reduces hallucinations in Large Vision-Language Models by amplifying crucial textual information using a text-to-visual entropy ratio, enabling real-time deployment with minimal computational cost.
Authors:Ruihan Xu, Haokui Zhang, Yaowei Wang, Wei Zeng, Shiliang Zhang
Abstract:
The growing use of deep learning necessitates efficient network design and deployment, making neural predictors vital for estimating attributes such as accuracy and latency. Recently, Graph Neural Networks (GNNs) and transformers have shown promising performance in representing neural architectures. However, each of both methods has its disadvantages. GNNs lack the capabilities to represent complicated features, while transformers face poor generalization when the depth of architecture grows. To mitigate the above issues, we rethink neural architecture topology and show that sibling nodes are pivotal while overlooked in previous research. We thus propose a novel predictor leveraging the strengths of GNNs and transformers to learn the enhanced topology. We introduce a novel token mixer that considers siblings, and a new channel mixer named bidirectional graph isomorphism feed-forward network. Our approach consistently achieves promising performance in both accuracy and latency prediction, providing valuable insights for learning Directed Acyclic Graph (DAG) topology. The code is available at https://github.com/XuRuihan/NNFormer.
Chinese: 本研究提出了一种新型神经预测器,融合图神经网络与Transformer的优势,通过引入兄弟节点和双向图同构前馈网络来增强拓扑学习,在神经网络架构的准确性和延迟预测方面均取得优异性能。
English: This study introduces a novel neural predictor that combines Graph Neural Networks and transformers to enhance topology learning by incorporating sibling nodes and a bidirectional graph isomorphism feed-forward network, achieving superior accuracy and latency predictions for neural architectures.
Authors:Wei Li, Jiaman Tang, Yang Li, Beihao Xia, Ligang Tan, Hongmao Qin
Abstract:
Unmanned Aerial Vehicle (UAV) object detection has been widely used in traffic management, agriculture, emergency rescue, etc. However, it faces significant challenges, including occlusions, small object sizes, and irregular shapes. These challenges highlight the necessity for a robust and efficient multimodal UAV object detection method. Mamba has demonstrated considerable potential in multimodal image fusion. Leveraging this, we propose UAVD-Mamba, a multimodal UAV object detection framework based on Mamba architectures. To improve geometric adaptability, we propose the Deformable Token Mamba Block (DTMB) to generate deformable tokens by incorporating adaptive patches from deformable convolutions alongside normal patches from normal convolutions, which serve as the inputs to the Mamba Block. To optimize the multimodal feature complementarity, we design two separate DTMBs for the RGB and infrared (IR) modalities, with the outputs from both DTMBs integrated into the Mamba Block for feature extraction and into the Fusion Mamba Block for feature fusion. Additionally, to improve multiscale object detection, especially for small objects, we stack four DTMBs at different scales to produce multiscale feature representations, which are then sent to the Detection Neck for Mamba (DNM). The DNM module, inspired by the YOLO series, includes modifications to the SPPF and C3K2 of YOLOv11 to better handle the multiscale features. In particular, we employ cross-enhanced spatial attention before the DTMB and cross-channel attention after the Fusion Mamba Block to extract more discriminative features. Experimental results on the DroneVehicle dataset show that our method outperforms the baseline OAFA method by 3.6% in the mAP metric. Codes will be released at https://github.com/GreatPlum-hnu/UAVD-Mamba.git.
Chinese: 提出的UAVD-Mamba框架利用Mamba架构解决无人机目标检测中的遮挡和小物体等挑战,通过引入可变形令牌块和多模态融合,在DroneVehicle数据集上比基线方法mAP指标提升了3.6%。
English: The proposed UAVD-Mamba framework leverages Mamba architectures to address challenges in UAV object detection, such as occlusions and small objects, by introducing deformable token blocks and multimodal fusion, achieving a 3.6% mAP improvement over the baseline on the DroneVehicle dataset.
Authors:Alexander Hoyle, Lorena Calvo-Bartolomé, Jordan Boyd-Graber, Philip Resnik
Abstract:
Topic model and document-clustering evaluations either use automated metrics that align poorly with human preferences or require expert labels that are intractable to scale. We design a scalable human evaluation protocol and a corresponding automated approximation that reflect practitioners' real-world usage of models. Annotators -- or an LLM-based proxy -- review text items assigned to a topic or cluster, infer a category for the group, then apply that category to other documents. Using this protocol, we collect extensive crowdworker annotations of outputs from a diverse set of topic models on two datasets. We then use these annotations to validate automated proxies, finding that the best LLM proxies are statistically indistinguishable from a human annotator and can therefore serve as a reasonable substitute in automated evaluations. Package, web interface, and data are at https://github.com/ahoho/proxann
中文: 研究人员设计了一种可扩展的人工评估方案及相应的自动化LLM代理方法,通过让标注者或LLM推断主题类别并应用于文档评估,发现最优代理模型能达到与人工标注相媲美的评估效果。
English: Researchers developed a scalable human evaluation protocol and an automated LLM-based proxy that mimics human judgment for assessing topic models and document clustering, finding the best proxies perform comparably to human annotators.
Authors:Yasser El Jarida, Youssef Iraqi, Loubna Mekouar
Abstract:
Accurate particle size distribution (PSD) measurement is important in industries such as mining, pharmaceuticals, and fertilizer manufacturing, significantly influencing product quality and operational efficiency. Traditional PSD methods like sieve analysis and laser diffraction are manual, time-consuming, and limited by particle overlap. Recent developments in convolutional neural networks (CNNs) enable automated, real-time PSD estimation directly from particle images. In this work, we present a CNN-based methodology trained on realistic synthetic particle imagery generated using Blender's advanced rendering capabilities. Synthetic data sets using this method can replicate various industrial scenarios by systematically varying particle shapes, textures, lighting, and spatial arrangements that closely resemble the actual configurations. We evaluated three CNN-based architectures, ResNet-50, InceptionV3, and EfficientNet-B0, for predicting critical PSD parameters (d10, d50, d90). Results demonstrated comparable accuracy across models, with EfficientNet-B0 achieving the best computational efficiency suitable for real-time industrial deployment. This approach shows the effectiveness of realistic synthetic data for robust CNN training, which offers significant potential for automated industrial PSD monitoring. The code is released at : https://github.com/YasserElj/Synthetic-Granular-Gen
中文摘要:本研究提出基于卷积神经网络的方法,利用逼真合成颗粒图像实现粒度分布的自动化测量,其中EfficientNet-B0模型展现出最佳计算效率,适用于工业实时监测。
English Summary: This study introduces a CNN-based method using realistic synthetic particle images to automate particle size distribution measurement, with EfficientNet-B0 showing optimal computational efficiency for industrial deployment.
Authors:Minye Shao, Xingyu Miao, Haoran Duan, Zeyu Wang, Jingkun Chen, Yawen Huang, Xian Wu, Jingjing Deng, Yang Long, Yefeng Zheng
Abstract:
3D medical image generation is essential for data augmentation and patient privacy, calling for reliable and efficient models suited for clinical practice. However, current methods suffer from limited anatomical fidelity, restricted axial length, and substantial computational cost, placing them beyond reach for regions with limited resources and infrastructure. We introduce TRACE, a framework that generates 3D medical images with spatiotemporal alignment using a 2D multimodal-conditioned diffusion approach. TRACE models sequential 2D slices as video frame pairs, combining segmentation priors and radiology reports for anatomical alignment, incorporating optical flow to sustain temporal coherence. During inference, an overlapping-frame strategy links frame pairs into a flexible length sequence, reconstructed into a spatiotemporally and anatomically aligned 3D volume. Experimental results demonstrate that TRACE effectively balances computational efficiency with preserving anatomical fidelity and spatiotemporal consistency. Code is available at: https://github.com/VinyehShaw/TRACE.
中文: TRACE框架采用二维多模态条件扩散方法生成时空与解剖结构对齐的三维医学图像,在保证计算效率的同时有效解决了现有方法在解剖保真度和生成长度方面的局限性。
English: TRACE is a computationally efficient framework that generates spatiotemporally and anatomically aligned 3D medical images using a 2D multimodal-conditioned diffusion approach, overcoming limitations in fidelity and scalability of current methods.
Authors:Hendric Voss, Stefan Kopp
Abstract:
Generating accurate and realistic virtual human movements in real-time is of high importance for a variety of applications in computer graphics, interactive virtual environments, robotics, and biomechanics. This paper introduces a novel real-time inverse kinematics (IK) solver specifically designed for realistic human-like movement generation. Leveraging the automatic differentiation and just-in-time compilation of TensorFlow, the proposed solver efficiently handles complex articulated human skeletons with high degrees of freedom. By treating forward and inverse kinematics as differentiable operations, our method effectively addresses common challenges such as error accumulation and complicated joint limits in multi-constrained problems, which are critical for realistic human motion modeling. We demonstrate the solver's effectiveness on the SMPLX human skeleton model, evaluating its performance against widely used iterative-based IK algorithms, like Cyclic Coordinate Descent (CCD), FABRIK, and the nonlinear optimization algorithm IPOPT. Our experiments cover both simple end-effector tasks and sophisticated, multi-constrained problems with realistic joint limits. Results indicate that our IK solver achieves real-time performance, exhibiting rapid convergence, minimal computational overhead per iteration, and improved success rates compared to existing methods. The project code is available at https://github.com/hvoss-techfak/JAX-IK
中文: 本文提出了一种新颖的实时逆运动学求解器,利用TensorFlow的自动微分和即时编译技术高效生成逼真人体运动,在收敛速度和成功率方面优于现有方法。
English: This paper presents a novel real-time inverse kinematics solver that uses TensorFlow's automatic differentiation and just-in-time compilation to generate realistic human movements efficiently, outperforming existing methods in convergence speed and success rates.
Authors:Huaqiu Li, Yong Wang, Tongwen Huang, Hailang Huang, Haoqian Wang, Xiangxiang Chu
Abstract:
Unified image restoration is a significantly challenging task in low-level vision. Existing methods either make tailored designs for specific tasks, limiting their generalizability across various types of degradation, or rely on training with paired datasets, thereby suffering from closed-set constraints. To address these issues, we propose a novel, dataset-free, and unified approach through recurrent posterior sampling utilizing a pretrained latent diffusion model. Our method incorporates the multimodal understanding model to provide sematic priors for the generative model under a task-blind condition. Furthermore, it utilizes a lightweight module to align the degraded input with the generated preference of the diffusion model, and employs recurrent refinement for posterior sampling. Extensive experiments demonstrate that our method outperforms state-of-the-art methods, validating its effectiveness and robustness. Our code and data are available at https://github.com/AMAP-ML/LD-RPS.
中文: 本文提出了一种新颖、无需数据集的统一图像恢复方法,利用预训练的潜在扩散模型,结合多模态理解提供语义先验,并通过循环细化在任务无关条件下超越现有最优方法。
English: This paper introduces a novel, dataset-free unified image restoration method using a pretrained latent diffusion model, which incorporates multimodal understanding for semantic priors and recurrent refinement to outperform state-of-the-art approaches.
Authors:Dongyoon Hwang, Hojoon Lee, Jaegul Choo, Dongmin Park, Jongho Park
Abstract:
While reinforcement learning (RL) for large language models (LLMs) has shown promise in mathematical reasoning, strategic reasoning for LLMs using RL remains largely unexplored. We investigate whether LLMs can develop strategic reasoning capabilities through RL in chess. To this end, we leverage a chess-pretrained action-value network to provide dense reward on the LLM's output move quality, which can be seen as a form of knowledge distillation. Our experiments show that our distillation-based dense rewards often outperform sparse binary rewards. However, surprisingly, all models plateau far below expert levels. We provide SFT and RL ablations on chess reasoning training and find evidence that this limitation stems from a deficit in the pretrained models' internal understanding of chess-a deficit which RL alone may not be able to fully overcome. The code is available at https://github.com/krafton-ai/Chess-R1.
中文摘要:本研究通过使用国际象棋预训练网络提供的密集奖励进行强化学习,探索提升大语言模型在国际象棋中的策略推理能力,发现虽优于稀疏奖励,但因模型内在缺陷,性能仍远低于专家水平。
English Summary: This study explores enhancing large language models' strategic reasoning in chess through reinforcement learning with dense rewards from a chess-pretrained network, finding improved performance over sparse rewards but persistent limitations below expert levels due to inherent model deficits.
Authors:Xiao Zhang, Fei Wei, Yong Wang, Wenda Zhao, Feiyi Li, Xiangxiang Chu
Abstract:
Zero-shot domain adaptation (ZSDA) presents substantial challenges due to the lack of images in the target domain. Previous approaches leverage Vision-Language Models (VLMs) to tackle this challenge, exploiting their zero-shot learning capabilities. However, these methods primarily address domain distribution shifts and overlook the misalignment between the detection task and VLMs, which rely on manually crafted prompts. To overcome these limitations, we propose the unified prompt and representation enhancement (UPRE) framework, which jointly optimizes both textual prompts and visual representations. Specifically, our approach introduces a multi-view domain prompt that combines linguistic domain priors with detection-specific knowledge, and a visual representation enhancement module that produces domain style variations. Furthermore, we introduce multi-level enhancement strategies, including relative domain distance and positive-negative separation, which align multi-modal representations at the image level and capture diverse visual representations at the instance level, respectively. Extensive experiments conducted on nine benchmark datasets demonstrate the superior performance of our framework in ZSDA detection scenarios. Code is available at https://github.com/AMAP-ML/UPRE.
中文摘要:提出的UPRE框架通过联合优化文本提示和视觉表征,克服了零样本领域自适应中的关键挑战,在九个基准数据集上展现出卓越性能。
English Summary: The proposed UPRE framework overcomes limitations in zero-shot domain adaptation by jointly optimizing textual prompts and visual representations, achieving superior performance across nine benchmark datasets.
Authors:Zeming Chen, Hang Zhao
Abstract:
Multi-view image generation in autonomous driving demands consistent 3D scene understanding across camera views. Most existing methods treat this problem as a 2D image set generation task, lacking explicit 3D modeling. However, we argue that a structured representation is crucial for scene generation, especially for autonomous driving applications. This paper proposes BEV-VAE for consistent and controllable view synthesis. BEV-VAE first trains a multi-view image variational autoencoder for a compact and unified BEV latent space and then generates the scene with a latent diffusion transformer. BEV-VAE supports arbitrary view generation given camera configurations, and optionally 3D layouts. Experiments on nuScenes and Argoverse 2 (AV2) show strong performance in both 3D consistent reconstruction and generation. The code is available at: https://github.com/Czm369/bev-vae.
Chinese: BEV-VAE提出了一种创新的自动驾驶多视角图像生成方法,通过统一的BEV潜在空间和潜在扩散变换器,确保跨相机视角的3D一致性和可控性。
English: BEV-VAE introduces a novel approach for multi-view image generation in autonomous driving by leveraging a unified BEV latent space and a latent diffusion transformer to ensure 3D consistency and controllability across camera views.
Authors:Chong Zhang, Xichao Liu, Yibing Zhan, Dapeng Tao, Jun Ni, Jinwei Bu
Abstract:
Recent advancements in spaceborne GNSS missions have produced extensive global datasets, providing a robust basis for deep learning-based significant wave height (SWH) retrieval. While existing deep learning models predominantly utilize CYGNSS data with four-channel information, they often adopt single-channel inputs or simple channel concatenation without leveraging the benefits of cross-channel information interaction during training. To address this limitation, a novel spatial-channel attention-based network, namely SCAWaveNet, is proposed for SWH retrieval. Specifically, features from each channel of the DDMs are modeled as independent attention heads, enabling the fusion of spatial and channel-wise information. For auxiliary parameters, a lightweight attention mechanism is designed to assign weights along the spatial and channel dimensions. The final feature integrates both spatial and channel-level characteristics. Model performance is evaluated using four-channel CYGNSS data. When ERA5 is used as a reference, SCAWaveNet achieves an average RMSE of 0.438 m. When using buoy data from NDBC, the average RMSE reaches 0.432 m. Compared to state-of-the-art models, SCAWaveNet reduces the average RMSE by at least 3.52% on the ERA5 dataset and by 5.68% on the NDBC buoy observations. The code is available at https://github.com/Clifx9908/SCAWaveNet.
Chinese: 提出了一种名为SCAWaveNet的新型空间-通道注意力网络,通过有效融合CYGNSS数据的跨通道信息来提升有效波高反演精度,在ERA5和NDBC数据集上相比现有模型分别实现了至少3.52%和5.68%的均方根误差降低。
English: A novel spatial-channel attention network called SCAWaveNet is introduced to enhance significant wave height retrieval by effectively fusing cross-channel information from CYGNSS data, achieving improved accuracy with RMSE reductions of at least 3.52% and 5.68% over existing models on ERA5 and NDBC datasets, respectively.
Authors:Guoliang Duan, Mingwei Liu, Yanlin Wang, Chong Wang, Xin Peng, Zibin Zheng
Abstract:
Large language models (LLMs) have advanced significantly in code generation, yet their ability to follow complex programming instructions with layered and diverse constraints remains underexplored. Existing benchmarks often prioritize functional correctness, overlooking the nuanced requirements found in real-world development. We introduce MultiCodeIF, a comprehensive benchmark designed to evaluate instruction-following in code generation across multiple dimensions: constraint type, hierarchical levels, and iterative refinement. Built upon a structured taxonomy of 9 categories and 27 constraint types, MultiCodeIF enables granular assessment of both functional and non-functional instruction adherence. Using an automated pipeline, ConstraGen, we synthesize and evolve 2,021 code tasks sourced from 14 programming languages, supporting multi-turn evaluation through feedback-driven task variants. Empirical evaluation of six state-of-the-art LLMs uncovers substantial performance disparities. The top-performing model, Claude-3-7-Sonnet, achieves 63.0% average constraint satisfaction, while smaller models like Qwen3-1.7B fall to 44.8%. Models perform well on explicit constraints, but struggle with implicit or abstract constraints. Tasks with multiple hierarchical constraints significantly reduce model success rates, from 54.5% in single-level to just 18.8% in multi-level scenarios. However, structured feedback enables progressive improvement: average constraint satisfaction rises from 63.0% to 83.4% over four iterative refinement rounds. MultiCodeIF provides a scalable, constraint-aware, and feedback-sensitive framework to benchmark LLMs under realistic code generation scenarios, bridging the gap between synthetic evaluations and real-world instruction complexity. The full benchmark dataset, evaluation pipeline, and source code are available at https://github.com/SYSUSELab/MultiCodeIF.
中文:MultiCodeIF是一个综合性基准测试,旨在评估大语言模型在多维度约束条件下遵循复杂编程指令的能力,揭示了显著的性能差距,并通过反馈驱动的迭代优化实现持续改进。
English: MultiCodeIF is a comprehensive benchmark designed to evaluate how well large language models follow complex programming instructions across multiple constraint dimensions, revealing significant performance gaps and enabling iterative improvement through feedback-driven refinement.
Authors:Qihang Fan, Huaibo Huang, Yuang Ai, Ran He
Abstract:
As the core operator of Transformers, Softmax Attention exhibits excellent global modeling capabilities. However, its quadratic complexity limits its applicability to vision tasks. In contrast, Linear Attention shares a similar formulation with Softmax Attention while achieving linear complexity, enabling efficient global information modeling. Nevertheless, Linear Attention suffers from a significant performance degradation compared to standard Softmax Attention. In this paper, we analyze the underlying causes of this issue based on the formulation of Linear Attention. We find that, unlike Softmax Attention, Linear Attention entirely disregards the magnitude information of the Query. This prevents the attention score distribution from dynamically adapting as the Query scales. As a result, despite its structural similarity to Softmax Attention, Linear Attention exhibits a significantly different attention score distribution. Based on this observation, we propose Magnitude-Aware Linear Attention (MALA), which modifies the computation of Linear Attention to fully incorporate the Query's magnitude. This adjustment allows MALA to generate an attention score distribution that closely resembles Softmax Attention while exhibiting a more well-balanced structure. We evaluate the effectiveness of MALA on multiple tasks, including image classification, object detection, instance segmentation, semantic segmentation, natural language processing, speech recognition, and image generation. Our MALA achieves strong results on all of these tasks. Code will be available at https://github.com/qhfan/MALA
中文: 本文提出的幅度感知线性注意力(MALA)通过引入查询向量幅值计算,使线性注意力能生成接近Softmax注意力的分布,在多项任务中取得优异性能。
English: This paper introduces Magnitude-Aware Linear Attention (MALA), which addresses Linear Attention's performance gap by incorporating Query magnitude to mimic Softmax Attention's distribution, achieving strong results across diverse tasks.
Authors:Jan Nikolas Morshuis, Christian Schlarmann, Thomas Küstner, Christian F. Baumgartner, Matthias Hein
Abstract:
In recent years, accelerated MRI reconstruction based on deep learning has led to significant improvements in image quality with impressive results for high acceleration factors. However, from a clinical perspective image quality is only secondary; much more important is that all clinically relevant information is preserved in the reconstruction from heavily undersampled data. In this paper, we show that existing techniques, even when considering resampling for diffusion-based reconstruction, can fail to reconstruct small and rare pathologies, thus leading to potentially wrong diagnosis decisions (false negatives). To uncover the potentially missing clinical information we propose ``Semantically Diverse Reconstructions'' (\SDR), a method which, given an original reconstruction, generates novel reconstructions with enhanced semantic variability while all of them are fully consistent with the measured data. To evaluate \SDR automatically we train an object detector on the fastMRI+ dataset. We show that \SDR significantly reduces the chance of false-negative diagnoses (higher recall) and improves mean average precision compared to the original reconstructions. The code is available on https://github.com/NikolasMorshuis/SDR
中文摘要:深度学习加速的MRI重建可能遗漏微小或罕见病变,因此提出的语义多样性重建(SDR)方法通过生成多种重建结果来降低漏诊率并提升诊断精确度。
English Summary: Deep learning-based accelerated MRI reconstruction may miss small or rare pathologies, so the proposed Semantically Diverse Reconstructions (SDR) method generates multiple reconstructions to reduce false negatives and improve diagnostic accuracy.
Authors:Sihang Li, Wei Shi, Ziyuan Xie, Tao Liang, Guojun Ma, Xiang Wang
Abstract:
Reinforcement learning from human feedback (RLHF) is a key paradigm for aligning large language models (LLMs) with human values, yet the reward models at its core remain largely opaque. In this work, we present sparse Autoencoder For Enhanced Reward model (\textbf{SAFER}), a novel framework for interpreting and improving reward models through mechanistic analysis. Leveraging Sparse Autoencoders (SAEs), we uncover human-interpretable features in reward model activations, enabling insight into safety-relevant decision-making. We apply SAFER to safety-oriented preference datasets and quantify the salience of individual features by activation differences between chosen and rejected responses. Using these feature-level signals, we design targeted data poisoning and denoising strategies. Experiments show that SAFER can precisely degrade or enhance safety alignment with minimal data modification, without sacrificing general chat performance. Our approach contributes to interpreting, auditing and refining reward models in high-stakes LLM alignment tasks. Our codes are available at https://github.com/xzy-101/SAFER-code. \textit{This paper discusses topics related to large language model safety and may include discussions or examples that highlight potential risks or unsafe outcomes.}
中文:SAFER框架通过稀疏自编码器解析并改进强化学习人类反馈中的奖励模型,利用针对性数据策略精准调控安全对齐效果,且不影响通用聊天性能。
English: The SAFER framework utilizes sparse autoencoders to interpret and enhance reward models in RLHF, enabling targeted data manipulation that precisely adjusts safety alignment without compromising general performance.
Authors:Rusi Chen, Yuanting Yang, Jiezhi Yao, Hongning Song, Ji Zhang, Yongsong Zhou, Yuhao Huang, Ronghao Yang, Dan Jia, Yuhan Zhang, Xing Tao, Haoran Dou, Qing Zhou, Xin Yang, Dong Ni
Abstract:
Mitral regurgitation is one of the most prevalent cardiac disorders. Four-dimensional (4D) ultrasound has emerged as the primary imaging modality for assessing dynamic valvular morphology. However, 4D mitral valve (MV) analysis remains challenging due to limited phase annotations, severe motion artifacts, and poor imaging quality. Yet, the absence of inter-phase dependency in existing methods hinders 4D MV analysis. To bridge this gap, we propose a Motion-Topology guided consistency network (MTCNet) for accurate 4D MV ultrasound segmentation in semi-supervised learning (SSL). MTCNet requires only sparse end-diastolic and end-systolic annotations. First, we design a cross-phase motion-guided consistency learning strategy, utilizing a bi-directional attention memory bank to propagate spatio-temporal features. This enables MTCNet to achieve excellent performance both per- and inter-phase. Second, we devise a novel topology-guided correlation regularization that explores physical prior knowledge to maintain anatomically plausible. Therefore, MTCNet can effectively leverage structural correspondence between labeled and unlabeled phases. Extensive evaluations on the first largest 4D MV dataset, with 1408 phases from 160 patients, show that MTCNet performs superior cross-phase consistency compared to other advanced methods (Dice: 87.30%, HD: 1.75mm). Both the code and the dataset are available at https://github.com/crs524/MTCNet.
中文: 提出的运动-拓扑引导一致性网络(MTCNet)通过跨相位运动一致性学习和拓扑正则化方法,在仅需稀疏标注的情况下实现了对四维二尖瓣超声的精准分割,并在大规模临床数据上展现出优越性能。
English: The proposed Motion-Topology guided consistency network (MTCNet) addresses challenges in 4D mitral valve ultrasound segmentation by leveraging cross-phase motion consistency and topology regularization, achieving superior performance with only sparse annotations on a large clinical dataset.
Authors:Siyuan Yao, Rui Zhu, Ziqi Wang, Wenqi Ren, Yanyang Yan, Xiaochun Cao
Abstract:
Visual object tracking has gained promising progress in past decades. Most of the existing approaches focus on learning target representation in well-conditioned daytime data, while for the unconstrained real-world scenarios with adverse weather conditions, e.g. nighttime or foggy environment, the tremendous domain shift leads to significant performance degradation. In this paper, we propose UMDATrack, which is capable of maintaining high-quality target state prediction under various adverse weather conditions within a unified domain adaptation framework. Specifically, we first use a controllable scenario generator to synthesize a small amount of unlabeled videos (less than 2% frames in source daytime datasets) in multiple weather conditions under the guidance of different text prompts. Afterwards, we design a simple yet effective domain-customized adapter (DCA), allowing the target objects' representation to rapidly adapt to various weather conditions without redundant model updating. Furthermore, to enhance the localization consistency between source and target domains, we propose a target-aware confidence alignment module (TCA) following optimal transport theorem. Extensive experiments demonstrate that UMDATrack can surpass existing advanced visual trackers and lead new state-of-the-art performance by a significant margin. Our code is available at https://github.com/Z-Z188/UMDATrack.
中文: UMDATrack提出了一种统一域适应框架,通过可控场景生成器、域定制适配器和目标感知置信度对齐模块,在多种恶劣天气条件下保持高质量视觉目标跟踪,实现了最先进的性能。
English: UMDATrack introduces a unified domain adaptation framework that maintains high-quality visual object tracking across adverse weather conditions through a controllable scenario generator, domain-customized adapter, and target-aware confidence alignment module, achieving state-of-the-art performance.
Authors:Yupeng Zheng, Pengxuan Yang, Zebin Xing, Qichao Zhang, Yuhang Zheng, Yinfeng Gao, Pengfei Li, Teng Zhang, Zhongpu Xia, Peng Jia, Dongbin Zhao
Abstract:
End-to-end autonomous driving directly generates planning trajectories from raw sensor data, yet it typically relies on costly perception supervision to extract scene information. A critical research challenge arises: constructing an informative driving world model to enable perception annotation-free, end-to-end planning via self-supervised learning. In this paper, we present World4Drive, an end-to-end autonomous driving framework that employs vision foundation models to build latent world models for generating and evaluating multi-modal planning trajectories. Specifically, World4Drive first extracts scene features, including driving intention and world latent representations enriched with spatial-semantic priors provided by vision foundation models. It then generates multi-modal planning trajectories based on current scene features and driving intentions and predicts multiple intention-driven future states within the latent space. Finally, it introduces a world model selector module to evaluate and select the best trajectory. We achieve perception annotation-free, end-to-end planning through self-supervised alignment between actual future observations and predicted observations reconstructed from the latent space. World4Drive achieves state-of-the-art performance without manual perception annotations on both the open-loop nuScenes and closed-loop NavSim benchmarks, demonstrating an 18.1\% relative reduction in L2 error, 46.7% lower collision rate, and 3.75 faster training convergence. Codes will be accessed at https://github.com/ucaszyp/World4Drive.
中文: World4Drive是一种端到端自动驾驶框架,利用视觉基础模型构建潜在世界模型来生成和评估多模态规划轨迹,通过自监督学习无需人工感知标注即可实现最先进的性能。
English: World4Drive is an end-to-end autonomous driving framework that uses vision foundation models to create latent world models for generating and evaluating multi-modal planning trajectories, achieving state-of-the-art performance without manual perception annotations through self-supervised learning.
Authors:Hao Tang, Zhiqing Guo, Liejun Wang, Chao Liu
Abstract:
In recent years, it has been found that "grandmother cells" in the primary visual cortex (V1) of macaques can directly recognize visual input with complex shapes. This inspires us to examine the value of these cells in promoting the research of medical image segmentation. In this paper, we design a Similarity Memory Prior Network (Sim-MPNet) for medical image segmentation. Specifically, we propose a Dynamic Memory Weights-Loss Attention (DMW-LA), which matches and remembers the category features of specific lesions or organs in medical images through the similarity memory prior in the prototype memory bank, thus helping the network to learn subtle texture changes between categories. DMW-LA also dynamically updates the similarity memory prior in reverse through Weight-Loss Dynamic (W-LD) update strategy, effectively assisting the network directly extract category features. In addition, we propose the Double-Similarity Global Internal Enhancement Module (DS-GIM) to deeply explore the internal differences in the feature distribution of input data through cosine similarity and euclidean distance. Extensive experiments on four public datasets show that Sim-MPNet has better segmentation performance than other state-of-the-art methods. Our code is available on https://github.com/vpsg-research/Sim-MPNet.
中文:受猕猴视觉皮层"祖母细胞"启发,本文提出Sim-MPNet模型,通过DMW-LA和DS-GIM模块实现医学图像精准分割,在多个公开数据集上取得领先性能。
English: Inspired by "grandmother cells" in macaques' visual cortex, this paper introduces Sim-MPNet with DMW-LA and DS-GIM modules for enhanced medical image segmentation, achieving state-of-the-art results on multiple datasets.
Authors:Kai Zhou, Shuhai Zhang, Zeng You, Jinwu Hu, Mingkui Tan, Fei Liu
Abstract:
Zero-shot skeleton-based action recognition aims to classify unseen skeleton-based human actions without prior exposure to such categories during training. This task is extremely challenging due to the difficulty in generalizing from known to unknown actions. Previous studies typically use two-stage training: pre-training skeleton encoders on seen action categories using cross-entropy loss and then aligning pre-extracted skeleton and text features, enabling knowledge transfer to unseen classes through skeleton-text alignment and language models' generalization. However, their efficacy is hindered by 1) insufficient discrimination for skeleton features, as the fixed skeleton encoder fails to capture necessary alignment information for effective skeleton-text alignment; 2) the neglect of alignment bias between skeleton and unseen text features during testing. To this end, we propose a prototype-guided feature alignment paradigm for zero-shot skeleton-based action recognition, termed PGFA. Specifically, we develop an end-to-end cross-modal contrastive training framework to improve skeleton-text alignment, ensuring sufficient discrimination for skeleton features. Additionally, we introduce a prototype-guided text feature alignment strategy to mitigate the adverse impact of the distribution discrepancy during testing. We provide a theoretical analysis to support our prototype-guided text feature alignment strategy and empirically evaluate our overall PGFA on three well-known datasets. Compared with the top competitor SMIE method, our PGFA achieves absolute accuracy improvements of 22.96%, 12.53%, and 18.54% on the NTU-60, NTU-120, and PKU-MMD datasets, respectively.
中文: 本文提出PGFA原型引导特征对齐框架,通过端到端跨模态对比训练和原型引导策略增强骨架-文本对齐并减少分布差异,在多个数据集上显著提升了零样本骨架动作识别的准确率。
English: This paper introduces PGFA, an end-to-end prototype-guided feature alignment framework that enhances skeleton-text alignment and mitigates distribution discrepancies to significantly improve zero-shot skeleton-based action recognition accuracy across multiple datasets.
Authors:Jiajie Zhang, Shenrui Wu, Xu Ma, Sören Schwertfeger
Abstract:
The deployment of autonomous mobile robots is predicated on the availability of environmental maps, yet conventional generation via SLAM (Simultaneous Localization and Mapping) suffers from significant limitations in time, labor, and robustness, particularly in dynamic, large-scale indoor environments where map obsolescence can lead to critical localization failures. To address these challenges, this paper presents a complete and automated system for converting architectural Computer-Aided Design (CAD) files into a hierarchical topometric OpenStreetMap (OSM) representation, tailored for robust life-long robot navigation. Our core methodology involves a multi-stage pipeline that first isolates key structural layers from the raw CAD data and then employs an AreaGraph-based topological segmentation to partition the building layout into a hierarchical graph of navigable spaces. This process yields a comprehensive and semantically rich map, further enhanced by automatically associating textual labels from the CAD source and cohesively merging multiple building floors into a unified, topologically-correct model. By leveraging the permanent structural information inherent in CAD files, our system circumvents the inefficiencies and fragility of SLAM, offering a practical and scalable solution for deploying robots in complex indoor spaces. The software is encapsulated within an intuitive Graphical User Interface (GUI) to facilitate practical use. The code and dataset are available at https://github.com/jiajiezhang7/osmAG-from-cad.
中文摘要:本文提出一种自动化系统,将建筑CAD文件转换为分层OpenStreetMap拓扑地图,通过利用CAD固有结构信息规避SLAM技术的缺陷,为动态室内环境中的机器人提供鲁棒且可扩展的导航解决方案。
English Summary: This paper introduces an automated system that converts architectural CAD files into hierarchical OpenStreetMap representations, bypassing SLAM limitations to provide robust and scalable robot navigation in dynamic indoor environments.
Authors:Ruize Cui, Jiaan Zhang, Jialun Pei, Kai Wang, Pheng-Ann Heng, Jing Qin
Abstract:
Liver landmarks provide crucial anatomical guidance to the surgeon during laparoscopic liver surgery to minimize surgical risk. However, the tubular structural properties of landmarks and dynamic intraoperative deformations pose significant challenges for automatic landmark detection. In this study, we introduce TopoNet, a novel topology-constrained learning framework for laparoscopic liver landmark detection. Our framework adopts a snake-CNN dual-path encoder to simultaneously capture detailed RGB texture information and depth-informed topological structures. Meanwhile, we propose a boundary-aware topology fusion (BTF) module, which adaptively merges RGB-D features to enhance edge perception while preserving global topology. Additionally, a topological constraint loss function is embedded, which contains a center-line constraint loss and a topological persistence loss to ensure homotopy equivalence between predictions and labels. Extensive experiments on L3D and P2ILF datasets demonstrate that TopoNet achieves outstanding accuracy and computational complexity, highlighting the potential for clinical applications in laparoscopic liver surgery. Our code will be available at https://github.com/cuiruize/TopoNet.
Chinese: TopoNet是一种新颖的拓扑约束学习框架,通过自适应特征融合和专用损失函数整合RGB纹理与深度拓扑结构,显著提升了腹腔镜肝脏标志物检测的精度与效率,在临床数据集中表现卓越。
English: TopoNet is a novel topology-constrained learning framework that enhances laparoscopic liver landmark detection by integrating RGB texture and depth-informed topological structures through adaptive feature fusion and specialized loss functions, demonstrating superior accuracy and efficiency in clinical datasets.
Authors:Vincent Duchêne, Johanna Ulvedal Marstrander
Abstract:
We discuss the rigorous justification of the spatial discretization by means of Fourier spectral methods of quasilinear first-order hyperbolic systems. We provide uniform stability estimates that grant spectral convergence of the (spatially) semi-discretized solutions towards the corresponding continuous solution provided that the underlying system satisfies some suitable structural assumptions. We consider a setting with sharp low-pass filters and a setting with smooth low-pass filters and argue that - at least theoretically - smooth low-pass filters are operable on a larger class of systems. While our theoretical results are supported with numerical evidence, we also pinpoint some behavior of the numerical method that currently has no theoretical explanation.
中文摘要:本研究严格论证了傅里叶谱方法在拟线性双曲系统中的离散化应用,在满足结构假设条件下证明了谱收敛性,同时指出了尚未得到理论解释的数值现象。
English Summary: This study rigorously justifies the Fourier spectral method for discretizing quasilinear hyperbolic systems, demonstrating spectral convergence under structural assumptions while noting unexplained numerical behaviors.
Authors:Haoran Lou, Chunxiao Fan, Ziyan Liu, Yuexin Wu, Xinliang Wang
Abstract:
The architecture of multimodal large language models (MLLMs) commonly connects a vision encoder, often based on CLIP-ViT, to a large language model. While CLIP-ViT works well for capturing global image features, it struggles to model local relationships between adjacent patches, leading to weaker visual representation, which in turn affects the detailed understanding ability of MLLMs. To solve this, we propose LLaVA-SP, which only adds six spatial visual tokens to the original visual tokens to enhance the visual representation. Our approach offers three key advantages: 1) We propose a novel Projector, which uses convolutional kernels to derive visual spatial tokens from ViT patch features, simulating two visual spatial ordering approaches: "from central region to global" and "from abstract to specific". Then, a cross-attention mechanism is applied to fuse fine-grained visual information, enriching the overall visual representation. 2) We present two model variants: LLaVA-SP-Cropping, which focuses on detail features through progressive cropping, and LLaVA-SP-Pooling, which captures global semantics through adaptive pooling, enabling the model to handle diverse visual understanding tasks. 3) Extensive experiments show that LLaVA-SP, fine-tuned with LoRA, achieves significant performance improvements across various multimodal benchmarks, outperforming the state-of-the-art LLaVA-1.5 model in multiple tasks with nearly identical inference latency. The code and models are available at https://github.com/CnFaker/LLaVA-SP.
Chinese: 提出的LLaVA-SP模型通过新型投影器添加空间视觉标记,显著提升了多模态模型的视觉表征能力,在多项基准测试中超越LLaVA-1.5且推理延迟几乎不变。
English: The proposed LLaVA-SP model enhances multimodal language models by adding spatial visual tokens through a novel projector, significantly improving visual representation and outperforming LLaVA-1.5 across various benchmarks with minimal latency increase.
Authors:Yongzhen Wang, Liangliang Chen, Bingwen Hu, Heng Liu, Xiao-Ping Zhang, Mingqiang Wei
Abstract:
Recent progress in image restoration has underscored Spatial State Models (SSMs) as powerful tools for modeling long-range dependencies, owing to their appealing linear complexity and computational efficiency. However, SSM-based approaches exhibit limitations in reconstructing localized structures and tend to be less effective when handling high-dimensional data, frequently resulting in suboptimal recovery of fine image features. To tackle these challenges, we introduce Laplace-Mamba, a novel framework that integrates Laplace frequency prior with a hybrid Mamba-CNN architecture for efficient image dehazing. Leveraging the Laplace decomposition, the image is disentangled into low-frequency components capturing global texture and high-frequency components representing edges and fine details. This decomposition enables specialized processing via dual parallel pathways: the low-frequency branch employs SSMs for global context modeling, while the high-frequency branch utilizes CNNs to refine local structural details, effectively addressing diverse haze scenarios. Notably, the Laplace transformation facilitates information-preserving downsampling of low-frequency components in accordance with the Nyquist theory, thereby significantly improving computational efficiency. Extensive evaluations across multiple benchmarks demonstrate that our method outperforms state-of-the-art approaches in both restoration quality and efficiency. The source code and pretrained models are available at https://github.com/yz-wang/Laplace-Mamba.
中文摘要:提出的Laplace-Mamba框架通过拉普拉斯频率分解与混合Mamba-CNN架构相结合,利用并行通路分别处理全局纹理和局部细节,有效解决图像去雾问题,在恢复质量和效率方面均优于现有方法。
English Summary: The proposed Laplace-Mamba framework combines Laplace frequency decomposition with a hybrid Mamba-CNN architecture to effectively address image dehazing by processing global textures and local details through parallel pathways, achieving superior restoration quality and efficiency.
Authors:Zijian Chen, Yuan Tian, Yuze Sun, Wei Sun, Zicheng Zhang, Weisi Lin, Guangtao Zhai, Wenjun Zhang
Abstract:
Just noticeable difference (JND), the minimum change that the human visual system (HVS) can perceive, has been studied for decades. Although recent work has extended this line of research into machine vision, there has been a scarcity of studies systematically exploring its perceptual boundaries across multiple tasks and stimulus types, particularly in the current era of rapidly advancing large multimodal models (LMMs), where studying the multifaceted capabilities of models has become a mainstream focus. Moreover, the perceptual defects of LMMs are not investigated thoroughly, resulting in potential security issues and suboptimal response efficiency. In this paper, we take an initial attempt and demonstrate that there exist significant visual blind spots in current LMMs. To systemically quantify this characteristic, we propose a new concept, {\bf LMM-JND}, together with its determination pipeline. Targeting uncovering the behavior commonalities in HVS-aligned visual perception tasks, we delve into several LMM families and construct a large-scale dataset, named VPA-JND, which contains 21.5k reference images with over 489k stimuli across 12 distortion types, to facilitate LMM-JND studies. VPA-JND exposes areas where state-of-the-art LMMs, including GPT-4o and the InternVL2.5 series, struggle with basic comparison queries and fall significantly short of human-level visual performance. We further explore the effects of vision and language backbones and find a notable correlation between their design philosophy that may instruct the future refinement of LMMs for their visual acuity. Together, our research underscores the significance of LMM-JND as a unique perspective for studying LMMs, and predictable LMM-JND is crucial for security concerns. This work will be available at https://github.com/zijianchen98/LMM-JND.
中文: 本研究提出LMM-JND概念系统量化大型多模态模型的视觉盲区,通过构建大规模数据集和分析,揭示了当前先进模型在视觉感知能力上与人类存在的显著差距。
English: This study introduces LMM-JND to systematically quantify visual blind spots in large multimodal models (LMMs), revealing their significant perceptual limitations compared to humans through a comprehensive dataset and analysis.
Authors:Jianghao Lin, Xinyuan Wang, Xinyi Dai, Menghui Zhu, Bo Chen, Ruiming Tang, Yong Yu, Weinan Zhang
Abstract:
Tool retrieval is a critical component in enabling large language models (LLMs) to interact effectively with external tools. It aims to precisely filter the massive tools into a small set of candidates for the downstream tool-augmented LLMs. However, most existing approaches primarily focus on optimizing tool representations, often neglecting the importance of precise query comprehension. To address this gap, we introduce MassTool, a multi-task search-based framework designed to enhance both query representation and tool retrieval accuracy. MassTool employs a two-tower architecture: a tool usage detection tower that predicts the need for function calls, and a tool retrieval tower that leverages a query-centric graph convolution network (QC-GCN) for effective query-tool matching. It also incorporates search-based user intent modeling (SUIM) to handle diverse and out-of-distribution queries, alongside an adaptive knowledge transfer (AdaKT) module for efficient multi-task learning. By jointly optimizing tool usage detection loss, list-wise retrieval loss, and contrastive regularization loss, MassTool establishes a robust dual-step sequential decision-making pipeline for precise query understanding. Extensive experiments demonstrate its effectiveness in improving retrieval accuracy. Our code is available at https://github.com/wxydada/MassTool.
中文摘要:MassTool是一个多任务框架,通过双塔架构和基于搜索的用户意图建模来增强大语言模型的工具检索能力,有效提升了查询与工具的匹配精度。
English Summary: MassTool is a multi-task framework that enhances tool retrieval for LLMs by improving query representation through a two-tower architecture and search-based intent modeling, achieving higher accuracy in matching queries with appropriate tools.
Authors:Weiran Guo, Guanjun Liu, Ziyuan Zhou, Ling Wang
Abstract:
Reinforcement Learning (RL) is widely used in tasks where agents interact with an environment to maximize rewards. Building on this foundation, Safe Reinforcement Learning (Safe RL) incorporates a cost metric alongside the reward metric, ensuring that agents adhere to safety constraints during decision-making. In this paper, we identify that Safe RL is vulnerable to backdoor attacks, which can manipulate agents into performing unsafe actions. First, we introduce the relevant concepts and evaluation metrics for backdoor attacks in Safe RL. It is the first attack framework in the Safe RL field that involves both Positive and Negative Action sample (PNAct) is to implant backdoors, where positive action samples provide reference actions and negative action samples indicate actions to be avoided. We theoretically point out the properties of PNAct and design an attack algorithm. Finally, we conduct experiments to evaluate the effectiveness of our proposed backdoor attack framework, evaluating it with the established metrics. This paper highlights the potential risks associated with Safe RL and underscores the feasibility of such attacks. Our code and supplementary material are available at https://github.com/azure-123/PNAct.
Chinese: 本文揭示了安全强化学习易受后门攻击的风险,通过提出正负动作样本(PNAct)框架,在正常条件下维持智能体性能的同时诱导其执行危险动作。
English: This paper reveals that Safe Reinforcement Learning (Safe RL) is susceptible to backdoor attacks through a novel Positive and Negative Action sample (PNAct) framework, which manipulates agents into unsafe actions while maintaining performance under normal conditions.
Authors:Kiyoung Om, Kyuil Sim, Taeyoung Yun, Hyeongyu Kang, Jinkyoo Park
Abstract:
Optimizing high-dimensional black-box functions under black-box constraints is a pervasive task in a wide range of scientific and engineering problems. These problems are typically harder than unconstrained problems due to hard-to-find feasible regions. While Bayesian optimization (BO) methods have been developed to solve such problems, they often struggle with the curse of dimensionality. Recently, generative model-based approaches have emerged as a promising alternative for constrained optimization. However, they suffer from poor scalability and are vulnerable to mode collapse, particularly when the target distribution is highly multi-modal. In this paper, we propose a new framework to overcome these challenges. Our method iterates through two stages. First, we train flow-based models to capture the data distribution and surrogate models that predict both function values and constraint violations with uncertainty quantification. Second, we cast the candidate selection problem as a posterior inference problem to effectively search for promising candidates that have high objective values while not violating the constraints. During posterior inference, we find that the posterior distribution is highly multi-modal and has a large plateau due to constraints, especially when constraint feedback is given as binary indicators of feasibility. To mitigate this issue, we amortize the sampling from the posterior distribution in the latent space of flow-based models, which is much smoother than that in the data space. We empirically demonstrate that our method achieves superior performance on various synthetic and real-world constrained black-box optimization tasks. Our code is publicly available \href{https://github.com/umkiyoung/CiBO}{here}.
中文: 本文提出了一种结合基于流的生成模型与代理模型的新框架,通过潜在空间中的摊销采样有效处理高维约束黑盒优化中的多模态后验分布问题。
English: This paper introduces a novel framework for high-dimensional constrained black-box optimization that combines flow-based generative models with surrogate modeling to efficiently handle multi-modal posterior distributions and overcome scalability issues.
Authors:Yaofei Duan, Yuhao Huang, Xin Yang, Luyi Han, Xinyu Xie, Zhiyuan Zhu, Ping He, Ka-Hou Chan, Ligang Cui, Sio-Kei Im, Dong Ni, Tao Tan
Abstract:
Deep learning-based diagnostic models often suffer performance drops due to distribution shifts between training (source) and test (target) domains. Collecting and labeling sufficient target domain data for model retraining represents an optimal solution, yet is limited by time and scarce resources. Active learning (AL) offers an efficient approach to reduce annotation costs while maintaining performance, but struggles to handle the challenge posed by distribution variations across different datasets. In this study, we propose a novel unsupervised Active learning framework for Domain Adaptation, named ADAptation, which efficiently selects informative samples from multi-domain data pools under limited annotation budget. As a fundamental step, our method first utilizes the distribution homogenization capabilities of diffusion models to bridge cross-dataset gaps by translating target images into source-domain style. We then introduce two key innovations: (a) a hypersphere-constrained contrastive learning network for compact feature clustering, and (b) a dual-scoring mechanism that quantifies and balances sample uncertainty and representativeness. Extensive experiments on four breast ultrasound datasets (three public and one in-house/multi-center) across five common deep classifiers demonstrate that our method surpasses existing strong AL-based competitors, validating its effectiveness and generalization for clinical domain adaptation. The code is available at the anonymized link: https://github.com/miccai25-966/ADAptation.
中文: 本研究提出名为ADAptation的无监督主动学习框架,通过扩散模型实现风格迁移,结合对比学习和双评分机制,在有限标注预算下高效选择信息样本,有效解决诊断模型因域偏移导致的性能下降问题,在多个乳腺超声数据集中展现出优越性能。
English: This study introduces ADAptation, an unsupervised active learning framework that addresses performance degradation in diagnostic models due to domain shifts by using diffusion models for style translation and incorporating contrastive learning with a dual-scoring mechanism to efficiently select informative samples under limited annotation budgets, demonstrating superior performance across multiple breast ultrasound datasets.
Authors:Xin Luo, Menglin Zhang, Yunwei Lan, Tianyu Zhang, Rui Li, Chang Liu, Dong Liu
Abstract:
The Perception-Distortion tradeoff (PD-tradeoff) theory suggests that face restoration algorithms must balance perceptual quality and fidelity. To achieve minimal distortion while maintaining perfect perceptual quality, Posterior-Mean Rectified Flow (PMRF) proposes a flow based approach where source distribution is minimum distortion estimations. Although PMRF is shown to be effective, its pixel-space modeling approach limits its ability to align with human perception, where human perception is defined as how humans distinguish between two image distributions. In this work, we propose Latent-PMRF, which reformulates PMRF in the latent space of a variational autoencoder (VAE), facilitating better alignment with human perception during optimization. By defining the source distribution on latent representations of minimum distortion estimation, we bound the minimum distortion by the VAE's reconstruction error. Moreover, we reveal the design of VAE is crucial, and our proposed VAE significantly outperforms existing VAEs in both reconstruction and restoration. Extensive experiments on blind face restoration demonstrate the superiority of Latent-PMRF, offering an improved PD-tradeoff compared to existing methods, along with remarkable convergence efficiency, achieving a 5.79X speedup over PMRF in terms of FID. Our code will be available as open-source.
Chinese: Latent-PMRF通过在后验均值整流流中引入变分自编码器的潜在空间重构,显著提升了人脸恢复的感知-失真平衡,并实现了比现有方法快5.79倍的收敛效率。
English: Latent-PMRF enhances face restoration by reformulating the Posterior-Mean Rectified Flow in a variational autoencoder's latent space, achieving superior perceptual-distortion tradeoff and a 5.79X speedup in convergence efficiency over prior methods.
Authors:Chengjie Liu, Jiajia Li, Yabing Feng, Wenhao Huang, Weiyu Chen, Yuan Du, Jun Yang, Li Du
Abstract:
Analog circuit design consists of the pre-layout and layout phases. Among them, the pre-layout phase directly decides the final circuit performance, but heavily depends on experienced engineers to do manual design according to specific application scenarios. To overcome these challenges and automate the analog circuit pre-layout design phase, we introduce DiffCkt: a diffusion model-based hybrid neural network framework for the automatic transistor-level generation of analog circuits, which can directly generate corresponding circuit structures and device parameters tailored to specific performance requirements. To more accurately quantify the efficiency of circuits generated by DiffCkt, we introduce the Circuit Generation Efficiency Index (CGEI), which is determined by both the figure of merit (FOM) of a single generated circuit and the time consumed. Compared with relative research, DiffCkt has improved CGEI by a factor of $2.21 \sim 8365\times$, reaching a state-of-the-art (SOTA) level. In conclusion, this work shows that the diffusion model has the remarkable ability to learn and generate analog circuit structures and device parameters, providing a revolutionary method for automating the pre-layout design of analog circuits. The circuit dataset will be open source, its preview version is available at https://github.com/CjLiu-NJU/DiffCkt.
中文: DiffCkt是一种基于扩散模型的框架,能根据特定性能要求自动生成模拟电路的预布局结构和器件参数,其电路生成效率指数比现有研究提高了2.21至8365倍,达到了领先水平。
English: DiffCkt is a diffusion model-based framework that automates analog circuit pre-layout design by generating tailored circuit structures and parameters, achieving state-of-the-art efficiency with a Circuit Generation Efficiency Index improvement of 2.21 to 8365 times compared to existing methods.
Authors:Yujia Yin, Tianyi Qu, Zihao Wang, Yifan Chen
Abstract:
Through recognizing causal subgraphs, causal graph learning (CGL) has risen to be a promising approach for improving the generalizability of graph neural networks under out-of-distribution (OOD) scenarios. However, the empirical successes of CGL techniques are mostly exemplified in classification settings, while regression tasks, a more challenging setting in graph learning, are overlooked. We thus devote this work to tackling causal graph regression (CGR); to this end we reshape the processing of confounding effects in existing CGL studies, which mainly deal with classification. Specifically, we reflect on the predictive power of confounders in graph-level regression, and generalize classification-specific causal intervention techniques to regression through a lens of contrastive learning. Extensive experiments on graph OOD benchmarks validate the efficacy of our proposals for CGR. The model implementation and the code are provided on https://github.com/causal-graph/CGR.
Chinese: 本研究通过对比学习将因果干预技术从分类任务推广到回归任务,提出了因果图回归方法,在分布外图学习基准上的大量实验验证了其有效性。
English: This work introduces causal graph regression (CGR) by adapting causal intervention techniques from classification to regression through contrastive learning, enhancing generalizability in out-of-distribution graph scenarios as validated by extensive experiments.
Authors:Huanxin Yang, Qiwen Wang
Abstract:
Handwritten mathematical expression recognition (HMER) suffers from complex formula structures and character layouts in sequence prediction. In this paper, we incorporate frequency domain analysis into HMER and propose a method that marries frequency domain with HMER (MFH), leveraging the discrete cosine transform (DCT). We emphasize the structural analysis assistance of frequency information for recognizing mathematical formulas. When implemented on various baseline models, our network exhibits a consistent performance enhancement, demonstrating the efficacy of frequency domain information. Experiments show that our MFH-CoMER achieves noteworthy accuracyrates of 61.66%/62.07%/63.72% on the CROHME 2014/2016/2019 test sets. The source code is available at https://github.com/Hryxyhe/MFH.
中文: 本文提出MFH方法,通过引入离散余弦变换进行频域分析来增强手写数学表达式识别,有效提升结构分析能力,在多种基线模型上均实现性能提升,并在CROHME数据集上取得了显著准确率。
English: This paper introduces MFH, a method that integrates frequency domain analysis using discrete cosine transform to enhance handwritten mathematical expression recognition by improving structural analysis, which consistently boosts performance across baseline models and achieves high accuracy on CROHME datasets.
Authors:Xin Xu, Eibe Frank, Geoffrey Holmes
Abstract:
We investigate cross-domain few-shot learning under the constraint that fine-tuning of backbones (i.e., feature extractors) is impossible or infeasible -- a scenario that is increasingly common in practical use cases. Handling the low-quality and static embeddings produced by frozen, "black-box" backbones leads to a problem representation of few-shot classification as a series of multiple instance verification (MIV) tasks. Inspired by this representation, we introduce a novel approach to few-shot domain adaptation, named the "MIV-head", akin to a classification head that is agnostic to any pretrained backbone and computationally efficient. The core components designed for the MIV-head, when trained on few-shot data from a target domain, collectively yield strong performance on test data from that domain. Importantly, it does so without fine-tuning the backbone, and within the "meta-testing" phase. Experimenting under various settings and on an extension of the Meta-dataset benchmark for cross-domain few-shot image classification, using representative off-the-shelf convolutional neural network and vision transformer backbones pretrained on ImageNet1K, we show that the MIV-head achieves highly competitive accuracy when compared to state-of-the-art "adapter" (or partially fine-tuning) methods applied to the same backbones, while incurring substantially lower adaptation cost. We also find well-known "classification head" approaches lag far behind in terms of accuracy. Ablation study empirically justifies the core components of our approach. We share our code at https://github.com/xxweka/MIV-head.
Chinese: 本研究提出MIV-head方法,将跨域小样本学习视为多实例验证任务,无需微调主干网络即可实现与先进方法相媲美的准确率,且适应成本显著降低。
English: This study introduces the MIV-head, a novel approach for cross-domain few-shot learning that treats classification as multiple instance verification tasks, achieving competitive accuracy without fine-tuning backbones and at a lower adaptation cost compared to state-of-the-art methods.
Authors:Jian Wang, Qiongying Ni, Hongkui Yu, Ruixuan Yao, Jinqiao Ying, Bin Zhang, Xingyi Yang, Jin Peng, Jiongquan Chen, Junxuan Yu, Wenlong Shi, Chaoyu Chen, Zhongnuo Yan, Mingyuan Luo, Gaocheng Cai, Dong Ni, Jing Lu, Xin Yang
Abstract:
Accurate fetal birth weight (FBW) estimation is essential for optimizing delivery decisions and reducing perinatal mortality. However, clinical methods for FBW estimation are inefficient, operator-dependent, and challenging to apply in cases of complex fetal anatomy. Existing deep learning methods are based on 2D standard ultrasound (US) images or videos that lack spatial information, limiting their prediction accuracy. In this study, we propose the first method for directly estimating FBW from 3D fetal US volumes. Our approach integrates a multi-scale feature fusion network (MFFN) and a synthetic sample-based learning framework (SSLF). The MFFN effectively extracts and fuses multi-scale features under sparse supervision by incorporating channel attention, spatial attention, and a ranking-based loss function. SSLF generates synthetic samples by simply combining fetal head and abdomen data from different fetuses, utilizing semi-supervised learning to improve prediction performance. Experimental results demonstrate that our method achieves superior performance, with a mean absolute error of $166.4\pm155.9$ $g$ and a mean absolute percentage error of $5.1\pm4.6$%, outperforming existing methods and approaching the accuracy of a senior doctor. Code is available at: https://github.com/Qioy-i/EFW.
中文: 本研究首次提出通过三维超声体积直接估算胎儿体重的方法,结合多尺度特征融合网络和合成样本学习框架,实现了超越现有方法并接近专家水平的预测精度。
English: This study introduces the first method for directly estimating fetal birth weight from 3D ultrasound volumes using a multi-scale feature fusion network and synthetic sample learning, achieving superior accuracy that rivals expert assessments.
Authors:Geng Zhang, Shenggan Cheng, Xuanlei Zhao, Ziming Liu, Yang You
Abstract:
As transformer sequence lengths grow, existing pipeline parallelisms incur suboptimal performance due to the quadratic attention computation and the substantial memory overhead. To relieve these challenges, we propose HelixPipe, a novel pipeline parallelism for long sequence transformer training. First, HelixPipe introduces attention parallel partition, which schedules attention computations of different micro batches across different pipeline stages in parallel, reducing pipeline bubbles. Second, it employs a two-fold first-in-last-out micro batch schedule to balance memory usage and overlap communication with computation. Additionally, HelixPipe utilizes recomputation without attention and chunked MLP to mitigate fragmentation and enable longer sequences. Experiments demonstrate that HelixPipe gains increasing advantages with longer sequence lengths, and outperforms existing methods in throughput and scalability across varying pipeline sizes, model sizes, and cluster configurations. Notably, it achieves a 26\% speedup over baseline methods when training a 7B model with 128k sequence length on 64 H20 GPUs. Code is available at https://github.com/code-tunnel/Megatron-LM/tree/dev.
中文:HelixPipe是一种新颖的流水线并行方法,通过并行化注意力计算和优化内存使用,有效提升长序列Transformer训练的性能,相比现有方法具有显著优势。
English: HelixPipe is a novel pipeline parallelism method that enhances transformer training for long sequences by parallelizing attention computations and optimizing memory usage, achieving significant performance improvements over existing approaches.
Authors:Yingping Liang, Yutao Hu, Wenqi Shao, Ying Fu
Abstract:
Feature matching plays a fundamental role in many computer vision tasks, yet existing methods heavily rely on scarce and clean multi-view image collections, which constrains their generalization to diverse and challenging scenarios. Moreover, conventional feature encoders are typically trained on single-view 2D images, limiting their capacity to capture 3D-aware correspondences. In this paper, we propose a novel two-stage framework that lifts 2D images to 3D space, named as \textbf{Lift to Match (L2M)}, taking full advantage of large-scale and diverse single-view images. To be specific, in the first stage, we learn a 3D-aware feature encoder using a combination of multi-view image synthesis and 3D feature Gaussian representation, which injects 3D geometry knowledge into the encoder. In the second stage, a novel-view rendering strategy, combined with large-scale synthetic data generation from single-view images, is employed to learn a feature decoder for robust feature matching, thus achieving generalization across diverse domains. Extensive experiments demonstrate that our method achieves superior generalization across zero-shot evaluation benchmarks, highlighting the effectiveness of the proposed framework for robust feature matching.
Chinese: 提出的Lift to Match (L2M)框架通过多视图合成和3D高斯表示学习3D感知特征编码器,再结合新视角渲染与合成数据实现跨领域泛化,显著提升了特征匹配的鲁棒性。
English: The proposed Lift to Match (L2M) framework enhances feature matching by first learning a 3D-aware feature encoder through multi-view synthesis and 3D Gaussian representation, then employing novel-view rendering with synthetic data to achieve robust generalization across diverse domains.
Authors:Geng Zhang, Yuxuan Han, Yuxuan Lou, Wangbo Zhao, Yiqi Zhang, Yang You
Abstract:
Mixture-of-Experts (MoE) enables efficient scaling of large language models by activating only a subset of experts per input token. However, deploying MoE-based models incurs significant memory overhead due to the need to retain all experts in memory. While structured pruning is promising to reduce memory costs, existing methods often show suboptimal performance and unstable degradation in three dimensions: model architectures, calibration data sources, and calibration sample sizes. This paper proposes Mixture-of-Novices-and-Experts (MoNE), a novel expert pruning method that replaces redundant experts with lightweight novices to achieve effective and robust model compression. MoNE evaluates expert redundancy based on two metrics: access frequency and output variance. Experts exhibiting low usage and stable outputs are pruned and replaced with lightweight novices-unbiased estimations of their original outputs-minimizing performance degradation. Extensive experiments demonstrate that MoNE consistently outperforms baseline methods with minimal accuracy degradation across the three dimensions, confirming its effectiveness and robustness. Notably, it improves the average zero shot accuracy across nine downstream tasks by up to 2.71 under 25\% pruning ratio and 3.61 under 50\% pruning. The code is available at https://github.com/zxgx/mode-pd.
中文: 本文提出混合新手与专家(MoNE)方法,通过用轻量级新手替代冗余专家来实现稳健的模型压缩,在多种架构和数据条件下均以最小精度损失超越基线方法。
English: The paper introduces Mixture-of-Novices-and-Experts (MoNE), an expert pruning method that replaces redundant experts with lightweight novices to achieve robust model compression, outperforming baselines with minimal accuracy loss across various architectures and data conditions.
Authors:Jing Ren, Wenhao Zhou, Bowen Li, Mujie Liu, Nguyen Linh Dan Le, Jiade Cen, Liping Chen, Ziqi Xu, Xiwei Xu, Xiaodong Li
Abstract:
Implicit Sentiment Analysis (ISA) aims to infer sentiment that is implied rather than explicitly stated, requiring models to perform deeper reasoning over subtle contextual cues. While recent prompting-based methods using Large Language Models (LLMs) have shown promise in ISA, they often rely on majority voting over chain-of-thought (CoT) reasoning paths without evaluating their causal validity, making them susceptible to internal biases and spurious correlations. To address this challenge, we propose CAPITAL, a causal prompting framework that incorporates front-door adjustment into CoT reasoning. CAPITAL decomposes the overall causal effect into two components: the influence of the input prompt on the reasoning chains, and the impact of those chains on the final output. These components are estimated using encoder-based clustering and the NWGM approximation, with a contrastive learning objective used to better align the encoder's representation with the LLM's reasoning space. Experiments on benchmark ISA datasets with three LLMs demonstrate that CAPITAL consistently outperforms strong prompting baselines in both accuracy and robustness, particularly under adversarial conditions. This work offers a principled approach to integrating causal inference into LLM prompting and highlights its benefits for bias-aware sentiment reasoning. The source code and case study are available at: https://github.com/whZ62/CAPITAL.
中文摘要:提出的CAPITAL框架通过将因果推理融入思维链过程,显著提升了隐式情感分析的准确性和鲁棒性,在多种大语言模型上均优于现有方法。
English Summary: The proposed CAPITAL framework enhances implicit sentiment analysis by integrating causal inference with chain-of-thought reasoning, demonstrating superior accuracy and robustness across multiple large language models.
Authors:Jianhao Xie, Ziang Zhang, Zhenyu Weng, Yuesheng Zhu, Guibo Luo
Abstract:
Recent advancements in deep learning for medical image segmentation are often limited by the scarcity of high-quality training data.While diffusion models provide a potential solution by generating synthetic images, their effectiveness in medical imaging remains constrained due to their reliance on large-scale medical datasets and the need for higher image quality. To address these challenges, we present MedDiff-FT, a controllable medical image generation method that fine-tunes a diffusion foundation model to produce medical images with structural dependency and domain specificity in a data-efficient manner. During inference, a dynamic adaptive guiding mask enforces spatial constraints to ensure anatomically coherent synthesis, while a lightweight stochastic mask generator enhances diversity through hierarchical randomness injection. Additionally, an automated quality assessment protocol filters suboptimal outputs using feature-space metrics, followed by mask corrosion to refine fidelity. Evaluated on five medical segmentation datasets,MedDiff-FT's synthetic image-mask pairs improve SOTA method's segmentation performance by an average of 1% in Dice score. The framework effectively balances generation quality, diversity, and computational efficiency, offering a practical solution for medical data augmentation. The code is available at https://github.com/JianhaoXie1/MedDiff-FT.
中文摘要:MedDiff-FT提出了一种可控的医学图像生成方法,通过微调扩散基础模型,在保证解剖结构一致性的同时高效生成具有领域特性的医学图像,在五个分割数据集上使现有最佳方法的性能平均提升1% Dice分数。
English Summary: MedDiff-FT is a novel diffusion-based method that fine-tunes foundation models to generate high-quality medical images with anatomical coherence and domain specificity, improving segmentation performance by 1% Dice score across five datasets while maintaining computational efficiency.
Authors:Chuyan Zhang, Kefan Wang, Yun Gu
Abstract:
Low-Rank Adaptation (LoRA) has proven effective in reducing computational costs while maintaining performance comparable to fully fine-tuned foundation models across various tasks. However, its fixed low-rank structure restricts its adaptability in scenarios with substantial domain gaps, where higher ranks are often required to capture domain-specific complexities. Current adaptive LoRA methods attempt to overcome this limitation by dynamically expanding or selectively allocating ranks, but these approaches frequently depend on computationally intensive techniques such as iterative pruning, rank searches, or additional regularization. To address these challenges, we introduce Stable Rank-Guided Low-Rank Adaptation (SR-LoRA), a novel framework that utilizes the stable rank of pre-trained weight matrices as a natural prior for layer-wise rank allocation. By leveraging the stable rank, which reflects the intrinsic dimensionality of the weights, SR-LoRA enables a principled and efficient redistribution of ranks across layers, enhancing adaptability without incurring additional search costs. Empirical evaluations on few-shot tasks with significant domain gaps show that SR-LoRA consistently outperforms recent adaptive LoRA variants, achieving a superior trade-off between performance and efficiency. Our code is available at https://github.com/EndoluminalSurgicalVision-IMR/SR-LoRA.
Chinese: SR-LoRA提出了一种基于稳定秩指导的低秩自适应框架,通过利用预训练权重矩阵的内在维度实现层级间秩的智能分配,在存在显著领域差异的任务中无需额外搜索成本即可获得更优的性能与效率平衡。
English: SR-LoRA introduces a stable rank-guided framework that efficiently reallocates ranks across layers using pre-trained weight matrices' intrinsic dimensionality, achieving superior performance and efficiency in domain-gap tasks without additional search costs.
Authors:Siyou Li, Pengyao Qin, Huanan Wu, Dong Nie, Arun J. Thirunavukarasu, Juntao Yu, Le Zhang
Abstract:
Automated radiology report generation (RRG) aims to produce detailed textual reports from clinical imaging, such as computed tomography (CT) scans, to improve the accuracy and efficiency of diagnosis and provision of management advice. RRG is complicated by two key challenges: (1) inherent complexity in extracting relevant information from imaging data under resource constraints, and (2) difficulty in objectively evaluating discrepancies between model-generated and expert-written reports. To address these challenges, we propose $μ^2$LLM, a $\underline{\textbf{mu}}$ltiscale $\underline{\textbf{mu}}$ltimodal large language models for RRG tasks. The novel $μ^2$Tokenizer, as an intermediate layer, integrates multi-modal features from the multiscale visual tokenizer and the text tokenizer, then enhances report generation quality through direct preference optimization (DPO), guided by GREEN-RedLlama. Experimental results on four large CT image-report medical datasets demonstrate that our method outperforms existing approaches, highlighting the potential of our fine-tuned $μ^2$LLMs on limited data for RRG tasks. At the same time, for prompt engineering, we introduce a five-stage, LLM-driven pipeline that converts routine CT reports into paired visual-question-answer triples and citation-linked reasoning narratives, creating a scalable, high-quality supervisory corpus for explainable multimodal radiology LLM. All code, datasets, and models will be publicly available in our official repository. https://github.com/Siyou-Li/u2Tokenizer
中文: 提出的μ²LLM模型通过新型分词器整合多尺度多模态特征,结合优化方法显著提升了放射学报告的自动生成质量,在CT数据集上表现优于现有方法,并建立了可扩展的生成式训练数据管道。
English: The proposed μ²LLM model enhances automated radiology report generation by integrating multiscale multimodal features through a novel tokenizer and optimization method, outperforming existing approaches on CT datasets while also introducing a scalable pipeline for creating explainable training data.
Authors:Yusuke Tanaka, Alvin Zhu, Quanyou Wang, Dennis Hong
Abstract:
Reinforcement learning (RL) has enabled advances in humanoid robot locomotion, yet most learning frameworks do not account for mechanical intelligence embedded in parallel actuation mechanisms due to limitations in simulator support for closed kinematic chains. This omission can lead to inaccurate motion modeling and suboptimal policies, particularly for robots with high actuation complexity. This paper presents general formulations and simulation methods for three types of parallel mechanisms: a differential pulley, a five-bar linkage, and a four-bar linkage, and trains a parallel-mechanism aware policy through an end-to-end curriculum RL framework for BRUCE, a kid-sized humanoid robot. Unlike prior approaches that rely on simplified serial approximations, we simulate all closed-chain constraints natively using GPU-accelerated MuJoCo (MJX), preserving the hardware's mechanical nonlinear properties during training. We benchmark our RL approach against a model predictive controller (MPC), demonstrating better surface generalization and performance in real-world zero-shot deployment. This work highlights the computational approaches and performance benefits of fully simulating parallel mechanisms in end-to-end learning pipelines for legged humanoids. Project codes with parallel mechanisms: https://github.com/alvister88/og_bruce
中文摘要:本文提出了一种强化学习框架,通过GPU加速的MuJoCo完整模拟人形机器人中的并联机构,保留了机械非线性特性,相比传统控制器在实际应用中展现出更优的性能。
English Summary: This paper introduces a reinforcement learning framework that fully simulates parallel mechanisms in humanoid robots, using GPU-accelerated MuJoCo to preserve mechanical nonlinearities and demonstrating superior real-world performance compared to traditional controllers.
Authors:Zhuangzhuang Dai, Vincent Gbouna Zakka, Luis J. Manso, Chen Li
Abstract:
Enabling robots to understand human gaze target is a crucial step to allow capabilities in downstream tasks, for example, attention estimation and movement anticipation in real-world human-robot interactions. Prior works have addressed the in-frame target localization problem with data-driven approaches by carefully removing out-of-frame samples. Vision-based gaze estimation methods, such as OpenFace, do not effectively absorb background information in images and cannot predict gaze target in situations where subjects look away from the camera. In this work, we propose a system to address the problem of 360-degree gaze target estimation from an image in generalized visual scenes. The system, named GazeTarget360, integrates conditional inference engines of an eye-contact detector, a pre-trained vision encoder, and a multi-scale-fusion decoder. Cross validation results show that GazeTarget360 can produce accurate and reliable gaze target predictions in unseen scenarios. This makes a first-of-its-kind system to predict gaze targets from realistic camera footage which is highly efficient and deployable. Our source code is made publicly available at: https://github.com/zdai257/DisengageNet.
Chinese: 本文提出了GazeTarget360系统,通过整合眼神接触检测器、预训练视觉编码器和多尺度融合解码器,实现了从图像中进行360度视线目标估计,在未知场景中产生准确预测,超越了OpenFace等现有方法。
English: This paper introduces GazeTarget360, a novel system that enables 360-degree gaze target estimation from images by integrating an eye-contact detector, a pre-trained vision encoder, and a multi-scale-fusion decoder, achieving accurate predictions in unseen scenarios and outperforming prior methods like OpenFace.
Authors:Sanchit Ahuja, Praneetha Vaddamanu, Barun Patra
Abstract:
Despite recent advances in Language Reasoning Models (LRMs), most research focuses solely on English, even though many models are pretrained on multilingual data. In this work, we investigate: Is English the most token-efficient language for reasoning? We evaluate three open-source RLMs: DeepSeek R1, Qwen 2.5 and Qwen 3, across four math datasets and seven typologically diverse languages. We find that reasoning in non-English languages not only reduces token usage, but also preserves accuracy. These gains persist even after translating the reasoning traces into English, suggesting genuine shifts in reasoning behavior rather than surface-level linguistic effects. The extent of improvement, however, depends on the models multilingual strength. Our findings motivate a broader view of reasoning in language models, highlighting the potential of multilingual reasoning and the importance of strong multilingual foundations. The code for our work can be found: https://github.com/microsoft/EfficientXLang.
Chinese Summary: 研究发现,在语言模型推理中,使用非英语语言不仅能减少令牌消耗且保持准确性,这种优势在翻译后依然存在,凸显了多语言能力的重要性。
English Summary: Non-English languages can be more token-efficient for reasoning in language models while maintaining accuracy, with benefits persisting even after translation, highlighting the importance of multilingual capabilities.
Authors:Hoang-Dieu Vu, Duc-Nghia Tran, Quang-Tu Pham, Hieu H. Pham, Nicolas Vuillerme, Duc-Tan Tran
Abstract:
This paper introduces Smooth-Distill, a novel self-distillation framework designed to simultaneously perform human activity recognition (HAR) and sensor placement detection using wearable sensor data. The proposed approach utilizes a unified CNN-based architecture, MTL-net, which processes accelerometer data and branches into two outputs for each respective task. Unlike conventional distillation methods that require separate teacher and student models, the proposed framework utilizes a smoothed, historical version of the model itself as the teacher, significantly reducing training computational overhead while maintaining performance benefits. To support this research, we developed a comprehensive accelerometer-based dataset capturing 12 distinct sleep postures across three different wearing positions, complementing two existing public datasets (MHealth and WISDM). Experimental results show that Smooth-Distill consistently outperforms alternative approaches across different evaluation scenarios, achieving notable improvements in both human activity recognition and device placement detection tasks. This method demonstrates enhanced stability in convergence patterns during training and exhibits reduced overfitting compared to traditional multitask learning baselines. This framework contributes to the practical implementation of knowledge distillation in human activity recognition systems, offering an effective solution for multitask learning with accelerometer data that balances accuracy and training efficiency. More broadly, it reduces the computational cost of model training, which is critical for scenarios requiring frequent model updates or training on resource-constrained platforms. The code and model are available at https://github.com/Kuan2vn/smooth\_distill.
中文: Smooth-Distill是一种自蒸馏框架,通过统一的CNN架构同时进行人体活动识别和传感器位置检测,在保持高精度的同时显著降低了计算成本。
English: Smooth-Distill is a self-distillation framework that efficiently performs human activity recognition and sensor placement detection using a unified CNN architecture, reducing computational costs while maintaining high accuracy.
Authors:Mehmet Yigit Avci, Pedro Borges, Paul Wright, Mehmet Yigitsoy, Sebastien Ourselin, Jorge Cardoso
Abstract:
Accurate interpretation of Magnetic Resonance Imaging scans in clinical systems is based on a precise understanding of image contrast. This contrast is primarily governed by acquisition parameters, such as echo time and repetition time, which are stored in the DICOM metadata. To simplify contrast identification, broad labels such as T1-weighted or T2-weighted are commonly used, but these offer only a coarse approximation of the underlying acquisition settings. In many real-world datasets, such labels are entirely missing, leaving raw acquisition parameters as the only indicators of contrast. Adding to this challenge, the available metadata is often incomplete, noisy, or inconsistent. The lack of reliable and standardized metadata complicates tasks such as image interpretation, retrieval, and integration into clinical workflows. Furthermore, robust contrast-aware representations are essential to enable more advanced clinical applications, such as achieving modality-invariant representations and data harmonization. To address these challenges, we propose MR-CLIP, a multimodal contrastive learning framework that aligns MR images with their DICOM metadata to learn contrast-aware representations, without relying on manual labels. Trained on a diverse clinical dataset that spans various scanners and protocols, MR-CLIP captures contrast variations across acquisitions and within scans, enabling anatomy-invariant representations. We demonstrate its effectiveness in cross-modal retrieval and contrast classification, highlighting its scalability and potential for further clinical applications. The code and weights are publicly available at https://github.com/myigitavci/MR-CLIP.
中文: MR-CLIP是一种多模态框架,通过将磁共振图像与DICOM元数据对齐来学习对比度感知表征,无需人工标注即可实现跨模态检索和对比度分类等稳健应用。
English: MR-CLIP is a multimodal framework that aligns MRI images with DICOM metadata to learn contrast-aware representations, enabling robust applications like cross-modal retrieval and contrast classification without manual labels.
Authors:Phoomraphee Luenam, Andreas Spanopoulos, Amit Sant, Thomas Hofmann, Sotiris Anagnostidis, Sidak Pal Singh
Abstract:
Model fusion aims to combine the knowledge of multiple models by creating one representative model that captures the strengths of all of its parents. However, this process is non-trivial due to differences in internal representations, which can stem from permutation invariance, random initialization, or differently distributed training data. We present a novel, neuron-centric family of model fusion algorithms designed to integrate multiple trained neural networks into a single network effectively regardless of training data distribution. Our algorithms group intermediate neurons of parent models to create target representations that the fused model approximates with its corresponding sub-network. Unlike prior approaches, our approach incorporates neuron attribution scores into the fusion process. Furthermore, our algorithms can generalize to arbitrary layer types. Experimental results on various benchmark datasets demonstrate that our algorithms consistently outperform previous fusion techniques, particularly in zero-shot and non-IID fusion scenarios. The code is available at https://github.com/AndrewSpano/neuron-interpolation-model-fusion.
Chinese: 本文提出了一种以神经元为中心的模型融合方法,通过分组神经元并引入归因评分,将多个神经网络有效整合为单一模型,在多种场景下优于现有技术,且代码已公开。
English: This paper introduces a neuron-centric model fusion method that effectively integrates multiple neural networks into a single model by grouping neurons and incorporating attribution scores, outperforming prior techniques across diverse scenarios with publicly available code.
Authors:Tiexin Qin, Hong Yan, Haoliang Li
Abstract:
Learning the underlying dynamics from data with deep neural networks has shown remarkable potential in modeling various complex physical dynamics. However, current approaches are constrained in their ability to make reliable predictions in a specific domain and struggle with generalizing to unseen systems that are governed by the same general dynamics but differ in environmental characteristics. In this work, we formulate a parameter-efficient method, Fourier Neural Simulator for Dynamical Adaptation (FNSDA), that can readily generalize to new dynamics via adaptation in the Fourier space. Specifically, FNSDA identifies the shareable dynamics based on the known environments using an automatic partition in Fourier modes and learns to adjust the modes specific for each new environment by conditioning on low-dimensional latent systematic parameters for efficient generalization. We evaluate our approach on four representative families of dynamic systems, and the results show that FNSDA can achieve superior or competitive generalization performance compared to existing methods with a significantly reduced parameter cost. Our code is available at https://github.com/WonderSeven/FNSDA.
中文摘要:本研究提出FNSDA方法,通过傅里叶空间自适应实现动力学系统的高效泛化,在四个典型动态系统中以显著降低的参数成本达到优越的泛化性能。
English Summary: The study introduces FNSDA, a parameter-efficient method that generalizes to unseen dynamical systems by adapting Fourier modes, achieving superior performance with reduced parameters across four dynamic systems.
Authors:Thomas Joshi, Shayan Chowdhury, Fatih Uysal
Abstract:
Large Language Models (LLMs) have achieved impressive results on static code-generation benchmarks, but real-world software development unfolds as a continuous stream of evolving issues, fixes, and feature requests. We introduce SWE-Bench-CL, a novel continual learning benchmark built on the human-verified SWE-Bench Verified dataset introduced by OpenAI and Princeton-NLP in 2024. By organizing GitHub issues into chronologically ordered sequences that reflect natural repository evolution, SWE-Bench-CL enables direct evaluation of an agent's ability to accumulate experience, transfer knowledge across tasks, and resist catastrophic forgetting. We complement the dataset with (i) a preliminary analysis of inter-task structural similarity and contextual sensitivity, (ii) an interactive LangGraph-based evaluation framework augmented with a FAISS-backed semantic memory module, and (iii) a suite of specialized continual learning metrics -- including average accuracy, forgetting, forward/backward transfer, tool-use efficiency, and a generalized Composite Continual Learning Score and CL-F-beta score -- to capture the stability-plasticity trade-off. We outline a rigorous experimental protocol comparing memory-enabled and memory-disabled agents across diverse Python repositories. All code and data are publicly available at https://github.com/thomasjoshi/agents-never-forget, providing the community with a reproducible platform for developing more adaptive and robust AI agents in software engineering.
中文摘要:SWE-Bench-CL是一个基于时间顺序GitHub问题序列的新型持续学习基准,通过专业评估指标和交互式框架,系统检验AI智能体在软件工程中积累经验、迁移知识和抗遗忘的持续学习能力。
English Summary: SWE-Bench-CL is a new continual learning benchmark that evaluates AI agents' ability to handle evolving software development tasks through chronological GitHub issue sequences, featuring specialized metrics and an interactive evaluation framework.
Authors:Zhiyu Li, Shichao Song, Chenyang Xi, Hanyu Wang, Chen Tang, Simin Niu, Ding Chen, Jiawei Yang, Chunyu Li, Qingchen Yu, Jihao Zhao, Yezhaohui Wang, Peng Liu, Zehao Lin, Pengyuan Wang, Jiahao Huo, Tianyi Chen, Kai Chen, Kehang Li, Zhen Tao, Huayi Lai, Hao Wu, Bo Tang, Zhenren Wang, Zhaoxin Fan, Ningyu Zhang, Linfeng Zhang, Junchi Yan, Mingchuan Yang, Tong Xu, Wei Xu, Huajun Chen, Haofen Wang, Hongkang Yang, Wentao Zhang, Zhi-Qin John Xu, Siheng Chen, Feiyu Xiong
Abstract:
Large Language Models (LLMs) have become an essential infrastructure for Artificial General Intelligence (AGI), yet their lack of well-defined memory management systems hinders the development of long-context reasoning, continual personalization, and knowledge consistency.Existing models mainly rely on static parameters and short-lived contextual states, limiting their ability to track user preferences or update knowledge over extended periods.While Retrieval-Augmented Generation (RAG) introduces external knowledge in plain text, it remains a stateless workaround without lifecycle control or integration with persistent representations.Recent work has modeled the training and inference cost of LLMs from a memory hierarchy perspective, showing that introducing an explicit memory layer between parameter memory and external retrieval can substantially reduce these costs by externalizing specific knowledge. Beyond computational efficiency, LLMs face broader challenges arising from how information is distributed over time and context, requiring systems capable of managing heterogeneous knowledge spanning different temporal scales and sources. To address this challenge, we propose MemOS, a memory operating system that treats memory as a manageable system resource. It unifies the representation, scheduling, and evolution of plaintext, activation-based, and parameter-level memories, enabling cost-efficient storage and retrieval. As the basic unit, a MemCube encapsulates both memory content and metadata such as provenance and versioning. MemCubes can be composed, migrated, and fused over time, enabling flexible transitions between memory types and bridging retrieval with parameter-based learning. MemOS establishes a memory-centric system framework that brings controllability, plasticity, and evolvability to LLMs, laying the foundation for continual learning and personalized modeling.
中文: 大语言模型因缺乏有效的记忆管理系统而制约了长期推理能力,但提出的MemOS系统以MemCube为基本单元构建统一记忆框架,通过可调度、可演化的记忆操作实现高效存储与检索,为持续学习和个性化建模奠定基础。
English: Large Language Models currently lack effective memory management, limiting long-term reasoning and knowledge consistency, but the proposed MemOS system introduces a unified memory framework with MemCubes as basic units to enable controllable and evolvable memory operations for improved efficiency and adaptability.
Authors:Yunyi Zhao, Wei Zhang, Cheng Xiang, Hongyang Du, Dusit Niyato, Shuhua Gao
Abstract:
This paper introduces DiffCarl, a diffusion-modeled carbon- and risk-aware reinforcement learning algorithm for intelligent operation of multi-microgrid systems. With the growing integration of renewables and increasing system complexity, microgrid communities face significant challenges in real-time energy scheduling and optimization under uncertainty. DiffCarl integrates a diffusion model into a deep reinforcement learning (DRL) framework to enable adaptive energy scheduling under uncertainty and explicitly account for carbon emissions and operational risk. By learning action distributions through a denoising generation process, DiffCarl enhances DRL policy expressiveness and enables carbon- and risk-aware scheduling in dynamic and uncertain microgrid environments. Extensive experimental studies demonstrate that it outperforms classic algorithms and state-of-the-art DRL solutions, with 2.3-30.1% lower operational cost. It also achieves 28.7% lower carbon emissions than those of its carbon-unaware variant and reduces performance variability. These results highlight DiffCarl as a practical and forward-looking solution. Its flexible design allows efficient adaptation to different system configurations and objectives to support real-world deployment in evolving energy systems.
中文:DiffCarl是一种基于扩散模型的强化学习算法,通过自适应调度不确定环境下的能源来优化多微电网运行,同时降低碳排放和运营成本。
English: DiffCarl is a diffusion-modeled reinforcement learning algorithm that optimizes multi-microgrid operations by adaptively scheduling energy under uncertainty while reducing carbon emissions and operational costs.
Authors:Shihan Dou, Shichun Liu, Yuming Yang, Yicheng Zou, Yunhua Zhou, Shuhao Xing, Chenhao Huang, Qiming Ge, Demin Song, Haijun Lv, Songyang Gao, Chengqi Lv, Enyu Zhou, Honglin Guo, Zhiheng Xi, Wenwei Zhang, Qipeng Guo, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Tao Gui, Kai Chen
Abstract:
We offer a novel perspective on reward modeling by formulating it as a policy discriminator, which quantifies the difference between two policies to generate a reward signal, guiding the training policy towards a target policy with desired behaviors. Based on this conceptual insight, we propose a scalable pre-training method named Policy Discriminative Learning (POLAR), which trains a reward model (RM) to discern identical policies and discriminate different ones. Unlike traditional reward modeling methods relying on absolute preferences, POLAR captures the relative difference between one policy and an arbitrary target policy, which is a scalable, high-level optimization objective suitable for modeling generic ranking relationships. Leveraging the POLAR pre-training paradigm, we present a series of RMs with parameter scales from 1.8B to 7B. Empirical results show that POLAR substantially outperforms traditional non-pre-trained methods, significantly enhancing RM performance. For instance, POLAR-7B could improve preference accuracy from 54.8% to 81.0% on STEM tasks and from 57.9% to 85.5% on creative writing tasks compared to SOTA baselines. POLAR also shows robust generalization capabilities in RLHF using Reinforcement Fine-tuning (RFT), providing reliable reward signals and markedly enhancing policy performance--improving LLaMa3.1-8B from an average of 47.36% to 56.33% and Qwen2.5-32B from 64.49% to 70.47% on 20 benchmarks. Moreover, scaling experiments reveal a clear power-law relationship between computation and performance, supported by linear correlation coefficients approaching 0.99. The impressive performance, strong generalization, and scaling properties suggest that POLAR is a promising direction for developing general and strong reward models.
中文摘要:本文提出策略判别学习(POLAR)这一新型奖励建模方法,将奖励信号构建为策略判别器以引导训练策略趋近目标行为,实验表明该方法能显著提升多项任务和模型的性能表现。
English Summary: This paper introduces Policy Discriminative Learning (POLAR), a novel reward modeling approach that frames reward signals as policy discriminators to guide training policies toward desired behaviors, demonstrating substantial performance improvements across various tasks and models.
Authors:Xudong Wang, Lei Feng, Jiacheng Wang, Hongyang Du, Changyuan Zhao, Wenjing Li, Zehui Xiong, Dusit Niyato, Ping Zhang
Abstract:
As a key component of low-altitude economic networks, aerial base stations (AeBSs) provide flexible and reliable wireless coverage to support 6G ultra-reliable and low-latency communication (URLLC) services. However, limited spectrum resources and severe co-channel interference pose significant challenges to the deployment and resource allocation of AeBSs. To address these limitations, this paper proposes a novel rate-splitting multiple access (RSMA)-enabled transmission design to flexibly manage interference and effectively enhance URLLC services in spectrum-constrained multi-AeBS networks. On this basis, we formulate a joint optimization problem involving AeBS deployment, user association, and resource allocation to maximize the achievable sum rate and coverage of the total system. Given the NP-hard nature of the problem and the highly dynamic environment, we propose a novel alternating optimization framework based on the generative graph diffusion models. Specifically, we model AeBSs and ground users as graph nodes, then we employ a discrete graph generation process solved via denoising diffusion is employed to explore the combinatorial space of deployment and association strategies. Moreover, the algorithm adopts the successive convex approximation (SCA) method to optimize AeBS beamforming and RSMA rate allocation under finite blocklength constraints. Extensive simulations demonstrate that the proposed algorithm outperforms existing methods in terms of convergence speed, sum rate, and coverage, while also exhibiting robust performance under varying network densities and interference levels.
中文摘要:本文提出了一种基于速率分割多址接入(RSMA)的传输方案,并结合生成式图扩散模型的交替优化框架,通过联合优化空中基站部署、用户关联和资源分配,有效提升频谱受限多基站网络中的超可靠低时延通信服务性能。
English Summary: This paper introduces a rate-splitting multiple access (RSMA) transmission design combined with an alternating optimization framework using generative graph diffusion models to enhance URLLC services in multi-aerial base station networks by jointly optimizing deployment, association, and resource allocation.
Authors:Zhiheng Xi, Guanyu Li, Yutao Fan, Honglin Guo, Yufang Liu, Xiaoran Fan, Jiaqi Liu, Jingchao Ding, Wangmeng Zuo, Zhenfei Yin, Lei Bai, Tao Ji, Tao Gui, Qi Zhang, Philip Torr, Xuanjing Huang
Abstract:
In this paper, we introduce BMMR, a large-scale bilingual, multimodal, multi-disciplinary reasoning dataset for the community to develop and evaluate large multimodal models (LMMs). BMMR comprises 110k college-level questions spanning 300 UNESCO-defined subjects, spanning diverse formats-multiple-choice, fill-in-the-blank, and open-ended QA-and sourced from both print and digital media such as books, exams, and quizzes. All data are curated and filtered via a human-in-the-loop and scalable framework, and each instance is paired with a high-quality reasoning path. The dataset is organized into two parts: BMMR-Eval that comprises 20,458 high-quality instances to comprehensively assess LMMs' knowledge and reasoning across multiple disciplines in both Chinese and English; and BMMR-Train that contains 88,991 instances to support further research and development, extending the current focus on mathematical reasoning to diverse disciplines and domains. In addition, we propose the process-based multi-discipline verifier (i.e., BMMR-Verifier) for accurate and fine-grained evaluation of reasoning paths. Extensive experiments on 24 models reveal that (i) even SOTA models (e.g., o3 and Gemini-2.5-Pro) leave substantial headroom on BMMR-Eval; (ii) reasoning models exhibit discipline bias and outperform LMMs only on specific subjects; (iii) open-source models still trail their proprietary counterparts; and (iv) fine-tuning on BMMR-Train narrows this gap. Additionally, we conduct reasoning-chain analyses using BMMR-Verifier and other in-depth studies, uncovering the challenges LMMs currently face in multidisciplinary reasoning. We will release the data, and we hope our work can offer insights and contributions to the community.
中文: 本文介绍了BMMR,一个大规模双语多模态数据集,包含11万个大学级别问题,涵盖300个学科,通过精心设计的训练和评估子集推动大型多模态模型发展,实验表明即使顶尖模型仍存在明显性能差距。
English: This paper introduces BMMR, a large-scale bilingual multimodal dataset with 110k college-level questions across 300 subjects, designed to advance and evaluate large multimodal models through curated training and evaluation subsets, revealing significant performance gaps even in state-of-the-art models.
Authors:Feng Hong, Geng Yu, Yushi Ye, Haicheng Huang, Huangjie Zheng, Ya Zhang, Yanfeng Wang, Jiangchao Yao
Abstract:
Diffusion Large Language Models (DLLMs) have emerged as a compelling alternative to Autoregressive models, designed for fast parallel generation. However, existing DLLMs are plagued by a severe quality-speed trade-off, where faster parallel decoding leads to significant performance degradation. We attribute this to the irreversibility of standard decoding in DLLMs, which is easily polarized into the wrong decoding direction along with early error context accumulation. To resolve this, we introduce Wide-In, Narrow-Out (WINO), a training-free decoding algorithm that enables revokable decoding in DLLMs. WINO employs a parallel draft-and-verify mechanism, aggressively drafting multiple tokens while simultaneously using the model's bidirectional context to verify and re-mask suspicious ones for refinement. Verified in open-source DLLMs like LLaDA and MMaDA, WINO is shown to decisively improve the quality-speed trade-off. For instance, on the GSM8K math benchmark, it accelerates inference by 6$\times$ while improving accuracy by 2.58%; on Flickr30K captioning, it achieves a 10$\times$ speedup with higher performance. More comprehensive experiments are conducted to demonstrate the superiority and provide an in-depth understanding of WINO.
中文摘要:WINO解码算法通过可撤销并行解码机制,有效解决了扩散大语言模型中的质量-速度权衡问题,在多项基准测试中实现了显著加速与性能提升。
English Summary: The WINO decoding algorithm addresses the quality-speed trade-off in Diffusion Large Language Models by enabling revokable parallel decoding, achieving significant acceleration and performance improvements across benchmarks.
Authors:Feng Hong, Geng Yu, Yushi Ye, Haicheng Huang, Huangjie Zheng, Ya Zhang, Yanfeng Wang, Jiangchao Yao
Abstract:
Diffusion Large Language Models (DLLMs) have emerged as a compelling alternative to Autoregressive models, designed for fast parallel generation. However, existing DLLMs are plagued by a severe quality-speed trade-off, where faster parallel decoding leads to significant performance degradation. We attribute this to the irreversibility of standard decoding in DLLMs, which is easily polarized into the wrong decoding direction along with early error context accumulation. To resolve this, we introduce Wide-In, Narrow-Out (WINO), a training-free decoding algorithm that enables revokable decoding in DLLMs. WINO employs a parallel draft-and-verify mechanism, aggressively drafting multiple tokens while simultaneously using the model's bidirectional context to verify and re-mask suspicious ones for refinement. Verified in open-source DLLMs like LLaDA and MMaDA, WINO is shown to decisively improve the quality-speed trade-off. For instance, on the GSM8K math benchmark, it accelerates inference by 6$\times$ while improving accuracy by 2.58%; on Flickr30K captioning, it achieves a 10$\times$ speedup with higher performance. More comprehensive experiments are conducted to demonstrate the superiority and provide an in-depth understanding of WINO.
中文摘要:WINO解码算法通过可撤销并行解码机制,有效解决了扩散大语言模型中的质量-速度权衡问题,在多项基准测试中实现了显著加速与性能提升。
English Summary: The WINO decoding algorithm addresses the quality-speed trade-off in Diffusion Large Language Models by enabling revokable parallel decoding, achieving significant acceleration and performance improvements across benchmarks.
Authors:Keita Yoneda, Kento Kawaharazuka, Temma Suzuki, Takahiro Hattori, Kei Okada
Abstract:
In recent years, advancements in hardware have enabled quadruped robots to operate with high power and speed, while robust locomotion control using reinforcement learning (RL) has also been realized. As a result, expectations are rising for the automation of tasks such as material transport and exploration in unknown environments. However, autonomous locomotion in rough terrains with significant height variations requires vertical movement, and robots capable of performing such movements stably, along with their control methods, have not yet been fully established. In this study, we developed the quadruped robot KLEIYN, which features a waist joint, and aimed to expand quadruped locomotion by enabling chimney climbing through RL. To facilitate the learning of vertical motion, we introduced Contact-Guided Curriculum Learning (CGCL). As a result, KLEIYN successfully climbed walls ranging from 800 mm to 1000 mm in width at an average speed of 150 mm/s, 50 times faster than conventional robots. Furthermore, we demonstrated that the introduction of a waist joint improves climbing performance, particularly enhancing tracking ability on narrow walls.
中文总结:近期硬件与强化学习的进步使配备腰部关节的四足机器人KLEIYN通过接触引导课程学习,实现了以传统机器人50倍的速度完成垂直烟囱攀爬,显著提升了在狭窄墙面的追踪性能。
English Summary: Recent hardware and reinforcement learning advances have enabled high-speed quadruped robots like KLEIYN with a waist joint to master vertical chimney climbing at unprecedented speeds using Contact-Guided Curriculum Learning, significantly outperforming conventional robots.
Authors:Kento Kawaharazuka, Shintaro Inoue, Yuta Sahara, Keita Yoneda, Temma Suzuki, Kei Okada
Abstract:
Tendon-driven mechanisms are useful from the perspectives of variable stiffness, redundant actuation, and lightweight design, and they are widely used, particularly in hands, wrists, and waists of robots. The design of these wire arrangements has traditionally been done empirically, but it becomes extremely challenging when dealing with complex structures. Various studies have attempted to optimize wire arrangement, but many of them have oversimplified the problem by imposing conditions such as restricting movements to a 2D plane, keeping the moment arm constant, or neglecting wire crossings. Therefore, this study proposes a three-dimensional wire arrangement optimization that takes wire crossings into account. We explore wire arrangements through a multi-objective black-box optimization method that ensures wires do not cross while providing sufficient joint torque along a defined target trajectory. For a 3D link structure, we optimize the wire arrangement under various conditions, demonstrate its effectiveness, and discuss the obtained design solutions.
Chinese: 本研究提出了一种考虑线缆交叉的三维线缆布局优化方法,通过多目标黑盒优化确保线缆无交叉且沿目标轨迹提供足够的关节力矩,适用于复杂机器人结构。
English: This study introduces a three-dimensional wire arrangement optimization method that accounts for wire crossings, using multi-objective black-box optimization to ensure non-crossing wires and adequate joint torque along a target trajectory for complex robotic structures.
Authors:Yasunori Toshimitsu, Kento Kawaharazuka, Akihiro Miki, Kei Okada, Masayuki Inaba
Abstract:
For robots to move in the real world, they must first correctly understand the state of its own body and the tools that it holds. In this research, we propose DIJE, an algorithm to estimate the image Jacobian for every pixel. It is based on an optical flow calculation and a simplified Kalman Filter that can be efficiently run on the whole image in real time. It does not rely on markers nor knowledge of the robotic structure. We use the DIJE in a self-recognition process which can robustly distinguish between movement by the robot and by external entities, even when the motion overlaps. We also propose a visual servoing controller based on DIJE, which can learn to control the robot's body to conduct reaching movements or bimanual tool-tip control. The proposed algorithms were implemented on a physical musculoskeletal robot and its performance was verified. We believe that such global estimation of the visuomotor policy has the potential to be extended into a more general framework for manipulation.
中文摘要:本研究提出DIJE算法,通过光流计算和简化卡尔曼滤波器实时估计每个像素的图像雅可比矩阵,使机器人无需标记或结构知识即可区分自身与外部运动,并实现视觉伺服控制。
English Summary: This research introduces DIJE, a real-time algorithm that estimates image Jacobians for every pixel using optical flow and a simplified Kalman Filter, enabling robots to distinguish self-motion from external motion and perform visual servoing tasks without markers or prior structural knowledge.
Authors:Juntao Liu, Liqiang Niu, Wenchao Chen, Jie Zhou, Fandong Meng
Abstract:
Existing visual token compression methods for Multimodal Large Language Models (MLLMs) predominantly operate as post-encoder modules, limiting their potential for efficiency gains. To address this limitation, we propose LaCo (Layer-wise Visual Token Compression), a novel framework that enables effective token compression within the intermediate layers of the vision encoder. LaCo introduces two core components: 1) a layer-wise pixel-shuffle mechanism that systematically merges adjacent tokens through space-to-channel transformations, and 2) a residual learning architecture with non-parametric shortcuts that preserves critical visual information during compression. Extensive experiments indicate that our LaCo outperforms all existing methods when compressing tokens in the intermediate layers of the vision encoder, demonstrating superior effectiveness. In addition, compared to external compression, our method improves training efficiency beyond 20% and inference throughput over 15% while maintaining strong performance.
中文摘要:提出的LaCo框架在视觉编码器的中间层实现分层视觉令牌压缩,通过空间到通道转换和残差学习保留关键信息,相比现有方法训练效率提升超20%、推理吞吐量提高超15%且性能优异。
English Summary: The proposed LaCo framework introduces layer-wise visual token compression within the vision encoder's intermediate layers, outperforming existing methods by enhancing training efficiency by over 20% and inference throughput by more than 15% while maintaining performance.
Authors:Zongmeng Zhang, Wengang Zhou, Jie Zhao, Houqiang Li
Abstract:
Despite the impressive capabilities of multimodal large language models (MLLMs) in vision-language tasks, they are prone to hallucinations in real-world scenarios. This paper investigates the hallucination phenomenon in MLLMs from the perspective of modality conflict. Unlike existing works focusing on the conflicts between model responses and inputs, we study the inherent conflicts in inputs from different modalities that place MLLMs in a dilemma and directly lead to hallucinations. We formally define the modality conflict and construct a dataset named Multimodal Modality Conflict (MMMC) to simulate this phenomenon in vision-language tasks. Three methods based on prompt engineering, supervised fine-tuning, and reinforcement learning are proposed to alleviate the hallucination caused by modality conflict. Extensive experiments are conducted on the MMMC dataset to analyze the merits and demerits of these methods. Our results show that the reinforcement learning method achieves the best performance in mitigating the hallucination under modality conflict, while the supervised fine-tuning method shows promising and stable performance. Our work sheds light on the unnoticed modality conflict that leads to hallucinations and provides more insights into the robustness of MLLMs.
中文: 本文研究了多模态大语言模型中不同输入模态之间的冲突如何导致幻觉,并提出了三种缓解方法,其中强化学习方法表现最佳。
English: This paper explores how conflicts between different input modalities cause hallucinations in multimodal large language models and proposes three mitigation methods, with reinforcement learning showing the best performance.
Authors:Haoxiang Luo, Yinqiu Liu, Ruichen Zhang, Jiacheng Wang, Gang Sun, Dusit Niyato, Hongfang Yu, Zehui Xiong, Xianbin Wang, Xuemin Shen
Abstract:
Edge computing enables real-time data processing closer to its source, thus improving the latency and performance of edge-enabled AI applications. However, traditional AI models often fall short when dealing with complex, dynamic tasks that require advanced reasoning and multimodal data processing. This survey explores the integration of multi-LLMs (Large Language Models) to address this in edge computing, where multiple specialized LLMs collaborate to enhance task performance and adaptability in resource-constrained environments. We review the transition from conventional edge AI models to single LLM deployment and, ultimately, to multi-LLM systems. The survey discusses enabling technologies such as dynamic orchestration, resource scheduling, and cross-domain knowledge transfer that are key for multi-LLM implementation. A central focus is on trusted multi-LLM systems, ensuring robust decision-making in environments where reliability and privacy are crucial. We also present multimodal multi-LLM architectures, where multiple LLMs specialize in handling different data modalities, such as text, images, and audio, by integrating their outputs for comprehensive analysis. Finally, we highlight future directions, including improving resource efficiency, trustworthy governance multi-LLM systems, while addressing privacy, trust, and robustness concerns. This survey provides a valuable reference for researchers and practitioners aiming to leverage multi-LLM systems in edge computing applications.
中文: 本综述探讨了多大型语言模型系统如何通过协同推理和多模态处理来增强边缘计算,以克服传统AI模型在动态、资源受限环境中的局限性。
English: This survey examines how multi-LLM systems enhance edge computing by enabling collaborative reasoning and multimodal processing to overcome the limitations of traditional AI models in dynamic, resource-constrained environments.
Authors:Taoyong Cui, Zhongyao Wang, Dongzhan Zhou, Yuqiang Li, Lei Bai, Wanli Ouyang, Mao Su, Shufei Zhang
Abstract:
Machine learning interatomic potentials (MLIPs) enable efficient molecular dynamics (MD) simulations with ab initio accuracy and have been applied across various domains in physical science. However, their performance often relies on large-scale labeled training data. While existing pretraining strategies can improve model performance, they often suffer from a mismatch between the objectives of pretraining and downstream tasks or rely on extensive labeled datasets and increasingly complex architectures to achieve broad generalization. To address these challenges, we propose Iterative Pretraining for Interatomic Potentials (IPIP), a framework designed to iteratively improve the predictive performance of MLIP models. IPIP incorporates a forgetting mechanism to prevent iterative training from converging to suboptimal local minima. Unlike general-purpose foundation models, which frequently underperform on specialized tasks due to a trade-off between generality and system-specific accuracy, IPIP achieves higher accuracy and efficiency using lightweight architectures. Compared to general-purpose force fields, this approach achieves over 80% reduction in prediction error and up to 4x speedup in the challenging Mo-S-O system, enabling fast and accurate simulations.
Chinese: 提出的迭代预训练原子间势能(IPIP)框架通过带有遗忘机制的迭代训练改进机器学习原子间势能,在使用轻量级架构的同时实现了超过80%的误差降低和高达4倍的模拟加速。
English: The proposed Iterative Pretraining for Interatomic Potentials (IPIP) framework enhances machine learning interatomic potentials through iterative training with a forgetting mechanism, achieving over 80% error reduction and up to 4x speedup in simulations while using lightweight architectures.
Authors:Zijie Guo, Jiong Wang, Xiaoyu Yue, Wangxu Wei, Zhe Jiang, Wanghan Xu, Ben Fei, Wenlong Zhang, Xinyu Gu, Lijing Cheng, Jing-Jia Luo, Chao Li, Yaqiang Wang, Tao Chen, Wanli Ouyang, Fenghua Ling, Lei Bai
Abstract:
Modern Earth science is at an inflection point. The vast, fragmented, and complex nature of Earth system data, coupled with increasingly sophisticated analytical demands, creates a significant bottleneck for rapid scientific discovery. Here we introduce EarthLink, the first AI agent designed as an interactive copilot for Earth scientists. It automates the end-to-end research workflow, from planning and code generation to multi-scenario analysis. Unlike static diagnostic tools, EarthLink can learn from user interaction, continuously refining its capabilities through a dynamic feedback loop. We validated its performance on a number of core scientific tasks of climate change, ranging from model-observation comparisons to the diagnosis of complex phenomena. In a multi-expert evaluation, EarthLink produced scientifically sound analyses and demonstrated an analytical competency that was rated as comparable to specific aspects of a human junior researcher's workflow. Additionally, its transparent, auditable workflows and natural language interface empower scientists to shift from laborious manual execution to strategic oversight and hypothesis generation. EarthLink marks a pivotal step towards an efficient, trustworthy, and collaborative paradigm for Earth system research in an era of accelerating global change. The system is accessible at our website https://earthlink.intern-ai.org.cn.
中文摘要:EarthLink作为首个地球科学AI协作者,能自动化端到端研究流程并通过交互学习提升能力,推动高效可信的地球系统研究。
English Summary: EarthLink is an AI copilot that automates Earth science workflows, enabling rapid analysis and learning from interactions to support scientific discovery.
Authors:Jiamin Wu, Zichen Ren, Junyu Wang, Pengyu Zhu, Yonghao Song, Mianxin Liu, Qihao Zheng, Lei Bai, Wanli Ouyang, Chunfeng Song
Abstract:
Non-invasive Brain-Computer Interfaces (BCI) offer a safe and accessible means of connecting the human brain to external devices, with broad applications in home and clinical settings to enhance human capabilities. However, the high noise level and limited task-specific data in non-invasive signals constrain decoding capabilities. Recently, the adoption of self-supervised pre-training is transforming the landscape of non-invasive BCI research, enabling the development of brain foundation models to capture generic neural representations from large-scale unlabeled electroencephalography (EEG) signals with substantial noises. However, despite these advances, the field currently lacks comprehensive, practical and extensible benchmarks to assess the utility of the public foundation models across diverse BCI tasks, hindering their widespread adoption. To address this challenge, we present AdaBrain-Bench, a large-scale standardized benchmark to systematically evaluate brain foundation models in widespread non-invasive BCI tasks. AdaBrain-Bench encompasses a diverse collection of representative BCI decoding datasets spanning 7 key applications. It introduces a streamlined task adaptation pipeline integrated with multi-dimensional evaluation metrics and a set of adaptation tools. The benchmark delivers an inclusive framework for assessing generalizability of brain foundation models across key transfer settings, including cross-subject, multi-subject, and few-shot scenarios. We leverage AdaBrain-Bench to evaluate a suite of publicly available brain foundation models and offer insights into practices for selecting appropriate models in various scenarios. We make our benchmark pipeline available to enable reproducible research and external use, offering a continuously evolving platform to foster progress toward robust and generalized neural decoding solutions.
中文: 非侵入式脑机接口面临信号噪声大和任务数据有限的问题,自监督预训练正通过构建脑基础模型推动该领域发展,但缺乏全面的评估基准;为此,我们提出AdaBrain-Bench这一大规模标准化基准,系统评估脑基础模型在多种BCI任务中的表现,提供适应性工具和见解以促进广泛应用。
English: Non-invasive Brain-Computer Interfaces face challenges from noisy signals and limited data, but self-supervised pre-training is advancing the field by enabling brain foundation models, though comprehensive benchmarks are lacking; AdaBrain-Bench is introduced as a large-scale, standardized benchmark to evaluate these models across diverse BCI tasks, providing tools and insights for widespread adoption.
Authors:Changze Lv, Jiang Zhou, Siyu Long, Lihao Wang, Jiangtao Feng, Dongyu Xue, Yu Pei, Hao Wang, Zherui Zhang, Yuchen Cai, Zhiqiang Gao, Ziyuan Ma, Jiakai Hu, Chaochen Gao, Jingjing Gong, Yuxuan Song, Shuyi Zhang, Xiaoqing Zheng, Deyi Xiong, Lei Bai, Wanli Ouyang, Ya-Qin Zhang, Wei-Ying Ma, Bowen Zhou, Hao Zhou
Abstract:
We introduce AMix-1, a powerful protein foundation model built on Bayesian Flow Networks and empowered by a systematic training methodology, encompassing pretraining scaling laws, emergent capability analysis, in-context learning mechanism, and test-time scaling algorithm. To guarantee robust scalability, we establish a predictive scaling law and reveal the progressive emergence of structural understanding via loss perspective, culminating in a strong 1.7-billion model. Building on this foundation, we devise a multiple sequence alignment (MSA)-based in-context learning strategy to unify protein design into a general framework, where AMix-1 recognizes deep evolutionary signals among MSAs and consistently generates structurally and functionally coherent proteins. This framework enables the successful design of a dramatically improved AmeR variant with an up to $50\times$ activity increase over its wild type. Pushing the boundaries of protein engineering, we further empower AMix-1 with an evolutionary test-time scaling algorithm for in silico directed evolution that delivers substantial, scalable performance gains as verification budgets are intensified, laying the groundwork for next-generation lab-in-the-loop protein design.
中文摘要:我们推出AMix-1这一强大的蛋白质基础模型,它通过预测性扩展规律和上下文学习策略统一蛋白质设计框架,不仅成功将AmeR变体活性提升50倍,更建立了具备可扩展性能的计算机模拟定向进化系统。
English Summary: We introduce AMix-1, a powerful protein foundation model that integrates predictive scaling laws and in-context learning to design structurally coherent proteins, achieving up to 50× activity improvement and enabling scalable in silico directed evolution.
Authors:Hengran Zhang, Keping Bi, Jiafeng Guo, Jiaming Zhang, Shuaiqiang Wang, Dawei Yin, Xueqi Cheng
Abstract:
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating retrieved information. Standard retrieval process prioritized relevance, focusing on topical alignment between queries and passages. In contrast, in RAG, the emphasis has shifted to utility, which considers the usefulness of passages for generating accurate answers. Despite empirical evidence showing the benefits of utility-based retrieval in RAG, the high computational cost of using LLMs for utility judgments limits the number of passages evaluated. This restriction is problematic for complex queries requiring extensive information. To address this, we propose a method to distill the utility judgment capabilities of LLMs into smaller, more efficient models. Our approach focuses on utility-based selection rather than ranking, enabling dynamic passage selection tailored to specific queries without the need for fixed thresholds. We train student models to learn pseudo-answer generation and utility judgments from teacher LLMs, using a sliding window method that dynamically selects useful passages. Our experiments demonstrate that utility-based selection provides a flexible and cost-effective solution for RAG, significantly reducing computational costs while improving answer quality. We present the distillation results using Qwen3-32B as the teacher model for both relevance ranking and utility-based selection, distilled into RankQwen1.7B and UtilityQwen1.7B. Our findings indicate that for complex questions, utility-based selection is more effective than relevance ranking in enhancing answer generation performance. We will release the relevance ranking and utility-based selection annotations for the MS MARCO dataset, supporting further research in this area.
中文摘要:本研究提出一种将大语言模型的效用判断能力蒸馏到小模型的方法,用于检索增强生成,通过动态段落选择在降低计算成本的同时提升复杂问题的回答质量。
English Summary: This study introduces a method to distill large language models' utility judgment capabilities into smaller models for retrieval-augmented generation, enabling dynamic passage selection that reduces computational costs while improving answer quality for complex queries.
Authors:Hengran Zhang, Keping Bi, Jiafeng Guo, Jiaming Zhang, Shuaiqiang Wang, Dawei Yin, Xueqi Cheng
Abstract:
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating retrieved information. Standard retrieval process prioritized relevance, focusing on topical alignment between queries and passages. In contrast, in RAG, the emphasis has shifted to utility, which considers the usefulness of passages for generating accurate answers. Despite empirical evidence showing the benefits of utility-based retrieval in RAG, the high computational cost of using LLMs for utility judgments limits the number of passages evaluated. This restriction is problematic for complex queries requiring extensive information. To address this, we propose a method to distill the utility judgment capabilities of LLMs into smaller, more efficient models. Our approach focuses on utility-based selection rather than ranking, enabling dynamic passage selection tailored to specific queries without the need for fixed thresholds. We train student models to learn pseudo-answer generation and utility judgments from teacher LLMs, using a sliding window method that dynamically selects useful passages. Our experiments demonstrate that utility-based selection provides a flexible and cost-effective solution for RAG, significantly reducing computational costs while improving answer quality. We present the distillation results using Qwen3-32B as the teacher model for both relevance ranking and utility-based selection, distilled into RankQwen1.7B and UtilityQwen1.7B. Our findings indicate that for complex questions, utility-based selection is more effective than relevance ranking in enhancing answer generation performance. We will release the relevance ranking and utility-based selection annotations for the MS MARCO dataset, supporting further research in this area.
中文摘要:本研究提出一种将大语言模型的效用判断能力蒸馏到小模型的方法,用于检索增强生成,通过动态段落选择在降低计算成本的同时提升复杂问题的回答质量。
English Summary: This study introduces a method to distill large language models' utility judgment capabilities into smaller models for retrieval-augmented generation, enabling dynamic passage selection that reduces computational costs while improving answer quality for complex queries.
Authors:Baolong Bi, Shenghua Liu, Xingzhang Ren, Dayiheng Liu, Junyang Lin, Yiwei Wang, Lingrui Mei, Junfeng Fang, Jiafeng Guo, Xueqi Cheng
Abstract:
The foundational capabilities of large language models (LLMs) are deeply influenced by the quality of their pre-training corpora. However, enhancing data quality at scale remains a significant challenge, primarily due to the trade-off between refinement effectiveness and processing efficiency. While rule-based filtering remains the dominant paradigm, it typically operates at the document level and lacks the granularity needed to refine specific content within documents. Inspired by emerging work such as ProX, we propose $\textbf{RefineX}$, a novel framework for large-scale, surgical refinement of pre-training data through programmatic editing tasks. RefineX enables efficient and fine-grained data refinement while reliably preserving the diversity and naturalness of raw text. The core strength of RefineX lies in distilling high-quality, expert-guided end-to-end refinement results into minimal edit-based deletion programs. This high-precision distillation pipeline is used to train an efficient and reliable refine model that can systematically improve every instance in the corpus at scale. We evaluate RefineX across from-scratch pre-training at multiple model scales and find that it consistently outperforms models trained on raw, filtered, or alternatively refined data across diverse downstream tasks. On the 750M model, RefineX yields 2.6%-7.2% average gains on lighteval tasks, and achieves comparable performance using significantly fewer training tokens. Further analysis shows that RefineX reliably enhances text quality with both high efficiency and precision, outperforming prior approaches such as end-to-end generation and Prox-C. These results position RefineX as a scalable, effective, and reliable solution for optimizing pre-training data in modern LLM pipelines.
中文: RefineX是一种通过程序化编辑实现大规模细粒度预训练数据优化的创新框架,在保持文本多样性和自然性的同时,能持续在不同模型规模和下游任务中超越其他数据优化方法。
English: RefineX is a novel framework that enables large-scale, fine-grained refinement of pre-training data through programmatic editing, consistently outperforming other methods across model scales and downstream tasks while preserving text diversity and naturalness.
Authors:Haoran Lu, Luyang Fang, Ruidong Zhang, Xinliang Li, Jiazhang Cai, Huimin Cheng, Lin Tang, Ziyu Liu, Zeliang Sun, Tao Wang, Yingchuan Zhang, Arif Hassan Zidan, Jinwen Xu, Jincheng Yu, Meizhi Yu, Hanqi Jiang, Xilin Gong, Weidi Luo, Bolun Sun, Yongkai Chen, Terry Ma, Shushan Wu, Yifan Zhou, Junhao Chen, Haotian Xiang, Jing Zhang, Afrar Jahin, Wei Ruan, Ke Deng, Yi Pan, Peilong Wang, Jiahui Li, Zhengliang Liu, Lu Zhang, Lin Zhao, Wei Liu, Dajiang Zhu, Xin Xing, Fei Dou, Wei Zhang, Chao Huang, Rongjie Liu, Mengrui Zhang, Yiwen Liu, Xiaoxiao Sun, Qin Lu, Zhen Xiang, Wenxuan Zhong, Tianming Liu, Ping Ma
Abstract:
Due to the remarkable capabilities and growing impact of large language models (LLMs), they have been deeply integrated into many aspects of society. Thus, ensuring their alignment with human values and intentions has emerged as a critical challenge. This survey provides a comprehensive overview of practical alignment techniques, training protocols, and empirical findings in LLM alignment. We analyze the development of alignment methods across diverse paradigms, characterizing the fundamental trade-offs between core alignment objectives. Our analysis shows that while supervised fine-tuning enables basic instruction-following, preference-based methods offer more flexibility for aligning with nuanced human intent. We discuss state-of-the-art techniques, including Direct Preference Optimization (DPO), Constitutional AI, brain-inspired methods, and alignment uncertainty quantification (AUQ), highlighting their approaches to balancing quality and efficiency. We review existing evaluation frameworks and benchmarking datasets, emphasizing limitations such as reward misspecification, distributional robustness, and scalable oversight. We summarize strategies adopted by leading AI labs to illustrate the current state of practice. We conclude by outlining open problems in oversight, value pluralism, robustness, and continuous alignment. This survey aims to inform both researchers and practitioners navigating the evolving landscape of LLM alignment.
Chinese: 本综述全面探讨了大语言模型与人类价值观对齐的实用技术、训练方案及实证发现,重点分析了直接偏好优化等关键方法,并指出了奖励设定偏差和可扩展监督等挑战。
English: This survey comprehensively examines the practical techniques, training protocols, and empirical findings in aligning large language models with human values, highlighting key methods like Direct Preference Optimization and addressing challenges such as reward misspecification and scalable oversight.
Authors:Sohan Shankar, Yi Pan, Hanqi Jiang, Zhengliang Liu, Mohammad R. Darbandi, Agustin Lorenzo, Junhao Chen, Md Mehedi Hasan, Arif Hassan Zidan, Eliana Gelman, Joshua A. Konfrst, Jillian Y. Russell, Katelyn Fernandes, Tianze Yang, Yiwei Li, Huaqin Zhao, Afrar Jahin, Triparna Ganguly, Shair Dinesha, Yifan Zhou, Zihao Wu, Xinliang Li, Lokesh Adusumilli, Aziza Hussein, Sagar Nookarapu, Jixin Hou, Kun Jiang, Jiaxi Li, Brenden Heinel, XianShen Xi, Hailey Hubbard, Zayna Khan, Levi Whitaker, Ivan Cao, Max Allgaier, Andrew Darby, Lin Zhao, Lu Zhang, Xiaoqiao Wang, Xiang Li, Wei Zhang, Xiaowei Yu, Dajiang Zhu, Yohannes Abate, Tianming Liu
Abstract:
This position and survey paper identifies the emerging convergence of neuroscience, artificial general intelligence (AGI), and neuromorphic computing toward a unified research paradigm. Using a framework grounded in brain physiology, we highlight how synaptic plasticity, sparse spike-based communication, and multimodal association provide design principles for next-generation AGI systems that potentially combine both human and machine intelligences. The review traces this evolution from early connectionist models to state-of-the-art large language models, demonstrating how key innovations like transformer attention, foundation-model pre-training, and multi-agent architectures mirror neurobiological processes like cortical mechanisms, working memory, and episodic consolidation. We then discuss emerging physical substrates capable of breaking the von Neumann bottleneck to achieve brain-scale efficiency in silicon: memristive crossbars, in-memory compute arrays, and emerging quantum and photonic devices. There are four critical challenges at this intersection: 1) integrating spiking dynamics with foundation models, 2) maintaining lifelong plasticity without catastrophic forgetting, 3) unifying language with sensorimotor learning in embodied agents, and 4) enforcing ethical safeguards in advanced neuromorphic autonomous systems. This combined perspective across neuroscience, computation, and hardware offers an integrative agenda for in each of these fields.
本文概述了神经科学、通用人工智能与神经形态计算的融合趋势,提出了基于大脑机制的新一代智能系统设计原则,并指出了该领域发展中的关键挑战。
This paper outlines the convergence of neuroscience, AGI, and neuromorphic computing, proposing brain-inspired design principles for next-generation intelligent systems and identifying key challenges in their development.
Authors:Cheng Qian, Zuxin Liu, Akshara Prabhakar, Zhiwei Liu, Jianguo Zhang, Haolin Chen, Heng Ji, Weiran Yao, Shelby Heinecke, Silvio Savarese, Caiming Xiong, Huan Wang
Abstract:
Large Language Models (LLMs)-based agents have made impressive progress in reasoning and tool use, enabling them to solve complex tasks. However, their ability to proactively collaborate with users, especially when goals are vague, evolving, or indirectly expressed, remains underexplored. To address this gap, we introduce UserBench, a user-centric benchmark designed to evaluate agents in multi-turn, preference-driven interactions. UserBench features simulated users who start with underspecified goals and reveal preferences incrementally, requiring agents to proactively clarify intent and make grounded decisions with tools. Our evaluation of leading open- and closed-source LLMs reveals a significant disconnect between task completion and user alignment. For instance, models provide answers that fully align with all user intents only 20% of the time on average, and even the most advanced models uncover fewer than 30% of all user preferences through active interaction. These results highlight the challenges of building agents that are not just capable task executors, but true collaborative partners. UserBench offers an interactive environment to measure and advance this critical capability.
中文:UserBench作为以用户为中心的基准,通过模拟目标模糊且偏好逐步展现的交互环境,评估智能体主动协作能力,发现现有模型虽能完成任务但用户意图对齐度不足,揭示了构建真正协作伙伴的挑战。
English: UserBench is a new benchmark designed to evaluate LLM agents' ability to proactively collaborate with users by clarifying vague goals and incrementally revealed preferences, revealing that current models struggle with user alignment despite strong task performance.
Authors:Qian Dong, Jia Chen, Qingyao Ai, Hongning Wang, Haitao Li, Yi Wu, Yao Hu, Yiqun Liu, Shaoping Ma
Abstract:
Existing retrieval-augmented code generation (RACG) methods typically use an external retrieval module to fetch semantically similar code snippets used for generating subsequent fragments. However, even for consecutive code fragments, the content often diverges due to logical progression, resulting in a content gap. This gap undermines the performance of current RACG methods, as \textit{external} retrieval modules based on content matching fail to infer the specific information need of LLMs to generate the next code fragment. Therefore, we propose \textbf{SelfRACG}, a novel paradigm that enables large language models (LLMs) to \textbf{Self}-express their information needs to enhance \textbf{RACG}. Specifically, SelfRACG includes an information need expression module and a two-stage information need-guided training strategy, which encourages LLMs to express their information need. Extensive experiments demonstrate that SelfRACG can retrieve external knowledge that better aligns with the LLM's own information needs, resulting in superior generation performance compared to vanilla RACG.
Chinese: 现有检索增强代码生成方法因检索内容与模型实际需求不匹配而产生内容断层,为此提出的SelfRACG框架让大语言模型能自主表达信息需求,从而实现更精准的知识检索和更优的生成性能。
English: Current retrieval-augmented code generation methods suffer from content gaps due to mismatches between retrieved snippets and the model's actual needs, prompting the proposed SelfRACG framework that enables LLMs to express their information requirements for improved alignment and performance.
Authors:Qian Dong, Jia Chen, Qingyao Ai, Hongning Wang, Haitao Li, Yi Wu, Yao Hu, Yiqun Liu, Shaoping Ma
Abstract:
Existing retrieval-augmented code generation (RACG) methods typically use an external retrieval module to fetch semantically similar code snippets used for generating subsequent fragments. However, even for consecutive code fragments, the content often diverges due to logical progression, resulting in a content gap. This gap undermines the performance of current RACG methods, as \textit{external} retrieval modules based on content matching fail to infer the specific information need of LLMs to generate the next code fragment. Therefore, we propose \textbf{SelfRACG}, a novel paradigm that enables large language models (LLMs) to \textbf{Self}-express their information needs to enhance \textbf{RACG}. Specifically, SelfRACG includes an information need expression module and a two-stage information need-guided training strategy, which encourages LLMs to express their information need. Extensive experiments demonstrate that SelfRACG can retrieve external knowledge that better aligns with the LLM's own information needs, resulting in superior generation performance compared to vanilla RACG.
Chinese: 现有检索增强代码生成方法因检索内容与模型实际需求不匹配而产生内容断层,为此提出的SelfRACG框架让大语言模型能自主表达信息需求,从而实现更精准的知识检索和更优的生成性能。
English: Current retrieval-augmented code generation methods suffer from content gaps due to mismatches between retrieved snippets and the model's actual needs, prompting the proposed SelfRACG framework that enables LLMs to express their information requirements for improved alignment and performance.
Authors:Shilong Zhao, Fei Sun, Kaike Zhang, Shaoling Jing, Du Su, Zhichao Shi, Zhiyi Yin, Huawei Shen, Xueqi Cheng
Abstract:
Recent studies have demonstrated the vulnerability of sequential recommender systems to Model Extraction Attacks (MEAs). MEAs collect responses from recommender systems to replicate their functionality, enabling unauthorized deployments and posing critical privacy and security risks. Black-box attacks in prior MEAs are ineffective at exposing recommender system vulnerabilities due to random sampling in data selection, which leads to misaligned synthetic and real-world distributions. To overcome this limitation, we propose LLM4MEA, a novel model extraction method that leverages Large Language Models (LLMs) as human-like rankers to generate data. It generates data through interactions between the LLM ranker and target recommender system. In each interaction, the LLM ranker analyzes historical interactions to understand user behavior, and selects items from recommendations with consistent preferences to extend the interaction history, which serves as training data for MEA. Extensive experiments demonstrate that LLM4MEA significantly outperforms existing approaches in data quality and attack performance, reducing the divergence between synthetic and real-world data by up to 64.98% and improving MEA performance by 44.82% on average. From a defensive perspective, we propose a simple yet effective defense strategy and identify key hyperparameters of recommender systems that can mitigate the risk of MEAs.
Chinese: 近期研究表明,序列推荐系统易受模型提取攻击,而提出的LLM4MEA方法利用大型语言模型模拟人类排序器生成高质量数据,显著提升攻击性能,同时将合成数据与真实数据分布差异降低高达64.98%。
English: Recent studies reveal that sequential recommender systems are vulnerable to Model Extraction Attacks (MEAs), and the proposed LLM4MEA method leverages Large Language Models as human-like rankers to generate high-quality data, significantly improving attack performance while reducing distribution divergence by up to 64.98%.
Authors:Congmin Zheng, Jiachen Zhu, Jianghao Lin, Xinyi Dai, Yong Yu, Weinan Zhang, Mengyue Yang
Abstract:
Process Reward Models (PRMs) play a central role in evaluating and guiding multi-step reasoning in large language models (LLMs), especially for mathematical problem solving. However, we identify a pervasive length bias in existing PRMs: they tend to assign higher scores to longer reasoning steps, even when the semantic content and logical validity are unchanged. This bias undermines the reliability of reward predictions and leads to overly verbose outputs during inference. To address this issue, we propose CoLD(Counterfactually-Guided Length Debiasing), a unified framework that mitigates length bias through three components: an explicit length-penalty adjustment, a learned bias estimator trained to capture spurious length-related signals, and a joint training strategy that enforces length-invariance in reward predictions. Our approach is grounded in counterfactual reasoning and informed by causal graph analysis. Extensive experiments on MATH500 and GSM-Plus show that CoLD consistently reduces reward-length correlation, improves accuracy in step selection, and encourages more concise, logically valid reasoning. These results demonstrate the effectiveness and practicality of CoLD in improving the fidelity and robustness of PRMs.
中文: 本文发现过程奖励模型存在显著的长度偏见,即倾向于给更长的推理步骤更高评分,并提出CoLD框架通过反事实引导的长度去偏方法,有效减少该偏见,从而提升数学问题解决中奖励预测的准确性和推理过程的简洁有效性。
English: This paper identifies a significant length bias in Process Reward Models (PRMs) that favors longer reasoning steps regardless of semantic quality, and introduces CoLD, a counterfactually-guided debiasing framework that effectively reduces this bias to enhance reward accuracy and promote more concise, valid reasoning in mathematical problem-solving.
Authors:Rithesh Murthy, Ming Zhu, Liangwei Yang, Jielin Qiu, Juntao Tan, Shelby Heinecke, Caiming Xiong, Silvio Savarese, Huan Wang
Abstract:
Large Language Models (LLMs) perform best with well-crafted prompts, yet prompt engineering remains manual, inconsistent, and inaccessible to non-experts. We introduce Promptomatix, an automatic prompt optimization framework that transforms natural language task descriptions into high-quality prompts without requiring manual tuning or domain expertise. Promptomatix supports both a lightweight meta-prompt-based optimizer and a DSPy-powered compiler, with modular design enabling future extension to more advanced frameworks. The system analyzes user intent, generates synthetic training data, selects prompting strategies, and refines prompts using cost-aware objectives. Evaluated across 5 task categories, Promptomatix achieves competitive or superior performance compared to existing libraries, while reducing prompt length and computational overhead making prompt optimization scalable and efficient.
中文摘要:Promptomatix是一个自动提示优化框架,能将自然语言任务描述转化为高质量提示,无需人工调整即可在多类任务中实现优异性能,同时降低计算开销。
English Summary: Promptomatix is an automated framework that converts natural language task descriptions into optimized prompts, eliminating manual tuning while achieving competitive performance with reduced computational costs across diverse tasks.
Authors:Yiran Hu, Zongyue Xue, Haitao Li, Siyuan Zheng, Qingjing Chen, Shaochun Wang, Xihan Zhang, Ning Zheng, Yun Liu, Qingyao Ai, Yiqun Liu, Charles L. A. Clarke, Weixing Shen
Abstract:
Large Language Models (LLMs) are increasingly used in high-stakes fields where their decisions impact rights and equity. However, LLMs' judicial fairness and implications for social justice remain underexplored. When LLMs act as judges, the ability to fairly resolve judicial issues is a prerequisite to ensure their trustworthiness. Based on theories of judicial fairness, we construct a comprehensive framework to measure LLM fairness, leading to a selection of 65 labels and 161 corresponding values. Applying this framework to the judicial system, we compile an extensive dataset, JudiFair, comprising 177,100 unique case facts. To achieve robust statistical inference, we develop three evaluation metrics, inconsistency, bias, and imbalanced inaccuracy, and introduce a method to assess the overall fairness of multiple LLMs across various labels. Through experiments with 16 LLMs, we uncover pervasive inconsistency, bias, and imbalanced inaccuracy across models, underscoring severe LLM judicial unfairness. Particularly, LLMs display notably more pronounced biases on demographic labels, with slightly less bias on substance labels compared to procedure ones. Interestingly, increased inconsistency correlates with reduced biases, but more accurate predictions exacerbate biases. While we find that adjusting the temperature parameter can influence LLM fairness, model size, release date, and country of origin do not exhibit significant effects on judicial fairness. Accordingly, we introduce a publicly available toolkit containing all datasets and code, designed to support future research in evaluating and improving LLM fairness.
中文: 本研究构建了一个评估大语言模型司法公平性的综合框架,通过对16个模型的测试发现普遍存在不一致性、偏见和失衡误差,尤其在人口统计标签上更为显著,并提供了公开工具包以支持未来公平性研究。
English: This study develops a comprehensive framework to evaluate the judicial fairness of Large Language Models (LLMs), revealing significant inconsistencies, biases, and imbalanced inaccuracies across 16 tested models, particularly regarding demographic factors, and provides an open toolkit to support future fairness research.
Authors:Yan Yang, Dongxu Li, Yutong Dai, Yuhao Yang, Ziyang Luo, Zirui Zhao, Zhiyuan Hu, Junzhe Huang, Amrita Saha, Zeyuan Chen, Ran Xu, Liyuan Pan, Silvio Savarese, Caiming Xiong, Junnan Li
Abstract:
Graphical user interface (GUI) agents autonomously complete tasks across platforms (\eg, Linux) by sequentially decomposing user instructions into action proposals that iteratively interact with visual elements in the evolving environment. However, two main challenges arise: i) planning (\ie, the action proposal sequence) under expansive action space, where selecting an appropriate plan is non-trivial, as many valid ones may exist; ii) accurately grounding actions in complex and high-resolution interfaces, \ie, precisely interacting with visual targets. This paper investigates the aforementioned challenges with our \textbf{G}UI \textbf{T}est-time Scaling \textbf{A}gent, namely GTA1. First, we conduct test-time scaling to select the most appropriate action proposal: at each step, multiple candidate proposals are sampled and evaluated and selected by a judge model. It trades off computation for better decision quality by concurrent sampling. Second, we propose a model that improves grounding of the selected action proposals to its corresponding visual elements. Our key insight is that reinforcement learning (RL) facilitates grounding through inherent objective alignments, rewarding successful clicks on interface elements. Experimentally, GTA1 achieves state-of-the-art performance on both grounding and agent task execution benchmarks. The code and models are released here.
中文: 图形用户界面(GUI)代理通过将用户指令分解为动作序列并与视觉元素交互来自主完成任务,但面临任务规划模糊性和动作精准定位的挑战,本文通过测试时扩展策略选择最优动作,并利用强化学习实现精确的视觉目标交互来解决这些问题。
English: GUI agents autonomously perform tasks by decomposing user instructions into action sequences and interacting with visual elements, but face challenges in task planning ambiguity and accurate action grounding, which are addressed through test-time scaling for optimal action selection and reinforcement learning for precise visual targeting.
Authors:Hongzhe Bi, Lingxuan Wu, Tianwei Lin, Hengkai Tan, Zhizhong Su, Hang Su, Jun Zhu
Abstract:
Imitation learning for robotic manipulation faces a fundamental challenge: the scarcity of large-scale, high-quality robot demonstration data. Recent robotic foundation models often pre-train on cross-embodiment robot datasets to increase data scale, while they face significant limitations as the diverse morphologies and action spaces across different robot embodiments make unified training challenging. In this paper, we present H-RDT (Human to Robotics Diffusion Transformer), a novel approach that leverages human manipulation data to enhance robot manipulation capabilities. Our key insight is that large-scale egocentric human manipulation videos with paired 3D hand pose annotations provide rich behavioral priors that capture natural manipulation strategies and can benefit robotic policy learning. We introduce a two-stage training paradigm: (1) pre-training on large-scale egocentric human manipulation data, and (2) cross-embodiment fine-tuning on robot-specific data with modular action encoders and decoders. Built on a diffusion transformer architecture with 2B parameters, H-RDT uses flow matching to model complex action distributions. Extensive evaluations encompassing both simulation and real-world experiments, single-task and multitask scenarios, as well as few-shot learning and robustness assessments, demonstrate that H-RDT outperforms training from scratch and existing state-of-the-art methods, including Pi0 and RDT, achieving significant improvements of 13.9% and 40.5% over training from scratch in simulation and real-world experiments, respectively. The results validate our core hypothesis that human manipulation data can serve as a powerful foundation for learning bimanual robotic manipulation policies.
中文摘要:H-RDT是一种创新方法,通过利用大规模人类操作数据构建双阶段训练范式,有效解决了机器人模仿学习的数据稀缺问题,在仿真和现实实验中均显著优于现有最佳方法。
English Summary: H-RDT is a novel approach that leverages large-scale human manipulation data to overcome robotic imitation learning limitations, achieving superior performance through a two-stage training paradigm with significant improvements over existing methods.
Authors:Yunrui Yu, Kafeng Wang, Hang Su, Jun Zhu
Abstract:
Despite their widespread success, deep neural networks remain critically vulnerable to adversarial attacks, posing significant risks in safety-sensitive applications. This paper investigates activation functions as a crucial yet underexplored component for enhancing model robustness. We propose a Rademacher Complexity Reduction Activation Function (RCR-AF), a novel activation function designed to improve both generalization and adversarial resilience. RCR-AF uniquely combines the advantages of GELU (including smoothness, gradient stability, and negative information retention) with ReLU's desirable monotonicity, while simultaneously controlling both model sparsity and capacity through built-in clipping mechanisms governed by two hyperparameters, $α$ and $γ$. Our theoretical analysis, grounded in Rademacher complexity, demonstrates that these parameters directly modulate the model's Rademacher complexity, offering a principled approach to enhance robustness. Comprehensive empirical evaluations show that RCR-AF consistently outperforms widely-used alternatives (ReLU, GELU, and Swish) in both clean accuracy under standard training and in adversarial robustness within adversarial training paradigms.
中文摘要:本文提出RCR-AF激活函数,通过结合GELU的平滑性与ReLU的单调性,利用超参数控制模型复杂度,在保持精度的同时显著提升了神经网络对抗攻击的鲁棒性。
English Summary: This paper introduces RCR-AF, a novel activation function combining GELU's smoothness with ReLU's monotonicity to enhance neural network robustness by controlling model complexity through hyperparameters, demonstrating superior performance in both accuracy and adversarial resilience.
Authors:Yunrui Yu, Hang Su, Cheng-zhong Xu, Zhizhong Su, Jun Zhu
Abstract:
Gradient-based adversarial attacks using the Cross-Entropy (CE) loss often suffer from overestimation due to relative errors in gradient computation induced by floating-point arithmetic. This paper provides a rigorous theoretical analysis of these errors, conducting the first comprehensive study of floating-point computation errors in gradient-based attacks across four distinct scenarios: (i) unsuccessful untargeted attacks, (ii) successful untargeted attacks, (iii) unsuccessful targeted attacks, and (iv) successful targeted attacks. We establish theoretical foundations characterizing the behavior of relative numerical errors under different attack conditions, revealing previously unknown patterns in gradient computation instability, and identify floating-point underflow and rounding as key contributors. Building on this insight, we propose the Theoretical MIFPE (T-MIFPE) loss function, which incorporates an optimal scaling factor $T = t^*$ to minimize the impact of floating-point errors, thereby enhancing the accuracy of gradient computation in adversarial attacks. Extensive experiments on the MNIST, CIFAR-10, and CIFAR-100 datasets demonstrate that T-MIFPE outperforms existing loss functions, including CE, C\&W, DLR, and MIFPE, in terms of attack potency and robustness evaluation accuracy.
中文: 本文从理论上分析了基于梯度的对抗攻击中的浮点计算误差,并提出带有最优缩放因子的T-MIFPE损失函数来减轻这些误差,实验证明其在多个数据集上优于现有方法。
English: This paper theoretically analyzes floating-point computation errors in gradient-based adversarial attacks and introduces the T-MIFPE loss function with an optimal scaling factor to mitigate these errors, demonstrating superior performance across multiple datasets compared to existing methods.
Authors:Xiao Yang, Lingxuan Wu, Lizhong Wang, Chengyang Ying, Hang Su, Jun Zhu
Abstract:
Adversarial attacks in 3D environments have emerged as a critical threat to the reliability of visual perception systems, particularly in safety-sensitive applications such as identity verification and autonomous driving. These attacks employ adversarial patches and 3D objects to manipulate deep neural network (DNN) predictions by exploiting vulnerabilities within complex scenes. Existing defense mechanisms, such as adversarial training and purification, primarily employ passive strategies to enhance robustness. However, these approaches often rely on pre-defined assumptions about adversarial tactics, limiting their adaptability in dynamic 3D settings. To address these challenges, we introduce Reinforced Embodied Active Defense (Rein-EAD), a proactive defense framework that leverages adaptive exploration and interaction with the environment to improve perception robustness in 3D adversarial contexts. By implementing a multi-step objective that balances immediate prediction accuracy with predictive entropy minimization, Rein-EAD optimizes defense strategies over a multi-step horizon. Additionally, Rein-EAD involves an uncertainty-oriented reward-shaping mechanism that facilitates efficient policy updates, thereby reducing computational overhead and supporting real-world applicability without the need for differentiable environments. Comprehensive experiments validate the effectiveness of Rein-EAD, demonstrating a substantial reduction in attack success rates while preserving standard accuracy across diverse tasks. Notably, Rein-EAD exhibits robust generalization to unseen and adaptive attacks, making it suitable for real-world complex tasks, including 3D object classification, face recognition and autonomous driving.
中文: Rein-EAD框架提出了一种主动防御策略,通过自适应探索和环境交互来增强三维感知系统对抗攻击的鲁棒性,显著降低攻击成功率,并在多种任务中保持准确性。
English: The Rein-EAD framework introduces a proactive defense strategy that uses adaptive exploration and environmental interaction to enhance 3D perception robustness against adversarial attacks, significantly reducing attack success while maintaining accuracy across various tasks.
Authors:Yao Feng, Hengkai Tan, Xinyi Mao, Guodong Liu, Shuhe Huang, Chendong Xiang, Hang Su, Jun Zhu
Abstract:
Bimanual robotic manipulation, which involves the coordinated control of two robotic arms, is foundational for solving challenging tasks. Despite recent progress in general-purpose manipulation, data scarcity and embodiment heterogeneity remain serious obstacles to further scaling up in bimanual settings. In this paper, we introduce Video Diffusion for Action Reasoning (Vidar), a two-stage framework that leverages large-scale, diffusion-based video pre-training and a novel masked inverse dynamics model for action prediction. We pre-train the video diffusion model on 750K multi-view videos from three real-world bimanual robot platforms, utilizing a unified observation space that encodes robot, camera, task, and scene contexts. Our masked inverse dynamics model learns masks to extract action-relevant information from generated trajectories without requiring pixel-level labels, and the masks can effectively generalize to unseen backgrounds. Our experiments demonstrate that with only 20 minutes of human demonstrations on an unseen robot platform (only 1% of typical data requirements), Vidar generalizes to unseen tasks and backgrounds with strong semantic understanding, surpassing state-of-the-art methods. Our findings highlight the potential of video foundation models, coupled with masked action prediction, to enable scalable and generalizable robotic manipulation in diverse real-world settings.
中文摘要:Vidar提出了一种基于视频先验的低样本适应范式,通过迁移性视频知识替代大量机器人特定数据,仅需1%的常规演示量即可在新机器人上实现优于现有方法的操作性能。
English Summary: Vidar introduces a low-shot adaptation paradigm using transferable video priors to enable general-purpose manipulation across different robots with minimal new data, outperforming existing methods with only 1% of typical demonstration requirements.
Authors:Yao Feng, Hengkai Tan, Xinyi Mao, Chendong Xiang, Guodong Liu, Shuhe Huang, Hang Su, Jun Zhu
Abstract:
Scaling general-purpose manipulation to new robot embodiments remains challenging: each platform typically needs large, homogeneous demonstrations, and pixel-to-action VLA pipelines typically degenerate under background and viewpoint shifts. In this paper, we present Vidar, a prior-driven, low-shot adaptation paradigm that replaces most embodiment-specific data with transferable video priors. Vidar consists of an embodied video diffusion model as the generalizable prior and a masked inverse dynamics model (MIDM) adapter based on a key decoupling of the policy. The embodied diffusion model is pre-trained on Internet-scale videos and then domain-adapted to 750K multi-view trajectories from three real-world robot platforms using a unified observation space encoding robot, camera, task, and scene contexts. The MIDM module learns action-relevant pixel masks without dense labels, grounding the prior into the target embodiment's action space while suppressing distractors. Crucially, the generative video prior models the distribution of plausible, temporally coherent interactions, implicitly capturing affordances, contact dynamics, and physical consistency from massive unlabeled video. This shifts the challenge from collecting large amounts of new robot data to efficiently aligning a rich prior with a new embodiment. With only 20 minutes of human demonstrations on an unseen robot (1% of typical data), Vidar outperforms state-of-the-art VLA baselines and generalizes to unseen tasks, backgrounds, and camera layouts. Our results suggest a scalable recipe for "one prior, many embodiments": strong, inexpensive video priors + minimal on-robot alignment.
中文摘要:Vidar提出了一种基于视频先验的低样本适应范式,通过迁移性视频知识替代大量机器人特定数据,仅需1%的常规演示量即可在新机器人上实现优于现有方法的操作性能。
English Summary: Vidar introduces a low-shot adaptation paradigm using transferable video priors to enable general-purpose manipulation across different robots with minimal new data, outperforming existing methods with only 1% of typical demonstration requirements.
Authors:Hengkai Tan, Yao Feng, Xinyi Mao, Shuhe Huang, Guodong Liu, Zhongkai Hao, Hang Su, Jun Zhu
Abstract:
Vision-language-action (VLA) models have shown promise on task-conditioned control in complex settings such as bimanual manipulation. However, the heavy reliance on task-specific human demonstrations limits their generalization and incurs high data acquisition costs. In this work, we present a new notion of task-agnostic action paradigm that decouples action execution from task-specific conditioning, enhancing scalability, efficiency, and cost-effectiveness. To address the data collection challenges posed by this paradigm -- such as low coverage density, behavioral redundancy, and safety risks -- we introduce ATARA (Automated Task-Agnostic Random Actions), a scalable self-supervised framework that accelerates collection by over $ 30\times $ compared to human teleoperation. To further enable effective learning from task-agnostic data, which often suffers from distribution mismatch and irrelevant trajectories, we propose AnyPos, an inverse dynamics model equipped with Arm-Decoupled Estimation and a Direction-Aware Decoder (DAD). We additionally integrate a video-conditioned action validation module to verify the feasibility of learned policies across diverse manipulation tasks. Extensive experiments show that the AnyPos-ATARA pipeline yields a 51% improvement in test accuracy and achieves 30-40% higher success rates in downstream tasks such as lifting, pick-and-place, and clicking, using replay-based video validation. Project Page: https://embodiedfoundation.github.io/vidar_anypos
Chinese Summary: 本研究提出了ATARA框架,通过自监督方式将任务无关动作数据收集效率提升30倍以上,并开发AnyPos逆动力学模型优化从该数据中的学习效果,在多类操作任务中实现了性能的显著提升。
English Summary: This research introduces ATARA, a scalable self-supervised framework that accelerates task-agnostic action data collection by over 30 times, and AnyPos, an inverse dynamics model that enhances learning efficiency from such data, achieving significant improvements in manipulation task performance.
Authors:Xiaozheng Gao, Yichen Wang, Bosen Liu, Xiao Zhou, Ruichen Zhang, Jiacheng Wang, Dusit Niyato, Dong In Kim, Abbas Jamalipour, Chau Yuen, Jianping An, Kai Yang
Abstract:
The development of satellite-augmented low-altitude economy and terrestrial networks (SLAETNs) demands intelligent and autonomous systems that can operate reliably across heterogeneous, dynamic, and mission-critical environments. To address these challenges, this survey focuses on enabling agentic artificial intelligence (AI), that is, artificial agents capable of perceiving, reasoning, and acting, through generative AI (GAI) and large language models (LLMs). We begin by introducing the architecture and characteristics of SLAETNs, and analyzing the challenges that arise in integrating satellite, aerial, and terrestrial components. Then, we present a model-driven foundation by systematically reviewing five major categories of generative models: variational autoencoders (VAEs), generative adversarial networks (GANs), generative diffusion models (GDMs), transformer-based models (TBMs), and LLMs. Moreover, we provide a comparative analysis to highlight their generative mechanisms, capabilities, and deployment trade-offs within SLAETNs. Building on this foundation, we examine how these models empower agentic functions across three domains: communication enhancement, security and privacy protection, and intelligent satellite tasks. Finally, we outline key future directions for building scalable, adaptive, and trustworthy generative agents in SLAETNs. This survey aims to provide a unified understanding and actionable reference for advancing agentic AI in next-generation integrated networks.
中文摘要:本综述探讨了如何通过生成式AI和大语言模型开发自主智能体,以应对卫星增强低空经济与地面网络中的异构动态环境挑战,重点提升通信、安全及智能任务处理能力。
English Summary: This survey explores how generative AI and large language models can enable autonomous artificial agents to address the challenges of satellite-augmented low-altitude economy and terrestrial networks by enhancing communications, security, and intelligent tasks.
Authors:Geng Sun, Chenbang Liu, Jiahui Li, Guannan Qu, Shuang Liang, Jiacheng Wang, Changyuan Zhao, Dusit Niyato
Abstract:
Unmanned aerial vehicle (UAV) swarms utilizing collaborative beamforming (CB) in low-altitude wireless networks (LAWN) demonstrate significant potential for enhanced communication range, energy efficiency, and signal directivity through the formation of virtual antenna arrays (VAA). However, environmental disturbances, particularly wind fields, significantly degrade CB performance by introducing positional errors that disrupt beam patterns, thereby compromising transmission reliability. This paper investigates the critical challenge of maintaining CB performance in UAV-based VAAs operating in LAWN under wind field disturbances. We propose a comprehensive framework that models the impact of three distinct wind conditions (constant, shear, and turbulent) on UAV array performance, and formulate a long-term real-time optimization problem to maximize directivity while minimizing maximum sidelobe levels through adaptive excitation current weight adjustments. To address the inherent complexity of this problem, we propose a novel proximal policy optimization algorithm with long short-term memory (LSTM) structure and adaptive learning rate (PPO-LA), which effectively captures temporal patterns in wind field disturbances and enables real-time adaptation without requiring extensive prior training for specific wind conditions. Our simulation results demonstrate that the proposed PPO-LA algorithm successfully recovers degraded CB performance across various wind scenarios, and thus significantly outperforming benchmark algorithms.
中文: 本文提出新型PPO-LA算法,通过自适应调整激励电流有效应对风场干扰下无人机集群协同波束成形的性能衰减问题,在各种风场环境中显著优于现有基准算法。
English: This paper introduces a novel PPO-LA algorithm that effectively mitigates wind-induced performance degradation in UAV swarm collaborative beamforming by adaptively adjusting excitation currents, significantly outperforming existing methods across diverse wind conditions.
Authors:Wen Zhang, Aimin Wang, Jiahui Li, Geng Sun, Jiacheng Wang, Weijie Yuan, Dusit Niyato
Abstract:
The integration of wireless power transfer (WPT) with Internet of Things (IoT) offers promising solutions for sensing applications, but faces significant challenges when deployed in hard-to-access areas such as high-temperature environments. In such extreme conditions, traditional fixed WPT infrastructure cannot be safely installed, and batteries rapidly degrade due to hardware failures. In this paper, we propose an uncrewed aerial vehicle (UAV)-assisted data collection and WPT framework for batteryless sensor (BLS) networks deployed in these challenging environments. Specifically, we consider a practical scenario where a UAV first transfers energy to BLS nodes via WPT, enabling these nodes to subsequently transmit their collected data to the UAV through orthogonal frequency-division multiple access (OFDMA). Then, we formulate a multi-objective optimization problem that aims to maximize the fair data collection volume while minimizing the UAV energy consumption through joint optimization of transmit power allocation and flight trajectory planning. Due to the non-convex nature and dynamic characteristics of this problem, conventional optimization methods prove inadequate. To address these challenges, we propose an enhanced soft actor-critic algorithm with parameter-free attention, prioritized experience replay, and value-based reward centering (SAC-PPV), thereby improving the exploration efficiency and learning stability of the algorithm in complex WPT scenarios. Simulation results demonstrate that the proposed approach consistently outperforms benchmark algorithms under various network configurations.
中文: 本文提出了一种无人机辅助的无线能量传输与数据收集框架,用于极端环境下的无电池传感器网络,通过采用改进的SAC-PPV算法进行多目标优化,有效提升了系统效率并优于现有基准算法。
English: This paper introduces a UAV-assisted framework for wireless power transfer and data collection in batteryless sensor networks within extreme environments, employing a multi-objective optimization approach enhanced by the SAC-PPV algorithm to improve efficiency and performance over existing methods.
Authors:Borui Zhang, Qihang Rao, Wenzhao Zheng, Jie Zhou, Jiwen Lu
Abstract:
Visual tokenizers are pivotal in multimodal large models, acting as bridges between continuous inputs and discrete tokens. Nevertheless, training high-compression-rate VQ-VAEs remains computationally demanding, often necessitating thousands of GPU hours. This work demonstrates that a pre-trained VAE can be efficiently transformed into a VQ-VAE by controlling quantization noise within the VAE's tolerance threshold. We present \textbf{Quantize-then-Rectify (ReVQ)}, a framework leveraging pre-trained VAEs to enable rapid VQ-VAE training with minimal computational overhead. By integrating \textbf{channel multi-group quantization} to enlarge codebook capacity and a \textbf{post rectifier} to mitigate quantization errors, ReVQ compresses ImageNet images into at most 512 tokens while sustaining competitive reconstruction quality (rFID = 1.06). Significantly, ReVQ reduces training costs by over two orders of magnitude relative to state-of-the-art approaches: ReVQ finishes full training on a single NVIDIA 4090 in approximately 22 hours, whereas comparable methods require 4.5 days on 32 A100 GPUs. Experimental results show that ReVQ achieves superior efficiency-reconstruction trade-offs.
中文: 本研究提出ReVQ框架,通过控制量化噪声并采用通道多组量化技术,将预训练VAE高效转化为VQ-VAE,在保持高质量图像重建的同时,将训练时间从数千GPU小时大幅缩短至单卡22小时。
English: This work introduces ReVQ, a framework that efficiently converts pre-trained VAEs into VQ-VAEs by managing quantization noise and using channel multi-group quantization, drastically reducing training time from thousands of GPU hours to just 22 hours on a single GPU while maintaining high image reconstruction quality.
Authors:Erxue Min, Hsiu-Yuan Huang, Xihong Yang, Min Yang, Xin Jia, Yunfang Wu, Hengyi Cai, Junfeng Wang, Shuaiqiang Wang, Dawei Yin
Abstract:
Generating effective query suggestions in conversational search requires aligning model outputs with user preferences, which is challenging due to sparse and noisy click signals. We propose GQS, a generative framework that integrates click modeling and preference optimization to enhance real-world user engagement. GQS consists of three key components: (1) a Multi-Source CTR Modeling module that captures diverse contextual signals to estimate fine-grained click-through rates; (2) a Diversity-Aware Preference Alignment strategy using CTR-weighted Direct Preference Optimization (DPO), which balances relevance and semantic diversity; and (3) a CTR-Calibrated Iterative Optimization process that jointly refines the CTR and generation models across training rounds. Experiments on two real-world tasks demonstrate that GQS outperforms strong baselines in CTR, relevance, and diversity.
中文: GQS是一个生成式框架,通过整合点击建模与偏好优化,利用多源点击率建模捕捉上下文信号,并采用CTR加权的直接偏好优化策略平衡相关性与多样性,从而有效提升用户参与度。
English: GQS is a generative framework that integrates click modeling and preference optimization to improve user engagement by capturing diverse contextual signals and balancing relevance with diversity through CTR-weighted DPO and iterative refinement.
Authors:Haoran Chen, Zexiao Wang, Haidong Cao, Zuxuan Wu, Yu-Gang Jiang
Abstract:
Large Vision-Language Models like CLIP have become a powerful foundation for Unsupervised Domain Adaptation due to their strong zero-shot generalization. State-of-the-art methods typically leverage CLIP to generate pseudo-labels for the target domain, then fine-tune the model to learn domain-invariant features. However, these methods attempt to align source and target domains using all pseudo-labeled data simultaneously. This one-shot alignment struggles with noisy, hard-to-classify samples, leading to error propagation and suboptimal feature learning. The problem is even more amplified in the multi-source scenario, where diverse domain gaps and varying noise levels across multiple source domains further destabilize the alignment process. To address this issue, in this work, we propose a progressive alignment strategy for adapting CLIP to unlabeled downstream task. Our method begins by training the model on a high-confidence subset of target samples, allowing it to first learn a well-aligned representation from the most reliable data. As training progresses, it gradually incorporates more challenging samples, guiding the model to refine its understanding without being overwhelmed by initial label noise. This progressive approach effectively mitigates confirmation bias and promotes a more robust convergence, allowing for the learning of genuinely domain-invariant features. We name our approach MP^2A and test it on three popular UDA benchmarks, namely ImageCLEF, Office-Home, and the most challenging DomainNet. Experiments showcase that MP^2A achieves state-of-the-art performance when compared with most recent CLIP-based MS-UDA approaches, demonstrating the effectiveness of our approach.
Chinese: 本文提出MP^2A渐进对齐策略,通过先训练高置信度目标样本再逐步引入困难样本的方式,有效减少基于CLIP的无监督域适应中的误差传播,在多个基准测试中实现了最优性能。
English: This paper introduces MP^2A, a progressive alignment strategy that mitigates error propagation in CLIP-based unsupervised domain adaptation by initially training on high-confidence target samples before gradually incorporating challenging data, achieving state-of-the-art performance on multiple benchmarks.
Authors:Hao Xiang, Tianyi Tang, Yang Su, Bowen Yu, An Yang, Fei Huang, Yichang Zhang, Yaojie Lu, Hongyu Lin, Xianpei Han, Jingren Zhou, Junyang Lin, Le Sun
Abstract:
Recent advancements in Large Language Models (LLMs) have shown outstanding potential for role-playing applications. Evaluating these capabilities is becoming crucial yet remains challenging. Existing benchmarks mostly adopt a \textbf{character-centric} approach, simplify user-character interactions to isolated Q&A tasks, and fail to reflect real-world applications. To address this limitation, we introduce RMTBench, a comprehensive \textbf{user-centric} bilingual role-playing benchmark featuring 80 diverse characters and over 8,000 dialogue rounds. RMTBench includes custom characters with detailed backgrounds and abstract characters defined by simple traits, enabling evaluation across various user scenarios. Our benchmark constructs dialogues based on explicit user motivations rather than character descriptions, ensuring alignment with practical user applications. Furthermore, we construct an authentic multi-turn dialogue simulation mechanism. With carefully selected evaluation dimensions and LLM-based scoring, this mechanism captures the complex intention of conversations between the user and the character. By shifting focus from character background to user intention fulfillment, RMTBench bridges the gap between academic evaluation and practical deployment requirements, offering a more effective framework for assessing role-playing capabilities in LLMs. All code and datasets will be released soon.
中文摘要:RMTBench是一个以用户为中心的双语角色扮演基准,通过80个多样化角色和超8000轮对话评估大语言模型,聚焦用户动机和多轮交互,更贴合实际应用需求。
English Summary: RMTBench is a user-centric bilingual benchmark designed to evaluate LLMs' role-playing abilities through 80 diverse characters and over 8,000 dialogue rounds, focusing on user motivations and multi-turn interactions to better align with real-world applications.
Authors:Shuyuan Tu, Zhen Xing, Xintong Han, Zhi-Qi Cheng, Qi Dai, Chong Luo, Zuxuan Wu, Yu-Gang Jiang
Abstract:
Current diffusion models for human image animation often struggle to maintain identity (ID) consistency, especially when the reference image and driving video differ significantly in body size or position. We introduce StableAnimator++, the first ID-preserving video diffusion framework with learnable pose alignment, capable of generating high-quality videos conditioned on a reference image and a pose sequence without any post-processing. Building upon a video diffusion model, StableAnimator++ contains carefully designed modules for both training and inference, striving for identity consistency. In particular, StableAnimator++ first uses learnable layers to predict the similarity transformation matrices between the reference image and the driven poses via injecting guidance from Singular Value Decomposition (SVD). These matrices align the driven poses with the reference image, mitigating misalignment to a great extent. StableAnimator++ then computes image and face embeddings using off-the-shelf encoders, refining the face embeddings via a global content-aware Face Encoder. To further maintain ID, we introduce a distribution-aware ID Adapter that counteracts interference caused by temporal layers while preserving ID via distribution alignment. During the inference stage, we propose a novel Hamilton-Jacobi-Bellman (HJB) based face optimization integrated into the denoising process, guiding the diffusion trajectory for enhanced facial fidelity. Experiments on benchmarks show the effectiveness of StableAnimator++ both qualitatively and quantitatively.
Chinese: StableAnimator++ 提出了一种具有可学习姿态对齐和身份保持模块的新型视频扩散框架,能够在不同姿态和体型条件下生成高质量人体动画并保持身份一致性。
English: StableAnimator++ introduces a novel video diffusion framework with learnable pose alignment and identity-preserving modules to generate high-quality human animations while maintaining identity consistency across varying poses and body sizes.
Authors:Qiaoyu Tang, Hao Xiang, Le Yu, Bowen Yu, Hongyu Lin, Yaojie Lu, Xianpei Han, Le Sun, Junyang Lin
Abstract:
With the rapid advancement of Large Language Models (LLMs), developing effective critic modules for precise guidance has become crucial yet challenging. In this paper, we initially demonstrate that supervised fine-tuning for building critic modules (which is widely adopted in current solutions) fails to genuinely enhance models' critique abilities, producing superficial critiques with insufficient reflections and verifications. To unlock the unprecedented critique capabilities, we propose RefCritic, a long-chain-of-thought critic module based on reinforcement learning with dual rule-based rewards: (1) instance-level correctness of solution judgments and (2) refinement accuracies of the policy model based on critiques, aiming to generate high-quality evaluations with actionable feedback that effectively guides model refinement. We evaluate RefCritic on Qwen2.5-14B-Instruct and DeepSeek-R1-Distill-Qwen-14B across five benchmarks. On critique and refinement settings, RefCritic demonstrates consistent advantages across all benchmarks, e.g., 6.8\% and 7.2\% gains on AIME25 for the respective base models. Notably, under majority voting, policy models filtered by RefCritic show superior scaling with increased voting numbers. Moreover, despite training on solution-level supervision, RefCritic outperforms step-level supervised approaches on ProcessBench, a benchmark to identify erroneous steps in mathematical reasoning.
随着大语言模型的快速发展,构建有效的批判模块变得至关重要,本文提出RefCritic,一种基于强化学习的双奖励机制方法,在多个基准测试中显著提升了批判质量和模型优化效果。
The rapid progress of Large Language Models necessitates effective critic modules, and this paper introduces RefCritic, a reinforcement learning-based approach with dual rewards that significantly enhances critique quality and model refinement across multiple benchmarks.
Authors:Jiaao Li, Kaiyuan Li, Chen Gao, Yong Li, Xinlei Chen
Abstract:
Egomotion videos are first-person recordings where the view changes continuously due to the agent's movement. As they serve as the primary visual input for embodied AI agents, making egomotion video reasoning more efficient is therefore essential for real-world deployment. Recent advances in vision-language models have enabled strong multimodal reasoning capabilities, but their computational cost remains prohibitive for long, redundant video inputs. Existing token pruning methods, typically designed for third-person videos, fail to leverage the spatiotemporal continuity and motion constraints inherent in egomotion settings. To address this, we propose EgoPrune, a training-free token pruning method tailored for egomotion video reasoning. EgoPrune comprises three components: a keyframe selector adapted from EmbodiedR for temporally efficient sampling; Perspective-Aware Redundancy Filtering (PARF), which aligns visual tokens using perspective transformations and removes redundant tokens; and a Maximal Marginal Relevance (MMR)-based token selector that jointly considers visual-text relevance and intra-frame diversity. Experiments on two egomotion video benchmarks show that EgoPrune consistently outperforms prior training-free methods across various pruning ratios while significantly reducing FLOPs, memory usage, and latency. Moreover, we deploy EgoPrune on an embodied agent equipped with a Jetson Orin NX 16GB edge device, demonstrating its real-world efficiency and suitability for on-device egomotion video reasoning.
中文: EgoPrune是一种专为自运动视频推理设计的免训练令牌剪枝方法,通过关键帧选择、视角感知过滤和多样性令牌选择,在显著降低计算资源消耗的同时保持性能,适用于边缘设备的实际部署。
English: EgoPrune is a training-free token pruning method designed for efficient egomotion video reasoning, incorporating keyframe selection, perspective-aware filtering, and diversity-based token selection to significantly reduce computational costs while maintaining performance across benchmarks and edge device deployments.
Authors:Baining Zhao, Rongze Tang, Mingyuan Jia, Ziyou Wang, Fanghang Man, Xin Zhang, Yu Shang, Weichen Zhang, Chen Gao, Wei Wu, Xin Wang, Xinlei Chen, Yong Li
Abstract:
How to enable robots to predict the outcomes of their own motion intentions in three-dimensional space has been a fundamental problem in embodied intelligence. To explore more general spatial imagination capabilities, here we present AirScape, the first world model designed for six-degree-of-freedom aerial agents. AirScape predicts future observation sequences based on current visual inputs and motion intentions. Specifically, we construct an dataset for aerial world model training and testing, which consists of 11k video-intention pairs. This dataset includes first-person-view videos capturing diverse drone actions across a wide range of scenarios, with over 1,000 hours spent annotating the corresponding motion intentions. Then we develop a two-phase training schedule to train a foundation model -- initially devoid of embodied spatial knowledge -- into a world model that is controllable by motion intentions and adheres to physical spatio-temporal constraints.
Chinese: AirScape是首个为六自由度空中智能体设计的世界模型,通过包含11,000个视频-意图对的数据集和两阶段训练,使其能够根据视觉输入和运动意图预测未来观测序列。
English: AirScape is the first world model for six-degree-of-freedom aerial agents, enabling them to predict future observations from visual inputs and motion intentions through a two-phase training process using a dataset of 11k video-intention pairs.
Authors:Liang Wang, Yu Rong, Tingyang Xu, Zhenyi Zhong, Zhiyuan Liu, Pengju Wang, Deli Zhao, Qiang Liu, Shu Wu, Liang Wang
Abstract:
Molecular structure elucidation from spectra is a foundational problem in chemistry, with profound implications for compound identification, synthesis, and drug development. Traditional methods rely heavily on expert interpretation and lack scalability. Pioneering machine learning methods have introduced retrieval-based strategies, but their reliance on finite libraries limits generalization to novel molecules. Generative models offer a promising alternative, yet most adopt autoregressive SMILES-based architectures that overlook 3D geometry and struggle to integrate diverse spectral modalities. In this work, we present DiffSpectra, a generative framework that directly infers both 2D and 3D molecular structures from multi-modal spectral data using diffusion models. DiffSpectra formulates structure elucidation as a conditional generation process. Its denoising network is parameterized by Diffusion Molecule Transformer, an SE(3)-equivariant architecture that integrates topological and geometric information. Conditioning is provided by SpecFormer, a transformer-based spectral encoder that captures intra- and inter-spectral dependencies from multi-modal spectra. Extensive experiments demonstrate that DiffSpectra achieves high accuracy in structure elucidation, recovering exact structures with 16.01% top-1 accuracy and 96.86% top-20 accuracy through sampling. The model benefits significantly from 3D geometric modeling, SpecFormer pre-training, and multi-modal conditioning. These results highlight the effectiveness of spectrum-conditioned diffusion modeling in addressing the challenge of molecular structure elucidation. To our knowledge, DiffSpectra is the first framework to unify multi-modal spectral reasoning and joint 2D/3D generative modeling for de novo molecular structure elucidation.
中文:DiffSpectra是一种生成框架,利用扩散模型直接从多模态光谱数据中推断分子的二维和三维结构,通过其SE(3)等变架构和光谱Transformer编码器实现了高精度结构解析。
English: DiffSpectra is a generative framework using diffusion models to directly infer both 2D and 3D molecular structures from multi-modal spectral data, achieving high accuracy through its SE(3)-equivariant architecture and spectral transformer encoder.
Authors:Yuxuan Cai, Jiangning Zhang, Zhenye Gan, Qingdong He, Xiaobin Hu, Junwei Zhu, Yabiao Wang, Chengjie Wang, Zhucun Xue, Xinwei He, Xiang Bai
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks involving both images and videos. However, their capacity to comprehend human-centric video data remains underexplored, primarily due to the absence of comprehensive and high-quality evaluation benchmarks. Existing human-centric benchmarks predominantly emphasize video generation quality and action recognition, while overlooking essential perceptual and cognitive abilities required in human-centered scenarios. Furthermore, they are often limited by single-question paradigms and overly simplistic evaluation metrics. To address above limitations, we propose a modern HV-MMBench, a rigorously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric video understanding. Compared to existing human-centric video benchmarks, our work offers the following key features: (1) Diverse evaluation dimensions: HV-MMBench encompasses 15 tasks, ranging from basic attribute perception (e.g., age estimation, emotion recognition) to advanced cognitive reasoning (e.g., social relationship prediction, intention prediction), enabling comprehensive assessment of model capabilities; (2) Varied data types: The benchmark includes multiple-choice, fill-in-blank, true/false, and open-ended question formats, combined with diverse evaluation metrics, to more accurately and robustly reflect model performance; (3) Multi-domain video coverage: The benchmark spans 50 distinct visual scenarios, enabling comprehensive evaluation across fine-grained scene variations; (4) Temporal coverage: The benchmark covers videos from short-term (10 seconds) to long-term (up to 30min) durations, supporting systematic analysis of models temporal reasoning abilities across diverse contextual lengths.
中文: 该摘要提出HV-MMBench基准,通过涵盖多样化任务、问题形式和时间跨度,全面评估多模态大语言模型在人类中心视频理解中的能力,弥补现有评估体系的不足。
English: This abstract introduces HV-MMBench, a comprehensive benchmark designed to address the limitations of existing human-centric video understanding evaluations by incorporating diverse tasks, question formats, and temporal ranges to holistically assess Multimodal Large Language Models' capabilities.
Authors:Yuxuan Cai, Jiangning Zhang, Zhenye Gan, Qingdong He, Xiaobin Hu, Junwei Zhu, Yabiao Wang, Chengjie Wang, Zhucun Xue, Chaoyou Fu, Xinwei He, Xiang Bai
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks involving both images and videos. However, their capacity to comprehend human-centric video data remains underexplored, primarily due to the absence of comprehensive and high-quality evaluation benchmarks. Existing human-centric benchmarks predominantly emphasize video generation quality and action recognition, while overlooking essential perceptual and cognitive abilities required in human-centered scenarios. Furthermore, they are often limited by single-question paradigms and overly simplistic evaluation metrics. To address above limitations, we propose a modern HV-MMBench, a rigorously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric video understanding. Compared to existing human-centric video benchmarks, our work offers the following key features: (1) Diverse evaluation dimensions: HV-MMBench encompasses 13 tasks, ranging from basic attribute perception (e.g., age estimation, emotion recognition) to advanced cognitive reasoning (e.g., social relationship prediction, intention prediction), enabling comprehensive assessment of model capabilities; (2) Varied data types: The benchmark includes multiple-choice, fill-in-blank, true/false, and open-ended question formats, combined with diverse evaluation metrics, to more accurately and robustly reflect model performance; (3) Multi-domain video coverage: The benchmark spans 50 distinct visual scenarios, enabling comprehensive evaluation across fine-grained scene variations; (4) Temporal coverage: The benchmark covers videos from short-term (10 seconds) to long-term (up to 30min) durations, supporting systematic analysis of models temporal reasoning abilities across diverse contextual lengths.
中文: 该摘要提出HV-MMBench基准,通过涵盖多样化任务、问题形式和时间跨度,全面评估多模态大语言模型在人类中心视频理解中的能力,弥补现有评估体系的不足。
English: This abstract introduces HV-MMBench, a comprehensive benchmark designed to address the limitations of existing human-centric video understanding evaluations by incorporating diverse tasks, question formats, and temporal ranges to holistically assess Multimodal Large Language Models' capabilities.
Authors:Yuji Wang, Moran Li, Xiaobin Hu, Ran Yi, Jiangning Zhang, Han Feng, Weijian Cao, Yabiao Wang, Chengjie Wang, Lizhuang Ma
Abstract:
Identity-preserving text-to-video (IPT2V) generation, which aims to create high-fidelity videos with consistent human identity, has become crucial for downstream applications. However, current end-to-end frameworks suffer a critical spatial-temporal trade-off: optimizing for spatially coherent layouts of key elements (e.g., character identity preservation) often compromises instruction-compliant temporal smoothness, while prioritizing dynamic realism risks disrupting the spatial coherence of visual structures. To tackle this issue, we propose a simple yet effective spatial-temporal decoupled framework that decomposes representations into spatial features for layouts and temporal features for motion dynamics. Specifically, our paper proposes a semantic prompt optimization mechanism and stage-wise decoupled generation paradigm. The former module decouples the prompt into spatial and temporal components. Aligned with the subsequent stage-wise decoupled approach, the spatial prompts guide the text-to-image (T2I) stage to generate coherent spatial features, while the temporal prompts direct the sequential image-to-video (I2V) stage to ensure motion consistency. Experimental results validate that our approach achieves excellent spatiotemporal consistency, demonstrating outstanding performance in identity preservation, text relevance, and video quality. By leveraging this simple yet robust mechanism, our algorithm secures the runner-up position in 2025 ACM MultiMedia Challenge.
中文摘要:该研究提出的时空解耦框架通过将语义提示分解为空间和时间组件,分别指导文本到图像和图像到视频生成阶段,有效解决了身份保持视频生成中空间一致性与时序流畅性之间的权衡问题,在身份保持和视频质量方面表现出色。
English Summary: The proposed spatial-temporal decoupled framework overcomes the trade-off between spatial coherence and temporal smoothness in identity-preserving video generation by separating spatial and temporal prompts to guide text-to-image and image-to-video stages respectively, achieving superior performance in identity preservation and video quality.
Authors:Qingdong He, Xueqin Chen, Chaoyi Wang, Yanjie Pan, Xiaobin Hu, Zhenye Gan, Yabiao Wang, Chengjie Wang, Xiangtai Li, Jiangning Zhang
Abstract:
Instruction-based image editing (IIE) has advanced rapidly with the success of diffusion models. However, existing efforts primarily focus on simple and explicit instructions to execute editing operations such as adding, deleting, moving, or swapping objects. They struggle to handle more complex implicit hypothetical instructions that require deeper reasoning to infer plausible visual changes and user intent. Additionally, current datasets provide limited support for training and evaluating reasoning-aware editing capabilities. Architecturally, these methods also lack mechanisms for fine-grained detail extraction that support such reasoning. To address these limitations, we propose Reason50K, a large-scale dataset specifically curated for training and evaluating hypothetical instruction reasoning image editing, along with ReasonBrain, a novel framework designed to reason over and execute implicit hypothetical instructions across diverse scenarios. Reason50K includes over 50K samples spanning four key reasoning scenarios: Physical, Temporal, Causal, and Story reasoning. ReasonBrain leverages Multimodal Large Language Models (MLLMs) for editing guidance generation and a diffusion model for image synthesis, incorporating a Fine-grained Reasoning Cue Extraction (FRCE) module to capture detailed visual and textual semantics essential for supporting instruction reasoning. To mitigate the semantic loss, we further introduce a Cross-Modal Enhancer (CME) that enables rich interactions between the fine-grained cues and MLLM-derived features. Extensive experiments demonstrate that ReasonBrain consistently outperforms state-of-the-art baselines on reasoning scenarios while exhibiting strong zero-shot generalization to conventional IIE tasks. Our dataset and code will be released publicly.
中文: 本文提出了用于假设性指令推理图像编辑的大规模数据集Reason50K和创新框架ReasonBrain,该框架通过多模态大语言模型和细粒度推理模块有效处理复杂隐含指令,在多种推理场景中均优于现有方法。
English: This paper introduces Reason50K, a large-scale dataset for hypothetical instruction reasoning in image editing, and ReasonBrain, a novel framework that uses multimodal large language models and fine-grained reasoning modules to effectively handle complex implicit instructions, outperforming existing methods across diverse reasoning scenarios.
Authors:Qingdong He, Xueqin Chen, Chaoyi Wang, Yanjie Pan, Xiaobin Hu, Zhenye Gan, Yabiao Wang, Chengjie Wang, Xiangtai Li, Jiangning Zhang
Abstract:
Instruction-based image editing (IIE) has advanced rapidly with the success of diffusion models. However, existing efforts primarily focus on simple and explicit instructions to execute editing operations such as adding, deleting, moving, or swapping objects. They struggle to handle more complex implicit hypothetical instructions that require deeper reasoning to infer plausible visual changes and user intent. Additionally, current datasets provide limited support for training and evaluating reasoning-aware editing capabilities. Architecturally, these methods also lack mechanisms for fine-grained detail extraction that support such reasoning. To address these limitations, we propose Reason50K, a large-scale dataset specifically curated for training and evaluating hypothetical instruction reasoning image editing, along with ReasonBrain, a novel framework designed to reason over and execute implicit hypothetical instructions across diverse scenarios. Reason50K includes over 50K samples spanning four key reasoning scenarios: Physical, Temporal, Causal, and Story reasoning. ReasonBrain leverages Multimodal Large Language Models (MLLMs) for editing guidance generation and a diffusion model for image synthesis, incorporating a Fine-grained Reasoning Cue Extraction (FRCE) module to capture detailed visual and textual semantics essential for supporting instruction reasoning. To mitigate the semantic loss, we further introduce a Cross-Modal Enhancer (CME) that enables rich interactions between the fine-grained cues and MLLM-derived features. Extensive experiments demonstrate that ReasonBrain consistently outperforms state-of-the-art baselines on reasoning scenarios while exhibiting strong zero-shot generalization to conventional IIE tasks. Our dataset and code will be released publicly.
中文: 本文提出了用于假设性指令推理图像编辑的大规模数据集Reason50K和创新框架ReasonBrain,该框架通过多模态大语言模型和细粒度推理模块有效处理复杂隐含指令,在多种推理场景中均优于现有方法。
English: This paper introduces Reason50K, a large-scale dataset for hypothetical instruction reasoning in image editing, and ReasonBrain, a novel framework that uses multimodal large language models and fine-grained reasoning modules to effectively handle complex implicit instructions, outperforming existing methods across diverse reasoning scenarios.
Authors:Anna-Maria Halacheva, Jan-Nico Zaech, Sombit Dey, Luc Van Gool, Danda Pani Paudel
Abstract:
Real-world 3D scene-level scans offer realism and can enable better real-world generalizability for downstream applications. However, challenges such as data volume, diverse annotation formats, and tool compatibility limit their use. This paper demonstrates a methodology to effectively leverage these scans and their annotations. We propose a unified annotation integration using USD, with application-specific USD flavors. We identify challenges in utilizing holistic real-world scan datasets and present mitigation strategies. The efficacy of our approach is demonstrated through two downstream applications: LLM-based scene editing, enabling effective LLM understanding and adaptation of the data (80% success), and robotic simulation, achieving an 87% success rate in policy learning.
中文: 本文提出了一种基于USD的统一方法,有效整合真实世界3D扫描的标注数据,解决了数据和工具兼容性难题,并在LLM场景编辑(80%成功率)和机器人策略学习(87%成功率)中验证了其效能。
English: This paper introduces a unified USD-based method to integrate annotations from real-world 3D scans, overcoming data and tool limitations, and demonstrates its effectiveness with 80% success in LLM-based scene editing and 87% in robotic policy learning.
Authors:Tianyun Zhong, Guozhao Mo, Yanjiang Liu, Yihan Chen, Lingdi Kong, Xuanang Chen, Yaojie Lu, Hongyu Lin, Ben He, Le Sun
Abstract:
With the emergence of large language models (LLMs), there is an expectation that LLMs can effectively extract explicit information from complex real-world documents (e.g., papers, reports). However, most LLMs generate paragraph-style answers that are chaotic, disorganized, and untraceable. To bridge this gap, we introduce the Arranged and Organized Extraction Benchmark (AOE), a new bilingual benchmark with data and documents of varying lengths designed to systematically evaluate the ability of LLMs to comprehend fragmented documents and reconstruct isolated information into one organized table. Unlike conventional text-to-table tasks, which rely on fixed schema and narrow task domains, AOE includes 11 carefully crafted tasks across three diverse domains, requiring models to generate context-specific schema tailored to varied input queries. In the experiment, we evaluated both open-source and closed-source state-of-the-art LLMs. The results show that even the most advanced models struggled significantly. The benchmark is available at https://huggingface.co/datasets/tianyumyum/AOE.
中文:AOE基准测试旨在系统评估大语言模型从复杂文档中提取碎片化信息并组织成结构化表格的能力,实验表明即使最先进的模型在此任务上也面临显著困难。
English: The AOE benchmark is introduced to systematically evaluate large language models' ability to extract and organize fragmented information from complex documents into structured tables, revealing that even advanced models struggle with this task.
Authors:Anna-Maria Halacheva, Jan-Nico Zaech, Xi Wang, Danda Pani Paudel, Luc Van Gool
Abstract:
As multimodal language models advance, their application to 3D scene understanding is a fast-growing frontier, driving the development of 3D Vision-Language Models (VLMs). Current methods show strong dependence on object detectors, introducing processing bottlenecks and limitations in taxonomic flexibility. To address these limitations, we propose a scene-centric 3D VLM for 3D Gaussian splat scenes that employs language- and task-aware scene representations. Our approach directly embeds rich linguistic features into the 3D scene representation by associating language with each Gaussian primitive, achieving early modality alignment. To process the resulting dense representations, we introduce a dual sparsifier that distills them into compact, task-relevant tokens via task-guided and location-guided pathways, producing sparse, task-aware global and local scene tokens. Notably, we present the first Gaussian splatting-based VLM, leveraging photorealistic 3D representations derived from standard RGB images, demonstrating strong generalization: it improves performance of prior 3D VLM five folds, in out-of-the-domain settings.
中文: 本研究提出了一种以场景为中心的三维视觉语言模型,通过将语言特征直接嵌入三维高斯溅射表示,并采用双重稀疏化器生成紧凑的任务感知标记,在泛化能力上实现了比现有方法五倍的性能提升。
English: This study introduces a scene-centric 3D Vision-Language Model that embeds linguistic features directly into 3D Gaussian splatting representations, using a dual sparsifier to create compact task-aware tokens, achieving fivefold performance improvement over prior methods in generalization.
Authors:Yuying Ge, Yixiao Ge, Chen Li, Teng Wang, Junfu Pu, Yizhuo Li, Lu Qiu, Jin Ma, Lisheng Duan, Xinyu Zuo, Jinwen Luo, Weibo Gu, Zexuan Li, Xiaojing Zhang, Yangyu Tao, Han Hu, Di Wang, Ying Shan
Abstract:
Real-world user-generated short videos, especially those distributed on platforms such as WeChat Channel and TikTok, dominate the mobile internet. However, current large multimodal models lack essential temporally-structured, detailed, and in-depth video comprehension capabilities, which are the cornerstone of effective video search and recommendation, as well as emerging video applications. Understanding real-world shorts is actually challenging due to their complex visual elements, high information density in both visuals and audio, and fast pacing that focuses on emotional expression and viewpoint delivery. This requires advanced reasoning to effectively integrate multimodal information, including visual, audio, and text. In this work, we introduce ARC-Hunyuan-Video, a multimodal model that processes visual, audio, and textual signals from raw video inputs end-to-end for structured comprehension. The model is capable of multi-granularity timestamped video captioning and summarization, open-ended video question answering, temporal video grounding, and video reasoning. Leveraging high-quality data from an automated annotation pipeline, our compact 7B-parameter model is trained through a comprehensive regimen: pre-training, instruction fine-tuning, cold start, reinforcement learning (RL) post-training, and final instruction fine-tuning. Quantitative evaluations on our introduced benchmark ShortVid-Bench and qualitative comparisons demonstrate its strong performance in real-world video comprehension, and it supports zero-shot or fine-tuning with a few samples for diverse downstream applications. The real-world production deployment of our model has yielded tangible and measurable improvements in user engagement and satisfaction, a success supported by its remarkable efficiency, with stress tests indicating an inference time of just 10 seconds for a one-minute video on H20 GPU.
中文摘要:本文提出ARC-Hunyuan-Video多模态模型,通过端到端处理视频的视觉、音频和文本信号,实现了对短视频的深度结构化理解,在真实场景应用中表现出色且具备高效推理能力。
English Summary: This paper introduces ARC-Hunyuan-Video, a 7B-parameter multimodal model that processes visual, audio, and textual inputs end-to-end to achieve comprehensive video understanding, demonstrating strong performance in real-world applications with efficient inference capabilities.
Authors:Wei Yuan, Chaoqun Yang, Yu Xing, Tong Chen, Nguyen Quoc Viet Hung, Hongzhi Yin
Abstract:
Large Language Model-based Time Series Forecasting (LLMTS) has shown remarkable promise in handling complex and diverse temporal data, representing a significant step toward foundation models for time series analysis. However, this emerging paradigm introduces two critical challenges. First, the substantial commercial potential and resource-intensive development raise urgent concerns about intellectual property (IP) protection. Second, their powerful time series forecasting capabilities may be misused to produce misleading or fabricated deepfake time series data. To address these concerns, we explore watermarking the outputs of LLMTS models, that is, embedding imperceptible signals into the generated time series data that remain detectable by specialized algorithms. We propose a novel post-hoc watermarking framework, Waltz, which is broadly compatible with existing LLMTS models. Waltz is inspired by the empirical observation that time series patch embeddings are rarely aligned with a specific set of LLM tokens, which we term ``cold tokens''. Leveraging this insight, Waltz embeds watermarks by rewiring the similarity statistics between patch embeddings and cold token embeddings, and detects watermarks using similarity z-scores. To minimize potential side effects, we introduce a similarity-based embedding position identification strategy and employ projected gradient descent to constrain the watermark noise within a defined boundary. Extensive experiments using two popular LLMTS models across seven benchmark datasets demonstrate that Waltz achieves high watermark detection accuracy with minimal impact on the quality of the generated time series.
中文摘要:基于大语言模型的时间序列预测面临知识产权保护和数据滥用的双重挑战,Waltz框架通过利用冷令牌相似性重布线嵌入隐形水印,有效解决这些问题。
English Summary: LLM-based time series forecasting faces IP protection and misuse risks, which the proposed Waltz framework addresses by embedding imperceptible watermarks through cold token similarity manipulation.
Authors:Kechi Zhang, Ge Li, Jia Li, Huangzhao Zhang, Yihong Dong, Jia Li, Jingjing Xu, Zhi Jin
Abstract:
The Transformer architecture has emerged as a landmark advancement within the broad field of artificial intelligence, effectively catalyzing the advent of large language models (LLMs). However, despite its remarkable capabilities and the substantial progress it has facilitated, the Transformer architecture still has some limitations. One such intrinsic limitation is its inability to effectively capture the Chomsky hierarchy, such as regular expressions or deterministic context-free grammars. Drawing inspiration from pushdown automata, which efficiently resolve deterministic context-free grammars using stacks, we propose StackTrans to address the aforementioned issue within LLMs. Unlike previous approaches that modify the attention computation, StackTrans explicitly incorporates hidden state stacks between Transformer layers. This design maintains compatibility with existing frameworks like flash-attention. Specifically, our design features stack operations -- such as pushing and popping hidden states -- that are differentiable and can be learned in an end-to-end manner. Our comprehensive evaluation spans benchmarks for both Chomsky hierarchies and large-scale natural languages. Across these diverse tasks, StackTrans consistently outperforms standard Transformer models and other baselines. We have successfully scaled StackTrans up from 360M to 7B parameters. In particular, our from-scratch pretrained model StackTrans-360M outperforms several larger open-source LLMs with 2-3x more parameters, showcasing its superior efficiency and reasoning capability.
中文:提出的StackTrans模型通过在Transformer层间引入可微栈操作,增强了处理乔姆斯基层级任务的能力,同时保持与现有框架的兼容性,并在各类基准测试中展现出卓越性能。
English: The proposed StackTrans model enhances Transformer architecture by integrating differentiable stack operations between layers, enabling more effective handling of Chomsky hierarchy tasks while maintaining compatibility with existing frameworks and demonstrating superior performance across various benchmarks.
Authors:Luoxin Chen, Jinming Gu, Liankai Huang, Wenhao Huang, Zhicheng Jiang, Allan Jie, Xiaoran Jin, Xing Jin, Chenggang Li, Kaijing Ma, Cheng Ren, Jiawei Shen, Wenlei Shi, Tong Sun, He Sun, Jiahui Wang, Siran Wang, Zhihong Wang, Chenrui Wei, Shufa Wei, Yonghui Wu, Yuchen Wu, Yihang Xia, Huajian Xin, Fan Yang, Huaiyuan Ying, Hongyi Yuan, Zheng Yuan, Tianyang Zhan, Chi Zhang, Yue Zhang, Ge Zhang, Tianyun Zhao, Jianqiu Zhao, Yichi Zhou, Thomas Hanwen Zhu
Abstract:
LLMs have demonstrated strong mathematical reasoning abilities by leveraging reinforcement learning with long chain-of-thought, yet they continue to struggle with theorem proving due to the lack of clear supervision signals when solely using natural language. Dedicated domain-specific languages like Lean provide clear supervision via formal verification of proofs, enabling effective training through reinforcement learning. In this work, we propose \textbf{Seed-Prover}, a lemma-style whole-proof reasoning model. Seed-Prover can iteratively refine its proof based on Lean feedback, proved lemmas, and self-summarization. To solve IMO-level contest problems, we design three test-time inference strategies that enable both deep and broad reasoning. Seed-Prover proves $78.1\%$ of formalized past IMO problems, saturates MiniF2F, and achieves over 50\% on PutnamBench, outperforming the previous state-of-the-art by a large margin. To address the lack of geometry support in Lean, we introduce a geometry reasoning engine \textbf{Seed-Geometry}, which outperforms previous formal geometry engines. We use these two systems to participate in IMO 2025 and fully prove 5 out of 6 problems. This work represents a significant advancement in automated mathematical reasoning, demonstrating the effectiveness of formal verification with long chain-of-thought reasoning.
中文:Seed-Prover作为引理式推理模型,通过利用Lean反馈迭代优化证明,在IMO和Putnam问题上大幅超越现有方法,同时Seed-Geometry弥补了Lean的几何推理短板,在形式化数学竞赛中取得卓越成果。
English: Seed-Prover, a lemma-style reasoning model, significantly advances automated theorem proving by iteratively refining proofs using Lean feedback and outperforms previous methods on IMO and Putnam problems, while Seed-Geometry addresses Lean's geometry limitations to achieve high success in formal competitions.
Authors:Wei-You Liao, Yuxuan Du, Xinbiao Wang, Tian-Ci Tian, Yong Luo, Bo Du, Dacheng Tao, He-Liang Huang
Abstract:
The ongoing development of quantum processors is driving breakthroughs in scientific discovery. Despite this progress, the formidable cost of fabricating large-scale quantum processors means they will remain rare for the foreseeable future, limiting their widespread application. To address this bottleneck, we introduce the concept of predictive surrogates, which are classical learning models designed to emulate the mean-value behavior of a given quantum processor with provably computational efficiency. In particular, we propose two predictive surrogates that can substantially reduce the need for quantum processor access in diverse practical scenarios. To demonstrate their potential in advancing digital quantum simulation, we use these surrogates to emulate a quantum processor with up to 20 programmable superconducting qubits, enabling efficient pre-training of variational quantum eigensolvers for families of transverse-field Ising models and identification of non-equilibrium Floquet symmetry-protected topological phases. Experimental results reveal that the predictive surrogates not only reduce measurement overhead by orders of magnitude, but can also surpass the performance of conventional, quantum-resource-intensive approaches. Collectively, these findings establish predictive surrogates as a practical pathway to broadening the impact of advanced quantum processors.
中文: 预测替代模型作为高效的经典模型被提出,可模拟量子处理器行为,在量子模拟等应用中大幅减少量子资源需求并超越传统方法性能。
English: Predictive surrogates are introduced as efficient classical models that emulate quantum processors, significantly reducing quantum resource needs while outperforming conventional methods in applications like quantum simulation.
Authors:Yanghao Su, Jie Zhang, Yiming Li, Tianwei Zhang, Qing Guo, Weiming Zhang, Nenghai Yu, Nils Lukas, Wenbo Zhou
Abstract:
Backdoor unlearning aims to remove backdoor-related information while preserving the model's original functionality. However, existing unlearning methods mainly focus on recovering trigger patterns but fail to restore the correct semantic labels of poison samples. This limitation prevents them from fully eliminating the false correlation between the trigger pattern and the target label. To address this, we leverage boundary adversarial attack techniques, revealing two key observations. First, poison samples exhibit significantly greater distances from decision boundaries compared to clean samples, indicating they require larger adversarial perturbations to change their predictions. Second, while adversarial predicted labels for clean samples are uniformly distributed, those for poison samples tend to revert to their original correct labels. Moreover, the features of poison samples restore to closely resemble those of corresponding clean samples after adding adversarial perturbations. Building upon these insights, we propose Backdoor Unlearning via adversaRial bouNdary analysis (BURN), a novel defense framework that integrates false correlation decoupling, progressive data refinement, and model purification. In the first phase, BURN employs adversarial boundary analysis to detect poisoned samples based on their abnormal adversarial boundary distances, then restores their correct semantic labels for fine-tuning. In the second phase, it employs a feedback mechanism that tracks prediction discrepancies between the original backdoored model and progressively sanitized models, guiding both dataset refinement and model purification. Extensive evaluations across multiple datasets, architectures, and seven diverse backdoor attack types confirm that BURN effectively removes backdoor threats while maintaining the model's original performance.
中文: BURN是一种新颖的后门遗忘框架,通过对抗性边界分析检测并修正中毒样本,在保持模型原有性能的同时有效消除后门威胁。
English: BURN is a novel backdoor unlearning framework that uses adversarial boundary analysis to detect and correct poisoned samples, effectively removing backdoor threats while preserving the model's original performance.
Authors:Tianyu Zheng, Tianshun Xing, Qingshui Gu, Taoran Liang, Xingwei Qu, Xin Zhou, Yizhi Li, Zhoufutu Wen, Chenghua Lin, Wenhao Huang, Qian Liu, Ge Zhang, Zejun Ma
Abstract:
Reinforcement Learning from Verifiable Rewards (RLVR) improves the reasoning abilities of Large Language Models (LLMs) but it struggles with unstable exploration. We propose FR3E (First Return, Entropy-Eliciting Explore), a structured exploration framework that identifies high-uncertainty decision points in reasoning trajectories and performs targeted rollouts to construct semantically grounded intermediate feedback. Our method provides targeted guidance without relying on dense supervision. Empirical results on mathematical reasoning benchmarks(AIME24) show that FR3E promotes more stable training, produces longer and more coherent responses, and increases the proportion of fully correct trajectories. These results highlight the framework's effectiveness in improving LLM reasoning through more robust and structured exploration.
中文: FR3E提出了一种结构化探索框架,通过定位推理轨迹中的高不确定性决策点进行针对性探索,无需密集监督即可实现更稳定的训练并提升大语言模型的推理准确性。
English: FR3E introduces a structured exploration framework that targets high-uncertainty decision points in reasoning trajectories, enabling more stable training and improved reasoning accuracy in LLMs without dense supervision.
Authors:Dustin Wang, Rui-Jie Zhu, Steven Abreu, Yong Shan, Taylor Kergan, Yuqi Pan, Yuhong Chou, Zheng Li, Ge Zhang, Wenhao Huang, Jason Eshraghian
Abstract:
Transformers face quadratic complexity and memory issues with long sequences, prompting the adoption of linear attention mechanisms using fixed-size hidden states. However, linear models often suffer from limited recall performance, leading to hybrid architectures that combine linear and full attention layers. Despite extensive hybrid architecture research, the choice of linear attention component has not been deeply explored. We systematically evaluate various linear attention models across generations - vector recurrences to advanced gating mechanisms - both standalone and hybridized. To enable this comprehensive analysis, we trained and open-sourced 72 models: 36 at 340M parameters (20B tokens) and 36 at 1.3B parameters (100B tokens), covering six linear attention variants across five hybridization ratios. Benchmarking on standard language modeling and recall tasks reveals that superior standalone linear models do not necessarily excel in hybrids. While language modeling remains stable across linear-to-full attention ratios, recall significantly improves with increased full attention layers, particularly below a 3:1 ratio. Our study highlights selective gating, hierarchical recurrence, and controlled forgetting as critical for effective hybrid models. We recommend architectures such as HGRN-2 or GatedDeltaNet with a linear-to-full ratio between 3:1 and 6:1 to achieve Transformer-level recall efficiently. Our models are open-sourced at https://huggingface.co/collections/m-a-p/hybrid-linear-attention-research-686c488a63d609d2f20e2b1e.
Chinese: 通过系统评估72个开源模型发现,采用选择性门控机制且线性与完全注意力比例在3:1至6:1之间的混合注意力架构,能以更高效率实现与Transformer相当的召回性能。
English: Hybrid attention models combining linear and full attention layers can achieve Transformer-level recall efficiently when using selective gating mechanisms with linear-to-full ratios between 3:1 and 6:1, as demonstrated through systematic evaluation of 72 open-sourced models.
Authors:Zhongyuan Peng, Yifan Yao, Kaijing Ma, Shuyue Guo, Yizhe Li, Yichi Zhang, Chenchen Zhang, Yifan Zhang, Zhouliang Yu, Luming Li, Minghao Liu, Yihang Xia, Jiawei Shen, Yuchen Wu, Yixin Cao, Zhaoxiang Zhang, Wenhao Huang, Jiaheng Liu, Ge Zhang
Abstract:
Translating natural language mathematical statements into formal, executable code is a fundamental challenge in automated theorem proving. While prior work has focused on generation and compilation success, little attention has been paid to the critic phase-the evaluation of whether generated formalizations truly capture the semantic intent of the original problem. In this paper, we introduce CriticLean, a novel critic-guided reinforcement learning framework that elevates the role of the critic from a passive validator to an active learning component. Specifically, first, we propose the CriticLeanGPT, trained via supervised fine-tuning and reinforcement learning, to rigorously assess the semantic fidelity of Lean 4 formalizations. Then, we introduce CriticLeanBench, a benchmark designed to measure models' ability to distinguish semantically correct from incorrect formalizations, and demonstrate that our trained CriticLeanGPT models can significantly outperform strong open- and closed-source baselines. Building on the CriticLean framework, we construct FineLeanCorpus, a dataset comprising over 285K problems that exhibits rich domain diversity, broad difficulty coverage, and high correctness based on human evaluation. Overall, our findings highlight that optimizing the critic phase is essential for producing reliable formalizations, and we hope our CriticLean will provide valuable insights for future advances in formal mathematical reasoning.
中文: 本文提出CriticLean框架,通过将批评环节从被动验证提升为主动学习组件,结合新型模型、基准测试和大规模数据集,显著提升了形式化数学代码的语义保真度评估,强调了批评阶段对生成可靠形式化结果的关键作用。
English: This paper introduces CriticLean, a critic-guided reinforcement learning framework that enhances the semantic fidelity evaluation of formal mathematical code, and demonstrates its effectiveness through novel models, benchmarks, and a high-quality dataset, emphasizing the critic phase's crucial role in reliable formalization.
Authors:Wei Fan, Kejiang Chen, Chang Liu, Weiming Zhang, Nenghai Yu
Abstract:
The rapid advancement of speech generation models has heightened privacy and security concerns related to voice cloning (VC). Recent studies have investigated disrupting unauthorized voice cloning by introducing adversarial perturbations. However, determined attackers can mitigate these protective perturbations and successfully execute VC. In this study, we conduct the first systematic evaluation of these protective perturbations against VC under realistic threat models that include perturbation purification. Our findings reveal that while existing purification methods can neutralize a considerable portion of the protective perturbations, they still lead to distortions in the feature space of VC models, which degrades the performance of VC. From this perspective, we propose a novel two-stage purification method: (1) Purify the perturbed speech; (2) Refine it using phoneme guidance to align it with the clean speech distribution. Experimental results demonstrate that our method outperforms state-of-the-art purification methods in disrupting VC defenses. Our study reveals the limitations of adversarial perturbation-based VC defenses and underscores the urgent need for more robust solutions to mitigate the security and privacy risks posed by VC. The code and audio samples are available at https://de-antifake.github.io.
中文: 本研究评估了对抗性扰动在防御语音克隆中的局限性,并提出一种两阶段净化方法,能有效破坏现有防御措施,强调了开发更强大安全解决方案的紧迫性。
English: This study evaluates the limitations of adversarial perturbations in defending against voice cloning and introduces a two-stage purification method that effectively disrupts these defenses, highlighting the need for more robust security solutions.
Authors:Li Li, Yingzhe Peng, Xu Yang, Ruoxi Cheng, Haiyang Xu, Ming Yan, Fei Huang
Abstract:
We propose a novel embedding-based captioning metric termed as L-CLIPScore that can be used for efficiently evaluating caption quality and training captioning model. L-CLIPScore is calculated from a lightweight CLIP (L-CLIP), which is a dual-encoder architecture compressed and distilled from CLIP. To compress, we apply two powerful techniques which are weight multiplexing and matrix decomposition for reducing the parameters of encoders and word embedding matrix, respectively. To distill, we design a novel multi-modal Similarity Regulator (SR) loss to transfer more vision-language alignment knowledge. Specifically, SR loss amplifies the multi-modal embedding similarity if the given image-text pair is matched and diminishes the similarity if the pair is non-matched. By compressing and distilling by this novel SR loss, our L-CLIP achieves comparable multi-modal alignment ability to the original CLIP while it requires fewer computation resources and running time. We carry out exhaustive experiments to validate the efficiency and effectiveness of L-CLIPScore when using it as the judge to evaluate caption quality. We also discover that when using L-CLIPScore as the supervisor to train the captioning model, it should be mixed up by an n-gram-based metric and meanwhile analyze why using L-CLIPScore only will cause fail training.
中文: 本文提出了一种轻量级的嵌入式评估指标L-CLIPScore,通过参数压缩和知识蒸馏技术,在保持多模态对齐能力的同时显著降低了计算资源需求,可用于图像描述质量的评估与模型训练。
English: The authors introduce L-CLIPScore, a lightweight embedding-based metric for evaluating and training image captioning models, which achieves comparable performance to CLIP through compression and distillation techniques while reducing computational demands.
Authors:Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, Heng Ji
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose PAPO, a novel policy gradient algorithm that encourages the model to learn to perceive while learning to reason. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term, which can be seamlessly plugged into mainstream RLVR algorithms such as GRPO and DAPO. Notably, PAPO does not rely on additional data curation, reward models, or stronger teacher models. To further enhance the training stability of PAPO, we introduce the Double Entropy Loss, which effectively regularizes the new KL objective without compromising performance. Despite its simplicity, PAPO yields significant overall improvements of 4.4%-17.5% on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%-19.1%, on tasks with high vision dependency. We also observe a substantial reduction of 30.5% in perception errors, indicating improved perceptual capabilities with PAPO. Overall, our work introduces a deeper integration of perception-aware supervision into core learning objectives and lays the groundwork for a new RL framework that encourages visually grounded reasoning. Code and data will be made publicly available for research purposes. Project page: https://mikewangwzhl.github.io/PAPO.
中文: 提出的PAPO算法通过引入隐式感知损失来强化多模态推理中的视觉感知能力,无需额外数据即可显著提升各类任务的性能表现。
English: The proposed PAPO algorithm enhances multimodal reasoning by integrating an Implicit Perception Loss to improve visual perception during reinforcement learning, achieving significant performance gains without extra resources.
Authors:Kuan Li, Zhongwang Zhang, Huifeng Yin, Liwen Zhang, Litu Ou, Jialong Wu, Wenbiao Yin, Baixuan Li, Zhengwei Tao, Xinyu Wang, Weizhou Shen, Junkai Zhang, Dingchu Zhang, Xixi Wu, Yong Jiang, Ming Yan, Pengjun Xie, Fei Huang, Jingren Zhou
Abstract:
Transcending human cognitive limitations represents a critical frontier in LLM training. Proprietary agentic systems like DeepResearch have demonstrated superhuman capabilities on extremely complex information-seeking benchmarks such as BrowseComp, a feat previously unattainable. We posit that their success hinges on a sophisticated reasoning pattern absent in open-source models: the ability to systematically reduce extreme uncertainty when navigating vast information landscapes. Based on this insight, we introduce WebSailor, a complete post-training methodology designed to instill this crucial capability. Our approach involves generating novel, high-uncertainty tasks through structured sampling and information obfuscation, RFT cold start, and an efficient agentic RL training algorithm, Duplicating Sampling Policy Optimization (DUPO). With this integrated pipeline, WebSailor significantly outperforms all opensource agents in complex information-seeking tasks, matching proprietary agents' performance and closing the capability gap.
中文: WebSailor是一种后训练方法,通过赋予开源模型在复杂信息搜索任务中系统性地降低极端不确定性的能力,使其性能媲美专有代理系统。
English: WebSailor is a post-training methodology that equips open-source models with the ability to systematically reduce extreme uncertainty in complex information-seeking tasks, enabling them to match the performance of proprietary agents.
Authors:Liangshun Wu, Wen Chen, Qingqing Wu, Xudong Bai, Kunlun Wang
Abstract:
Reconfigurable Intelligent Surfaces (RIS) promise transformative gains in wireless communications by enabling programmable control of the propagation environment through discrete phase configurations. In practical deployments, the control of RIS phase states is typically managed using finite codebooks, with configuration indices transmitted over low latency, yet imperfect, wireless feedback channels. Even rare feedback bit errors can lead to significant mismatches between intended and applied RIS states, degrading system performance. This paper addresses the challenge of robust RIS codebook index assignment by formulating it as a combinatorial optimization problem, equivalent to the Traveling Salesman Problem (TSP), where codewords are "cities" and edge weights reflect SNR degradation under codeword confusion. A novel three-phase heuristic algorithm is proposed to solve this, consisting of a provision phase, a shotgun phase, and a fuzzy concatenation phase. Simulation results show that the method outperforms conventional indexing strategies and achieves near-optimal robustness to index errors, while also being scalable and hardwareagnostic for real time deployment. Future work includes multiple bits error correction and online adaptive mapping for time varying channels.
中文摘要:本文通过将可重构智能表面码本索引分配问题建模为旅行商问题,提出了一种新颖的三阶段启发式算法,有效解决了反馈错误导致的性能下降问题,实现了接近最优的鲁棒性和实际部署的扩展性。
English Summary: This paper tackles the challenge of robust codebook index assignment for Reconfigurable Intelligent Surfaces (RIS) by formulating it as a Traveling Salesman Problem and proposing a novel three-phase heuristic algorithm that demonstrates superior performance and near-optimal robustness against feedback errors.
Authors:Yinglong Zou, Juan Zhai, Chunrong Fang, Yanzhou Mu, Jiawei Liu, Zhenyu Chen
Abstract:
Deep learning frameworks serve as the foundation for developing and deploying deep learning applications. To enhance the quality of deep learning frameworks, researchers have proposed numerous testing methods using deep learning models as test inputs. However, existing methods predominantly measure model bug detection effectiveness as heuristic indicators, presenting three critical limitations: Firstly, existing methods fail to quantitatively measure model's operator combination variety, potentially missing critical operator combinations that could trigger framework bugs. Secondly, existing methods neglect measuring model execution time, resulting in the omission of numerous models potential for detecting more framework bugs within limited testing time. Thirdly, existing methods overlook correlation between different model measurements, relying simply on single-indicator heuristic guidance without considering their trade-offs. To overcome these limitations, we propose DLMMM, the first deep learning framework testing method to include multiple model measurements into heuristic guidance and fuse these measurements to achieve their trade-off. DLMMM firstly quantitatively measures model's bug detection performance, operator combination variety, and model execution time. After that, DLMMM fuses the above measurements based on their correlation to achieve their trade-off. To further enhance testing effectiveness, DLMMM designs multi-level heuristic guidance for test input model generation.
Chinese Summary: DLMMM是一种创新的深度学习框架测试方法,通过融合模型错误检测性能、算子组合多样性和执行时间等多维度测量,并设计多层次启发式指导,有效解决了现有方法依赖单一指标而忽略权衡的缺陷。
English Summary: DLMMM is a novel deep learning framework testing method that integrates multiple model measurements—bug detection performance, operator combination variety, and execution time—and fuses them with multi-level heuristic guidance to overcome the limitations of existing single-indicator approaches.
Authors:Yanzhou Mu, Juan Zhai, Chunrong Fang, Xiang Chen, Zhixiang Cao, Peiran Yang, Yinglong Zou, Tao Zheng, Zhenyu Chen
Abstract:
Deep learning (DL) frameworks are the fundamental infrastructure for various DL applications. Framework defects can profoundly cause disastrous accidents, thus requiring sufficient detection. In previous studies, researchers adopt DL models as test inputs combined with mutation to generate more diverse models. Though these studies demonstrate promising results, most detected defects are considered trivial (i.e., either treated as edge cases or ignored by the developers). To identify important bugs that matter to developers, we propose a novel DL framework testing method DevMuT, which generates models by adopting mutation operators and constraints derived from developer expertise. DevMuT simulates developers'common operations in development and detects more diverse defects within more stages of the DL model lifecycle (e.g., model training and inference). We evaluate the performance of DevMuT on three widely used DL frameworks (i.e., PyTorch, JAX, and Mind- Spore) with 29 DL models from nine types of industry tasks. The experiment results show that DevMuT outperforms state-of-the-art baselines: it can achieve at least 71.68% improvement on average in the diversity of generated models and 28.20% improvement on average in the legal rates of generated models. Moreover, DevMuT detects 117 defects, 63 of which are confirmed, 24 are fixed, and eight are of high value confirmed by developers. Finally, DevMuT has been deployed in the MindSpore community since December 2023. These demonstrate the effectiveness of DevMuT in detecting defects that are close to the real scenes and are of concern to developers.
中文: DevMuT是一种新颖的深度学习框架测试方法,通过基于开发者经验的变异操作生成多样化模型,能有效检测深度学习生命周期各阶段的重要缺陷,在实际应用中展现出卓越性能。
English: DevMuT is a novel deep learning framework testing method that uses developer-informed mutations to generate diverse models, effectively detecting significant defects across multiple stages of the DL lifecycle and demonstrating superior performance in real-world applications.
Authors:Yanzhou Mu, Juan Zhai, Chunrong Fang, Xiang Chen, Zhixiang Cao, Peiran Yang, Kexin Zhao, An Guo, Zhenyu Chen
Abstract:
Deep learning (DL) frameworks are essential to DL-based software systems, and framework bugs may lead to substantial disasters, thus requiring effective testing. Researchers adopt DL models or single interfaces as test inputs and analyze their execution results to detect bugs. However, floating-point errors, inherent randomness, and the complexity of test inputs make it challenging to analyze execution results effectively, leading to existing methods suffering from a lack of suitable test oracles. Some researchers utilize metamorphic testing to tackle this challenge. They design Metamorphic Relations (MRs) based on input data and parameter settings of a single framework interface to generate equivalent test inputs, ensuring consistent execution results between original and generated test inputs. Despite their promising effectiveness, they still face certain limitations. (1) Existing MRs overlook structural complexity, limiting test input diversity. (2) Existing MRs focus on limited interfaces, which limits generalization and necessitates additional adaptations. (3) Their detected bugs are related to the result consistency of single interfaces and far from those exposed in multi-interface combinations and runtime metrics (e.g., resource usage). To address these limitations, we propose ModelMeta, a model-level metamorphic testing method for DL frameworks with four MRs focused on the structure characteristics of DL models. ModelMeta augments seed models with diverse interface combinations to generate test inputs with consistent outputs, guided by the QR-DQN strategy. It then detects bugs through fine-grained analysis of training loss/gradients, memory/GPU usage, and execution time.
中文: 深度学习框架测试面临浮点误差和输入复杂性的挑战,因此提出ModelMeta这一模型级蜕变测试方法,通过结构变换生成多样化测试输入,并通过细粒度运行时分析检测缺陷。
English: Deep learning framework testing faces challenges due to floating-point errors and input complexity, prompting the proposal of ModelMeta, a model-level metamorphic testing method that generates diverse test inputs through structural transformations and detects bugs via fine-grained runtime analysis.
Authors:Yiting Qu, Ziqing Yang, Yihan Ma, Michael Backes, Savvas Zannettou, Yang Zhang
Abstract:
Recent advances in text-to-image diffusion models have enabled the creation of a new form of digital art: optical illusions--visual tricks that create different perceptions of reality. However, adversaries may misuse such techniques to generate hateful illusions, which embed specific hate messages into harmless scenes and disseminate them across web communities. In this work, we take the first step toward investigating the risks of scalable hateful illusion generation and the potential for bypassing current content moderation models. Specifically, we generate 1,860 optical illusions using Stable Diffusion and ControlNet, conditioned on 62 hate messages. Of these, 1,571 are hateful illusions that successfully embed hate messages, either overtly or subtly, forming the Hateful Illusion dataset. Using this dataset, we evaluate the performance of six moderation classifiers and nine vision language models (VLMs) in identifying hateful illusions. Experimental results reveal significant vulnerabilities in existing moderation models: the detection accuracy falls below 0.245 for moderation classifiers and below 0.102 for VLMs. We further identify a critical limitation in their vision encoders, which mainly focus on surface-level image details while overlooking the secondary layer of information, i.e., hidden messages. To address this risk, we explore preliminary mitigation measures and identify the most effective approaches from the perspectives of image transformations and training-level strategies.
中文摘要:近期文本生成图像模型能制作嵌入仇恨信息的视觉错觉,现有审核系统存在显著漏洞,分类器检测准确率低于0.245,视觉语言模型低于0.102,构成严重内容安全风险。
English Summary: Recent text-to-image models can create optical illusions that embed hate messages, posing risks as current moderation systems show significant vulnerabilities with detection accuracy below 0.245 for classifiers and 0.102 for vision-language models.
Authors:Yiting Qu, Michael Backes, Yang Zhang
Abstract:
Vision-language models (VLMs) are increasingly applied to identify unsafe or inappropriate images due to their internal ethical standards and powerful reasoning abilities. However, it is still unclear whether they can recognize various unsafe concepts when presented in different modalities, such as text and images. To address this, we first compile the UnsafeConcepts dataset, featuring 75 unsafe concepts, i.e., ``Swastika,'' ``Sexual Harassment,'' and ``Assaults,'' along with associated 1.5K images. We then conduct a systematic evaluation of VLMs' perception (concept recognition) and alignment (ethical reasoning) capabilities. We assess eight popular VLMs and find that, although most VLMs accurately perceive unsafe concepts, they sometimes mistakenly classify these concepts as safe. We also identify a consistent modality gap among open-source VLMs in distinguishing between visual and textual unsafe concepts. To bridge this gap, we introduce a simplified reinforcement learning (RL)-based approach using proximal policy optimization (PPO) to strengthen the ability to identify unsafe concepts from images. Our approach uses reward scores based directly on VLM responses, bypassing the need for collecting human-annotated preference data to train a new reward model. Experimental results show that our approach effectively enhances VLM alignment on images while preserving general capabilities. It outperforms baselines such as supervised fine-tuning (SFT) and direct preference optimization (DPO). We hope our dataset, evaluation findings, and proposed alignment solution contribute to the community's efforts in advancing safe VLMs.
Chinese: 视觉语言模型(VLMs)能够识别不安全概念,但常将其误判为安全,因此提出了一种基于强化学习的方法,无需人工标注数据即可增强模型的对齐能力,并优于现有基准。
English: Vision-language models (VLMs) can recognize unsafe concepts but often misclassify them as safe, prompting the development of a reinforcement learning approach that improves alignment without human-annotated data, outperforming existing methods.
Authors:Zheng Yang, Kuan Xu, Shenghai Yuan, Lihua Xie
Abstract:
In this paper, we introduce a novel approach for efficiently estimating the 6-Degree-of-Freedom (DoF) robot pose with a decoupled, non-iterative method that capitalizes on overlapping planar elements. Conventional RGB-D visual odometry(RGBD-VO) often relies on iterative optimization solvers to estimate pose and involves a process of feature extraction and matching. This results in significant computational burden and time delays. To address this, our innovative method for RGBD-VO separates the estimation of rotation and translation. Initially, we exploit the overlaid planar characteristics within the scene to calculate the rotation matrix. Following this, we utilize a kernel cross-correlator (KCC) to ascertain the translation. By sidestepping the resource-intensive iterative optimization and feature extraction and alignment procedures, our methodology offers improved computational efficacy, achieving a performance of 71Hz on a lower-end i5 CPU. When the RGBD-VO does not rely on feature points, our technique exhibits enhanced performance in low-texture degenerative environments compared to state-of-the-art methods.
中文: 本文提出一种解耦非迭代的6自由度机器人位姿估计方法,利用重叠平面特征计算旋转矩阵并通过核互相关器确定平移,在低端硬件上实现71Hz性能,在低纹理退化环境中表现优于现有方法。
English: This paper presents a decoupled, non-iterative method for 6-DoF robot pose estimation that uses overlapping planar elements to compute rotation and a kernel cross-correlator for translation, achieving 71Hz performance on low-end hardware with improved efficiency in low-texture environments.
Authors:Weihuang Lin, Yiwei Ma, Xiaoshuai Sun, Shuting He, Jiayi Ji, Liujuan Cao, Rongrong Ji
Abstract:
The reasoning segmentation task involves segmenting objects within an image by interpreting implicit user instructions, which may encompass subtleties such as contextual cues and open-world knowledge. Despite significant advancements made by existing approaches, they remain constrained by low perceptual resolution, as visual encoders are typically pre-trained at lower resolutions. Furthermore, simply interpolating the positional embeddings of visual encoders to enhance perceptual resolution yields only marginal performance improvements while incurring substantial computational costs. To address this, we propose HRSeg, an efficient model with high-resolution fine-grained perception. It features two key innovations: High-Resolution Perception (HRP) and High-Resolution Enhancement (HRE). The HRP module processes high-resolution images through cropping, integrating local and global features for multi-granularity quality. The HRE module enhances mask features by integrating fine-grained information from high-resolution images, refining their alignment with text features for precise segmentation. Extensive ablation studies validate the effectiveness of our modules, while comprehensive experiments on multiple benchmark datasets demonstrate HRSeg's superior performance.
中文: HRSeg提出了一种高效的高分辨率细粒度感知模型,通过HRP模块实现多粒度特征整合和HRE模块增强掩码与文本对齐,在多个基准测试中展现出卓越的分割性能。
English: HRSeg introduces an efficient model with high-resolution fine-grained perception, featuring HRP for multi-granularity feature integration and HRE for enhanced mask-text alignment, achieving superior segmentation performance across benchmarks.
Authors:Ziyin Zhou, Yunpeng Luo, Yuanchen Wu, Ke Sun, Jiayi Ji, Ke Yan, Shouhong Ding, Xiaoshuai Sun, Yunsheng Wu, Rongrong Ji
Abstract:
The rapid development of AI-generated content (AIGC) technology has led to the misuse of highly realistic AI-generated images (AIGI) in spreading misinformation, posing a threat to public information security. Although existing AIGI detection techniques are generally effective, they face two issues: 1) a lack of human-verifiable explanations, and 2) a lack of generalization in the latest generation technology. To address these issues, we introduce a large-scale and comprehensive dataset, Holmes-Set, which includes the Holmes-SFTSet, an instruction-tuning dataset with explanations on whether images are AI-generated, and the Holmes-DPOSet, a human-aligned preference dataset. Our work introduces an efficient data annotation method called the Multi-Expert Jury, enhancing data generation through structured MLLM explanations and quality control via cross-model evaluation, expert defect filtering, and human preference modification. In addition, we propose Holmes Pipeline, a meticulously designed three-stage training framework comprising visual expert pre-training, supervised fine-tuning, and direct preference optimization. Holmes Pipeline adapts multimodal large language models (MLLMs) for AIGI detection while generating human-verifiable and human-aligned explanations, ultimately yielding our model AIGI-Holmes. During the inference stage, we introduce a collaborative decoding strategy that integrates the model perception of the visual expert with the semantic reasoning of MLLMs, further enhancing the generalization capabilities. Extensive experiments on three benchmarks validate the effectiveness of our AIGI-Holmes.
中文: 针对AI生成图像(AIGI)检测技术缺乏可解释性和泛化能力的问题,本研究提出了Holmes-Set数据集和三阶段训练框架,通过多专家评审机制和协作解码策略,开发出能提供人类可验证解释且具有强泛化能力的AIGI-Holmes检测模型。
English: The rapid advancement of AI-generated images (AIGI) poses risks to information security, and to address the limitations of existing detection methods, this study introduces the Holmes-Set dataset and a three-stage training pipeline that enhances AIGI detection with human-verifiable explanations and improved generalization.
Authors:Qinglong Yang, Haoming Li, Haotian Zhao, Xiaokai Yan, Jingtao Ding, Fengli Xu, Yong Li
Abstract:
Mobile GUI agents are becoming critical tools for enhancing human-device interaction efficiency, with multimodal large language models (MLLMs) emerging as dominant paradigms in this domain. Current agents, however, are limited to following explicit human instructions, resulting in insufficient capability for proactive intent anticipation. Additionally, these agents fail to leverage the contextual information associated with users during task execution, thereby neglecting potentially vast differences in user preferences. To address these challenges, we introduce the FingerTip benchmark. It contains two new tracks: proactive task suggestions by analyzing environment observation and users' previous intents, and personalized task execution by catering to users' action preferences. We collected unique human demonstrations of multi-step Android device interactions across a variety of everyday apps. These demonstrations are not isolated but are continuously acquired from the users' long-term usage in their real lives, and encompass essential user-related contextual information. Our experiments reveal challenges of the tasks we propose. The model fine-tuned with the data we collected effectively utilized user information and achieved good results, highlighting the potential of our approach in building more user-oriented mobile GUI agents. Our code is open-source at https://anonymous.4open.science/r/FingerTip-57B8 for reproducibility.
中文摘要:FingerTip基准通过引入基于环境观察和用户历史意图的主动任务建议及个性化执行功能,解决了当前移动GUI代理缺乏主动性和个性化的问题,利用真实用户交互数据显著提升了代理的实用性和用户导向性。
English Summary: The FingerTip benchmark is introduced to overcome limitations in current mobile GUI agents by enabling proactive intent anticipation and personalized task execution through user-specific contextual data, demonstrating improved performance in user-oriented interactions.
Authors:Cheng-Han Chiang, Xiaofei Wang, Linjie Li, Chung-Ching Lin, Kevin Lin, Shujie Liu, Zhendong Wang, Zhengyuan Yang, Hung-yi Lee, Lijuan Wang
Abstract:
Spoken Language Models (SLMs) are designed to take speech inputs and produce spoken responses. However, current SLMs lack the ability to perform an internal, unspoken thinking process before responding. In contrast, humans typically engage in complex mental reasoning internally, enabling them to communicate ideas clearly and concisely. Thus, integrating an unspoken thought process into SLMs is highly desirable. While naively generating a complete chain-of-thought (CoT) reasoning before starting to talk can enable thinking for SLMs, this induces additional latency for the speech response, as the CoT reasoning can be arbitrarily long. To solve this issue, we propose Stitch, a novel generation method that alternates between the generation of unspoken reasoning chunks and spoken response chunks. Since the audio duration of a chunk of spoken response is much longer than the time to generate the tokens in a chunk of spoken response, we use the remaining free time to generate the unspoken reasoning tokens. When a chunk of audio is played to the user, the model continues to generate the next unspoken reasoning chunk, achieving simultaneous thinking and talking. Remarkably, Stitch matches the latency of baselines that cannot generate unspoken CoT by design while outperforming those baselines by 15% on math reasoning datasets; Stitch also performs equally well on non-reasoning datasets as those baseline models. Some animations and demonstrations are on the project page: https://d223302.github.io/STITCH.
中文: 提出的Stitch方法通过交替生成内部推理块和语音响应块,使口语模型能够实现边思考边说话,在消除额外延迟的同时显著提升了推理任务的表现。
English: The proposed Stitch method enables Spoken Language Models to perform unspoken chain-of-thought reasoning simultaneously with speech generation by alternating between reasoning and response chunks, eliminating additional latency while improving performance on reasoning tasks.
Authors:Yiyang Zhou, Linjie Li, Shi Qiu, Zhengyuan Yang, Yuyang Zhao, Siwei Han, Yangfan He, Kangqi Li, Haonian Ji, Zihao Zhao, Haibo Tong, Lijuan Wang, Huaxiu Yao
Abstract:
Existing video benchmarks often resemble image-based benchmarks, with question types like "What actions does the person perform throughout the video?" or "What color is the woman's dress in the video?" For these, models can often answer by scanning just a few key frames, without deep temporal reasoning. This limits our ability to assess whether large vision-language models (LVLMs) can truly think with videos rather than perform superficial frame-level analysis. To address this, we introduce GLIMPSE, a benchmark specifically designed to evaluate whether LVLMs can genuinely think with videos. Unlike prior benchmarks, GLIMPSE emphasizes comprehensive video understanding beyond static image cues. It consists of 3,269 videos and over 4,342 highly visual-centric questions across 11 categories, including Trajectory Analysis, Temporal Reasoning, and Forensics Detection. All questions are carefully crafted by human annotators and require watching the entire video and reasoning over full video context-this is what we mean by thinking with video. These questions cannot be answered by scanning selected frames or relying on text alone. In human evaluations, GLIMPSE achieves 94.82% accuracy, but current LVLMs face significant challenges. Even the best-performing model, GPT-o3, reaches only 66.43%, highlighting that LVLMs still struggle to move beyond surface-level reasoning to truly think with videos.
Chinese: 现有视频基准常使模型仅通过分析关键帧即可回答问题,无需深度时序推理,这限制了对大型视觉语言模型(LVLM)真实视频理解能力的评估;为此,我们推出了GLIMPSE这一专门基准,其人工设计的问题需基于完整视频上下文进行分析,人类评估准确率达94.82%,而最佳LVLM模型GPT-o3仅达66.43%,凸显了它们在实现真正视频思维方面的不足。
English: Current video benchmarks often allow models to answer questions by analyzing key frames alone, lacking the need for deep temporal reasoning, which limits the assessment of large vision-language models' (LVLMs) true video comprehension; to address this, GLIMPSE is introduced as a specialized benchmark with human-crafted questions that require full video context analysis, revealing that while humans achieve 94.82% accuracy, top LVLMs like GPT-o3 only reach 66.43%, underscoring their struggle with genuine video-based thinking.
Authors:Yu Zheng, Jingtao Ding, Depeng Jin, Jianxi Gao, Yong Li
Abstract:
Many complex networks display remarkable resilience under external perturbations, internal failures and environmental changes, yet they can swiftly deteriorate into dysfunction upon the removal of a few keystone nodes. Discovering theories that measure network resilience offers the potential to prevent catastrophic collapses--from species extinctions to financial crise--with profound implications for real-world systems. Current resilience theories address the problem from a single perspective of topology, neglecting the crucial role of system dynamics, due to the intrinsic complexity of the coupling between topology and dynamics which exceeds the capabilities of human analytical methods. Here, we report an automatic method for resilience theory discovery, which learns from how AI solves a complicated network dismantling problem and symbolizes its network attack strategies into theoretical formulas. This proposed self-inductive approach discovers the first resilience theory that accounts for both topology and dynamics, highlighting how the correlation between node degree and state shapes overall network resilience, and offering insights for designing early warning signals of systematic collapses. Additionally, our approach discovers formulas that refine existing well-established resilience theories with over 37.5% improvement in accuracy, significantly advancing human understanding of complex networks with AI.
Chinese: 本研究提出一种自动方法,通过人工智能学习复杂网络瓦解策略来发现网络韧性理论,首次构建了融合拓扑与动力学的韧性理论,揭示节点度与状态相关性如何影响整体韧性,并将现有理论精度提升超过37.5%。
English: This study introduces an automated method that discovers network resilience theories by learning AI strategies for dismantling complex networks, producing the first theory integrating both topology and dynamics to reveal how node degree-state correlations shape resilience and improve existing theories by over 37.5% in accuracy.
Authors:Keer Lu, Zheng Liang, Youquan Li, Jiejun Tan, Da Pan, Shusen Zhang, Guosheng Dong, Huang Leng, Bin Cui, Wentao Zhang
Abstract:
In medical scenarios, effectively retrieving external knowledge and leveraging it for rigorous logical reasoning is of significant importance. Despite their potential, existing work has predominantly focused on enhancing either retrieval or reasoning capabilities of the models in isolation, with little attention given to their joint optimization, which leads to limited coordination between the two processes. Additionally, current methods rely heavily on supervised fine-tuning (SFT), which can cause models to memorize existing problem-solving pathways, thereby restricting their generalization ability when confronted with novel problem contexts. Furthermore, while some studies have explored to improve retrieval-augmented reasoning in general domains via reinforcement learning, their reward function designs do not adequately capture the specific demands of the medical domain. To address these challenges, we introduce **Med-R$^3$**, a **Med**ical **R**etrieval-augmented **R**easoning framework driven by progressive **R**einforcement learning. In this framework, we first develop the model's ability to perform logical reasoning over medical problems. Subsequently, on the basis of this foundation, we adaptively optimize the retrieval capability to better align with the characteristics of knowledge corpus and external information utilization throughout the reasoning process. Finally, we conduct joint optimization of the model's retrieval and reasoning coordination. Extensive experiments indicate that **Med-R$^3$** could achieve state-of-the-art performances, with LLaMA3.1-8B-Instruct + Med-R$^3$ surpassing closed-sourced GPT-4o-mini by 3.93\% at a comparable parameter scale, while Qwen2.5-14B augmented with Med-R$^3$ shows a more substantial gain of 13.53\%.
中文摘要:提出的Med-R³框架通过渐进式强化学习联合优化检索与推理能力,解决了当前医疗AI系统的局限性,实现了超越闭源模型的顶尖性能表现。
English Summary: The proposed Med-R³ framework addresses the limitations of current medical AI systems by jointly optimizing retrieval and reasoning capabilities through progressive reinforcement learning, achieving state-of-the-art performance that surpasses even closed-source models.
Authors:Keer Lu, Zheng Liang, Youquan Li, Jiejun Tan, Da Pan, Shusen Zhang, Guosheng Dong, Zhonghai Wu, Huang Leng, Bin Cui, Wentao Zhang
Abstract:
In medical scenarios, effectively retrieving external knowledge and leveraging it for rigorous logical reasoning is of significant importance. Despite their potential, existing work has predominantly focused on enhancing either retrieval or reasoning capabilities of the models in isolation, with little attention given to their joint optimization, which leads to limited coordination between the two processes. Additionally, current methods rely heavily on supervised fine-tuning (SFT), which can cause models to memorize existing problem-solving pathways, thereby restricting their generalization ability when confronted with novel problem contexts. Furthermore, while some studies have explored to improve retrieval-augmented reasoning in general domains via reinforcement learning, their reward function designs do not adequately capture the specific demands of the medical domain. To address these challenges, we introduce **Med-R$^3$**, a **Med**ical **R**etrieval-augmented **R**easoning framework driven by progressive **R**einforcement learning. In this framework, we first develop the model's ability to perform logical reasoning over medical problems. Subsequently, on the basis of this foundation, we adaptively optimize the retrieval capability to better align with the characteristics of knowledge corpus and external information utilization throughout the reasoning process. Finally, we conduct joint optimization of the model's retrieval and reasoning coordination. Extensive experiments indicate that **Med-R$^3$** could achieve state-of-the-art performances, with LLaMA3.1-8B-Instruct + Med-R$^3$ surpassing closed-sourced GPT-4o-mini by 3.93\% at a comparable parameter scale, while Qwen2.5-14B augmented with Med-R$^3$ shows a more substantial gain of 13.53\%.
中文摘要:提出的Med-R³框架通过渐进式强化学习联合优化检索与推理能力,解决了当前医疗AI系统的局限性,实现了超越闭源模型的顶尖性能表现。
English Summary: The proposed Med-R³ framework addresses the limitations of current medical AI systems by jointly optimizing retrieval and reasoning capabilities through progressive reinforcement learning, achieving state-of-the-art performance that surpasses even closed-source models.
Authors:Lingfeng He, De Cheng, Zhiheng Ma, Huaijie Wang, Dingwen Zhang, Nannan Wang, Xinbo Gao
Abstract:
Continual Learning (CL) empowers AI models to continuously learn from sequential task streams. Recently, parameter-efficient fine-tuning (PEFT)-based CL methods have garnered increasing attention due to their superior performance. They typically allocate a unique sub-module for learning each task, with a task recognizer to select the appropriate sub-modules for testing images. However, due to the feature subspace misalignment from independently trained sub-modules, these methods tend to produce ambiguous decisions under misleading task-ids. To address this, we propose Cross-subspace Knowledge Alignment and Aggregation (CKAA), a novel framework that enhances model robustness against misleading task-ids through two key innovations: (1) Dual-level Knowledge Alignment (DKA): By aligning intra-class feature distributions across different subspaces and learning a robust global classifier through a feature simulation process, DKA enables the model to distinguish features from both correct and incorrect subspaces during training. (2) Task-Confidence-guided Mixture of Adapters (TC-MoA): A robust inference scheme that adaptively aggregates task-specific knowledge from relevant sub-modules based on task-confidence scores, avoiding overconfidence in misleading task-id predictions. Extensive experiments demonstrate that CKAA outperforms existing PEFT-based CL methods.
中文:提出的跨子空间知识对齐与聚合框架通过双层级知识对齐和基于任务置信度的适配器混合机制,有效提升了持续学习模型对误导性任务标识的鲁棒性,性能优于现有参数高效方法。
English: The proposed Cross-subspace Knowledge Alignment and Aggregation (CKAA) framework enhances continual learning robustness against misleading task-ids through dual-level knowledge alignment and a task-confidence-guided mixture of adapters, outperforming existing parameter-efficient methods.
Authors:De Cheng, Zhipeng Xu, Xinyang Jiang, Dongsheng Li, Nannan Wang, Xinbo Gao
Abstract:
Domain Generalization (DG) seeks to develop a versatile model capable of performing effectively on unseen target domains. Notably, recent advances in pre-trained Visual Foundation Models (VFMs), such as CLIP, have demonstrated considerable potential in enhancing the generalization capabilities of deep learning models. Despite the increasing attention toward VFM-based domain prompt tuning within DG, the effective design of prompts capable of disentangling invariant features across diverse domains remains a critical challenge. In this paper, we propose addressing this challenge by leveraging the controllable and flexible language prompt of the VFM. Noting that the text modality of VFMs is naturally easier to disentangle, we introduce a novel framework for text feature-guided visual prompt tuning. This framework first automatically disentangles the text prompt using a large language model (LLM) and then learns domain-invariant visual representation guided by the disentangled text feature. However, relying solely on language to guide visual feature disentanglement has limitations, as visual features can sometimes be too complex or nuanced to be fully captured by descriptive text. To address this, we introduce Worst Explicit Representation Alignment (WERA), which extends text-guided visual prompts by incorporating an additional set of abstract prompts. These prompts enhance source domain diversity through stylized image augmentations, while alignment constraints ensure that visual representations remain consistent across both the original and augmented distributions. Experiments conducted on major DG datasets, including PACS, VLCS, OfficeHome, DomainNet, and TerraInc, demonstrate that our proposed method outperforms state-of-the-art DG methods.
Chinese: 本文提出了一种文本特征引导的视觉提示调优框架,利用大语言模型解耦文本提示并结合最差显式表征对齐方法增强跨领域不变性视觉表示,在多个领域泛化基准测试中取得了最优性能。
English: This paper introduces a text feature-guided visual prompt tuning framework that leverages disentangled language prompts from LLMs and incorporates Worst Explicit Representation Alignment (WERA) to enhance domain-invariant visual representations, achieving state-of-the-art performance on multiple domain generalization benchmarks.
Authors:Xiaoyuan Li, Moxin Li, Wenjie Wang, Rui Men, Yichang Zhang, Fuli Feng, Dayiheng Liu, Junyang Lin
Abstract:
Recent progress in Multi-modal Large Language Models (MLLMs) has enabled step-by-step multi-modal mathematical reasoning by performing visual operations based on the textual instructions. A promising approach uses code as an intermediate representation to precisely express and manipulate the images in the reasoning steps. However, existing evaluations focus mainly on text-only reasoning outputs, leaving the MLLM's ability to perform accurate visual operations via code largely unexplored. This work takes a first step toward addressing that gap by evaluating MLLM's code-based capabilities in multi-modal mathematical reasoning.Specifically, our framework focuses on two key evaluation aspects: (1) Multi-modal Code Generation (MCG) evaluates the model's ability to accurately understand and construct visualizations from scratch. (2) Multi-modal Code Editing (MCE) assesses the model's capacity for fine-grained operations, which include three types: Deletion, Modification and Annotation. To evaluate the above tasks, we incorporate a dataset that covers the five most popular types of mathematical figures, including geometric diagrams, function plots, and three types of statistical charts, to provide a comprehensive and effective measurement of existing MLLMs. Our experimental evaluation involves nine mainstream MLLMs, and the results reveal that existing models still lag significantly behind human performance in performing fine-grained visual operations.
中文: 本研究评估多模态大语言模型通过代码生成与编辑执行精确视觉操作的能力,发现在各类数学图形处理中现有模型与人类表现仍存在显著差距。
English: This study evaluates Multi-modal Large Language Models' ability to perform precise visual operations through code generation and editing in mathematical reasoning, revealing significant gaps compared to human performance across diverse mathematical figures.
Authors:Junhao Shen, Haiteng Zhao, Yuzhe Gu, Songyang Gao, Kuikun Liu, Haian Huang, Jianfei Gao, Dahua Lin, Wenwei Zhang, Kai Chen
Abstract:
Enhancing large vision-language models (LVLMs) with visual slow-thinking reasoning is crucial for solving complex multimodal tasks. However, since LVLMs are mainly trained with vision-language alignment, it is difficult to adopt on-policy reinforcement learning (RL) to develop the slow thinking ability because the rollout space is restricted by its initial abilities. Off-policy RL offers a way to go beyond the current policy, but directly distilling trajectories from external models may cause visual hallucinations due to mismatched visual perception abilities across models. To address these issues, this paper proposes SOPHIA, a simple and scalable Semi-Off-Policy RL for vision-language slow-tHInking reAsoning. SOPHIA builds a semi-off-policy behavior model by combining on-policy visual understanding from a trainable LVLM with off-policy slow-thinking reasoning from a language model, assigns outcome-based rewards to reasoning, and propagates visual rewards backward. Then LVLM learns slow-thinking reasoning ability from the obtained reasoning trajectories using propagated rewards via off-policy RL algorithms. Extensive experiments with InternVL2.5 and InternVL3.0 with 8B and 38B sizes show the effectiveness of SOPHIA. Notably, SOPHIA improves InternVL3.0-38B by 8.50% in average, reaching state-of-the-art performance among open-source LVLMs on multiple multimodal reasoning benchmarks, and even outperforms some closed-source models (e.g., GPT-4.1) on the challenging MathVision and OlympiadBench, achieving 49.08% and 49.95% pass@1 accuracy, respectively. Analysis shows SOPHIA outperforms supervised fine-tuning and direct on-policy RL methods, offering a better policy initialization for further on-policy training.
中文: 本文提出SOPHIA方法,通过结合可训练视觉语言模型的在线视觉理解与语言模型的离线慢思考推理,并传播视觉奖励,有效增强大模型的慢思考推理能力,在多个基准测试中达到最优性能。
English: This paper introduces SOPHIA, a semi-off-policy reinforcement learning method that enhances large vision-language models' slow-thinking reasoning by combining on-policy visual understanding with off-policy reasoning and propagating visual rewards, achieving state-of-the-art performance on multiple benchmarks.
Authors:Zhengwei Tao, Jialong Wu, Wenbiao Yin, Junkai Zhang, Baixuan Li, Haiyang Shen, Kuan Li, Liwen Zhang, Xinyu Wang, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
Abstract:
The advent of Large Language Model (LLM)-powered agents has revolutionized artificial intelligence by enabling solutions to complex, open-ended tasks through web-based information-seeking (IS) capabilities. The scarcity of high-quality training data has limited the development of IS agents. Existing approaches typically adopt an information-driven paradigm that first collects web data and then generates questions based on the retrieval. However, this may lead to inconsistency between information structure and reasoning structure, question and answer. To mitigate, we propose a formalization-driven IS data synthesis framework WebShaper to construct a dataset. WebShaper systematically formalizes IS tasks through set theory. Central to the formalization is the concept of Knowledge Projections (KP), which enables precise control over reasoning structure by KP operation compositions. During synthesis, we begin by creating seed tasks, then use a multi-step expansion process. At each step, an agentic Expander expands the current formal question more complex with retrieval and validation tools based on our formalization. We train our model on the synthesized dataset. Experiment results demonstrate that WebShaper achieves state-of-the-art performance among open-sourced IS agents on GAIA and WebWalkerQA benchmarks.
中文: WebShaper框架通过集合论形式化信息检索任务,并采用多步扩展过程合成高质量数据集,解决了信息检索智能体训练数据稀缺的问题,在基准测试中取得了领先性能。
English: The WebShaper framework addresses the scarcity of training data for information-seeking agents by formalizing tasks through set theory and using a multi-step expansion process to synthesize high-quality datasets, achieving state-of-the-art performance on benchmarks.
Authors:Boyi Deng, Yu Wan, Baosong Yang, Fei Huang, Wenjie Wang, Fuli Feng
Abstract:
Large Language Models (LLMs) have impressive multilingual capabilities, but they suffer from unexpected code-switching, also known as language mixing, which involves switching to unexpected languages in the model response. This problem leads to poor readability and degrades the usability of model responses. However, existing work on this issue lacks a mechanistic analysis and shows limited effectiveness. In this paper, we first provide an in-depth analysis of unexpected code-switching using sparse autoencoders and find that when LLMs switch to a language, the features of that language exhibit excessive pre-activation values. Based on our findings, we propose $\textbf{S}$parse $\textbf{A}$utoencoder-guided $\textbf{S}$upervised $\textbf{F}$ine$\textbf{t}$uning (SASFT), which teaches LLMs to maintain appropriate pre-activation values of specific language features during training. Experiments on five models across three languages demonstrate that SASFT consistently reduces unexpected code-switching by more than 50\% compared to standard supervised fine-tuning, with complete elimination in four cases. Moreover, SASFT maintains or even improves the models' performance on six multilingual benchmarks, showing its effectiveness in addressing code-switching while preserving multilingual capabilities.
中文摘要:大语言模型存在意外的语码转换问题,但本研究提出的SASFT方法通过调控特征预激活值,能在保持多语言能力的同时将意外语码转换减少50%以上。
English Summary: Large Language Models often exhibit unexpected code-switching that impairs usability, but the proposed SASFT method effectively reduces this issue by over 50% while maintaining multilingual performance through controlled feature pre-activation.
Authors:Zhouqi Hua, Wenwei Zhang, Chengqi Lyu, Yuzhe Gu, Songyang Gao, Kuikun Liu, Kai Chen
Abstract:
Length generalization, the ability to solve problems of longer sequences than those observed during training, poses a core challenge of Transformer-based large language models (LLM). Although existing studies have predominantly focused on data-driven approaches for arithmetic operations and symbolic manipulation tasks, these approaches tend to be task-specific with limited overall performance. To pursue a more general solution, this paper focuses on a broader case of reasoning problems that are computable, i.e., problems that algorithms can solve, thus can be solved by the Turing Machine. From this perspective, this paper proposes Turing MAchine Imitation Learning (TAIL) to improve the length generalization ability of LLMs. TAIL synthesizes chain-of-thoughts (CoT) data that imitate the execution process of a Turing Machine by computer programs, which linearly expands the reasoning steps into atomic states to alleviate shortcut learning and explicit memory fetch mechanism to reduce the difficulties of dynamic and long-range data access in elementary operations. To validate the reliability and universality of TAIL, we construct a challenging synthetic dataset covering 8 classes of algorithms and 18 tasks. Without bells and whistles, TAIL significantly improves the length generalization ability as well as the performance of Qwen2.5-7B on various tasks using only synthetic data, surpassing previous methods and DeepSeek-R1. The experimental results reveal that the key concepts in the Turing Machine, instead of the thinking styles, are indispensable for TAIL for length generalization, through which the model exhibits read-and-write behaviors consistent with the properties of the Turing Machine in their attention layers. This work provides a promising direction for future research in the learning of LLM reasoning from synthetic data.
中文: 本文提出图灵机模仿学习(TAIL),通过合成模拟图灵机执行过程的思维链数据来提升基于Transformer的大语言模型在可计算推理任务中的长度泛化能力,实验表明该方法显著优于现有方法。
English: This paper introduces Turing MAchine Imitation Learning (TAIL), a method that synthesizes chain-of-thought data mimicking Turing Machine execution to enhance Transformer-based LLMs' length generalization across various computable reasoning tasks, significantly outperforming previous approaches in experiments.
Authors:Zhouqi Hua, Wenwei Zhang, Chengqi Lyu, Yuzhe Gu, Songyang Gao, Kuikun Liu, Dahua Lin, Kai Chen
Abstract:
Length generalization, the ability to solve problems of longer sequences than those observed during training, poses a core challenge of Transformer-based large language models (LLM). Although existing studies have predominantly focused on data-driven approaches for arithmetic operations and symbolic manipulation tasks, these approaches tend to be task-specific with limited overall performance. To pursue a more general solution, this paper focuses on a broader case of reasoning problems that are computable, i.e., problems that algorithms can solve, thus can be solved by the Turing Machine. From this perspective, this paper proposes Turing MAchine Imitation Learning (TAIL) to improve the length generalization ability of LLMs. TAIL synthesizes chain-of-thoughts (CoT) data that imitate the execution process of a Turing Machine by computer programs, which linearly expands the reasoning steps into atomic states to alleviate shortcut learning and explicit memory fetch mechanism to reduce the difficulties of dynamic and long-range data access in elementary operations. To validate the reliability and universality of TAIL, we construct a challenging synthetic dataset covering 8 classes of algorithms and 18 tasks. Without bells and whistles, TAIL significantly improves the length generalization ability as well as the performance of Qwen2.5-7B on various tasks using only synthetic data, surpassing previous methods and DeepSeek-R1. The experimental results reveal that the key concepts in the Turing Machine, instead of the thinking styles, are indispensable for TAIL for length generalization, through which the model exhibits read-and-write behaviors consistent with the properties of the Turing Machine in their attention layers. This work provides a promising direction for future research in the learning of LLM reasoning from synthetic data.
中文: 本文提出图灵机模仿学习(TAIL),通过合成模拟图灵机执行过程的思维链数据来提升基于Transformer的大语言模型在可计算推理任务中的长度泛化能力,实验表明该方法显著优于现有方法。
English: This paper introduces Turing MAchine Imitation Learning (TAIL), a method that synthesizes chain-of-thought data mimicking Turing Machine execution to enhance Transformer-based LLMs' length generalization across various computable reasoning tasks, significantly outperforming previous approaches in experiments.
Authors:Yiming Ren, Zhiqiang Lin, Yu Li, Gao Meng, Weiyun Wang, Junjie Wang, Zicheng Lin, Jifeng Dai, Yujiu Yang, Wenhai Wang, Ruihang Chu
Abstract:
Controllable captioning is essential for precise multimodal alignment and instruction following, yet existing models often lack fine-grained control and reliable evaluation protocols. To address this gap, we present the AnyCap Project, an integrated solution spanning model, dataset, and evaluation. We introduce AnyCapModel (ACM), a lightweight plug-and-play framework that enhances the controllability of existing foundation models for omni-modal captioning without retraining the base model. ACM reuses the original captions from base models while incorporating user instructions and modality features to generate improved captions. To remedy the data scarcity in controllable multimodal captioning, we build AnyCapDataset (ACD), covering three modalities, 28 user-instruction types, and 300\,k high-quality data entries. We further propose AnyCapEval, a new benchmark that provides more reliable evaluation metrics for controllable captioning by decoupling content accuracy and stylistic fidelity. ACM markedly improves caption quality across a diverse set of base models on AnyCapEval. Notably, ACM-8B raises GPT-4oÅ content scores by 45\% and style scores by 12\%, and it also achieves substantial gains on widely used benchmarks such as MIA-Bench and VidCapBench.
中文摘要:AnyCap项目提出了一个即插即用框架(ACM),无需重新训练基础模型即可增强多模态描述的精细控制能力,并配套构建了涵盖三大模态的大规模数据集(ACD)和新型评估基准,在多个测试中显著提升了内容准确性与风格契合度。
English Summary: The AnyCap Project introduces a plug-and-play framework (ACM) that enhances existing foundation models' caption controllability across multiple modalities, supported by a comprehensive dataset (ACD) and a new evaluation benchmark demonstrating significant improvements in content and style scores.
Authors:Zihan Ma, Taolin Zhang, Maosong Cao, Junnan Liu, Wenwei Zhang, Minnan Luo, Songyang Zhang, Kai Chen
Abstract:
Large language models (LLMs) have recently achieved notable success in code-generation benchmarks such as HumanEval and LiveCodeBench. However, a detailed examination reveals that these evaluation suites often comprise only a limited number of homogeneous test cases, resulting in subtle faults going undetected. This not only artificially inflates measured performance but also compromises accurate reward estimation in reinforcement learning frameworks utilizing verifiable rewards (RLVR). To address these critical shortcomings, we systematically investigate the test-case generation (TCG) task by proposing multi-dimensional metrics designed to rigorously quantify test-suite thoroughness. Furthermore, we introduce a human-LLM collaborative method (SAGA), leveraging human programming expertise with LLM reasoning capability, aimed at significantly enhancing both the coverage and the quality of generated test cases. In addition, we develop a TCGBench to facilitate the study of the TCG task. Experiments show that SAGA achieves a detection rate of 90.62% and a verifier accuracy of 32.58% on TCGBench. The Verifier Accuracy (Verifier Acc) of the code generation evaluation benchmark synthesized by SAGA is 10.78% higher than that of LiveCodeBench-v6. These results demonstrate the effectiveness of our proposed method. We hope this work contributes to building a scalable foundation for reliable LLM code evaluation, further advancing RLVR in code generation, and paving the way for automated adversarial test synthesis and adaptive benchmark integration.
Chinese: 大语言模型在代码生成基准测试中的表现因测试用例有限而被高估,而我们提出的人机协作方法SAGA显著提升了测试覆盖率和质量,实现了90.62%的缺陷检测率,推动了可靠评估的发展。
English: Large language models' performance in code-generation benchmarks is often overstated due to limited test cases, but our proposed human-LLM collaborative method SAGA significantly improves test coverage and quality, achieving a 90.62% fault detection rate and advancing reliable evaluation.
Authors:Honghui Bao, Wenjie Wang, Xinyu Lin, Fengbin Zhu, Teng Sun, Fuli Feng, Tat-Seng Chua
Abstract:
Leveraging Large Language Models (LLMs) for recommendation has demonstrated notable success in various domains, showcasing their potential for open-domain recommendation. A key challenge to advancing open-domain recommendation lies in effectively modeling user preferences from users' heterogeneous behaviors across multiple domains. Existing approaches, including ID-based and semantic-based modeling, struggle with poor generalization, an inability to compress noisy interactions effectively, and the domain seesaw phenomenon. To address these challenges, we propose a Heterogeneous User Modeling (HUM) method, which incorporates a compression enhancer and a robustness enhancer for LLM-based recommendation. The compression enhancer uses a customized prompt to compress heterogeneous behaviors into a tailored token, while a masking mechanism enhances cross-domain knowledge extraction and understanding. The robustness enhancer introduces a domain importance score to mitigate the domain seesaw phenomenon by guiding domain optimization. Extensive experiments on heterogeneous datasets validate that HUM effectively models user heterogeneity by achieving both high efficacy and robustness, leading to superior performance in open-domain recommendation.
中文摘要:提出的异构用户建模(HUM)方法通过定制化提示压缩用户多领域行为,并结合领域重要性评分机制,有效解决了开放域推荐中的领域失衡问题,实现了优越的推荐性能。
English Summary: The proposed Heterogeneous User Modeling (HUM) method enhances LLM-based recommendations by compressing diverse user behaviors into tailored tokens and mitigating domain imbalance through specialized mechanisms, achieving superior performance in open-domain settings.
Authors:Longfei Li, Zhiwen Fan, Wenyan Cong, Xinhang Liu, Yuyang Yin, Matt Foutter, Panwang Pan, Chenyu You, Yue Wang, Zhangyang Wang, Yao Zhao, Marco Pavone, Yunchao Wei
Abstract:
Synthesizing realistic Martian landscape videos is crucial for mission rehearsal and robotic simulation. However, this task poses unique challenges due to the scarcity of high-quality Martian data and the significant domain gap between Martian and terrestrial imagery. To address these challenges, we propose a holistic solution composed of two key components: 1) A data curation pipeline Multimodal Mars Synthesis (M3arsSynth), which reconstructs 3D Martian environments from real stereo navigation images, sourced from NASA's Planetary Data System (PDS), and renders high-fidelity multiview 3D video sequences. 2) A Martian terrain video generator, MarsGen, which synthesizes novel videos visually realistic and geometrically consistent with the 3D structure encoded in the data. Our M3arsSynth engine spans a wide range of Martian terrains and acquisition dates, enabling the generation of physically accurate 3D surface models at metric-scale resolution. MarsGen, fine-tuned on M3arsSynth data, synthesizes videos conditioned on an initial image frame and, optionally, camera trajectories or textual prompts, allowing for video generation in novel environments. Experimental results show that our approach outperforms video synthesis models trained on terrestrial datasets, achieving superior visual fidelity and 3D structural consistency.
中文摘要:本研究提出了一种结合M3arsSynth数据管道(从NASA图像重建火星三维环境)和MarsGen视频生成器的完整解决方案,能生成视觉逼真且几何一致的火星景观视频,在视觉保真度和三维结构一致性上均优于基于地球数据训练的模型。
English Summary: This study introduces a comprehensive framework combining the M3arsSynth data pipeline for reconstructing 3D Martian environments from NASA imagery and the MarsGen video generator to produce realistic Martian landscape videos with enhanced visual and geometric accuracy compared to terrestrial-trained models.
Authors:Yiming Zhong, Xiaolin Zhang, Ligang Liu, Yao Zhao, Yunchao Wei
Abstract:
Similar to facial beautification in real life, 3D virtual avatars require personalized customization to enhance their visual appeal, yet this area remains insufficiently explored. Although current 3D Gaussian editing methods can be adapted for facial makeup purposes, these methods fail to meet the fundamental requirements for achieving realistic makeup effects: 1) ensuring a consistent appearance during drivable expressions, 2) preserving the identity throughout the makeup process, and 3) enabling precise control over fine details. To address these, we propose a specialized 3D makeup method named AvatarMakeup, leveraging a pretrained diffusion model to transfer makeup patterns from a single reference photo of any individual. We adopt a coarse-to-fine idea to first maintain the consistent appearance and identity, and then to refine the details. In particular, the diffusion model is employed to generate makeup images as supervision. Due to the uncertainties in diffusion process, the generated images are inconsistent across different viewpoints and expressions. Therefore, we propose a Coherent Duplication method to coarsely apply makeup to the target while ensuring consistency across dynamic and multiview effects. Coherent Duplication optimizes a global UV map by recoding the averaged facial attributes among the generated makeup images. By querying the global UV map, it easily synthesizes coherent makeup guidance from arbitrary views and expressions to optimize the target avatar. Given the coarse makeup avatar, we further enhance the makeup by incorporating a Refinement Module into the diffusion model to achieve high makeup quality. Experiments demonstrate that AvatarMakeup achieves state-of-the-art makeup transfer quality and consistency throughout animation.
中文: 本文提出的AvatarMakeup方法通过扩散模型从单张参考照片迁移妆容,采用从粗到细的策略和连贯复制技术,解决了现有3D虚拟形象化妆在动态表情一致性、身份保持和细节控制方面的不足。
English: The proposed AvatarMakeup method addresses limitations in current 3D facial beautification by using a diffusion model to transfer makeup from a reference photo, ensuring consistency across expressions and views through coherent duplication and refinement techniques.
Authors:Yulai Zhao, Haolin Liu, Dian Yu, S. Y. Kung, Haitao Mi, Dong Yu
Abstract:
Generative reward models (also known as LLMs-as-judges), which use large language models (LLMs) to evaluate answer quality, are increasingly adopted in reinforcement learning with verifiable rewards (RLVR). They are often preferred over rigid rule-based metrics, especially for complex reasoning tasks involving free-form outputs. In this paradigm, an LLM is typically prompted to compare a candidate answer against a ground-truth reference and assign a binary reward indicating correctness. Despite the seeming simplicity of this comparison task, we find that generative reward models exhibit surprising vulnerabilities to superficial manipulations: non-word symbols (e.g., ":" or ".") or reasoning openers like "Thought process:" and "Let's solve this problem step by step." can often lead to false positive rewards. We demonstrate that this weakness is widespread across LLMs, datasets, and prompt formats, posing a serious threat for core algorithmic paradigms that rely on generative reward models, such as rejection sampling, preference optimization, and RLVR. To mitigate this issue, we introduce a simple yet effective data augmentation strategy and train a new generative reward model with substantially improved robustness. Our findings highlight the urgent need for more reliable LLM-based evaluation methods. We release our robust, general-domain reward model and its synthetic training data at https://huggingface.co/sarosavo/Master-RM and https://huggingface.co/datasets/sarosavo/Master-RM.
中文: 作为自动评分器的大语言模型易受"万能钥匙"式表面输入导致的奖励破解,但通过数据增强策略构建的Master Reward模型能有效抵御此类攻击且保持标准评估性能。
English: Large language models used as automated judges are vulnerable to reward hacking through superficial inputs called "master keys," but a proposed data augmentation strategy creates robust Master Reward Models that resist these attacks while maintaining standard performance.
Authors:Yulai Zhao, Haolin Liu, Dian Yu, Sunyuan Kung, Meijia Chen, Haitao Mi, Dong Yu
Abstract:
Large language models (LLMs) are increasingly trusted as automated judges, assisting evaluation and providing reward signals for training other models, particularly in reference-based settings like Reinforcement Learning with Verifiable Rewards (RLVR). However, we uncover a critical vulnerability even in this reference-based paradigm: generative reward models are systematically susceptible to reward hacking. We find that superficial inputs, which we term ''master keys'' such as non-word symbols (e.g., '':'' or ''.'') or generic reasoning openers (e.g., ''Thought process:'' or ''Let's solve this problem step by step.''), can consistently elicit false positive rewards without any substantive reasoning. Our systematic evaluation demonstrates this is a widespread failure affecting a diverse range of models, including leading proprietary systems such as GPT-o1 and Claude-4. These results challenge the assumed robustness of LLM judges and pose a significant threat to their reliability. To address this, we propose a simple yet effective data augmentation strategy using truncated model outputs as adversarial negative examples. The resulting Master Reward Models (Master-RMs) demonstrate state-of-the-art robustness against these ''master key'' attacks while maintaining high performance in standard evaluation settings. We supplement these findings with a comprehensive analysis of the vulnerability across model scales, prompt variations, and common inference-time strategies, offering insights to guide future research on robust LLM evaluation. We release our robust, general-domain reward models and the synthetic training data at https://huggingface.co/sarosavo/Master-RM and https://huggingface.co/datasets/sarosavo/Master-RM.
中文: 作为自动评分器的大语言模型易受"万能钥匙"式表面输入导致的奖励破解,但通过数据增强策略构建的Master Reward模型能有效抵御此类攻击且保持标准评估性能。
English: Large language models used as automated judges are vulnerable to reward hacking through superficial inputs called "master keys," but a proposed data augmentation strategy creates robust Master Reward Models that resist these attacks while maintaining standard performance.
Authors:Zhenwen Liang, Linfeng Song, Yang Li, Tao Yang, Feng Zhang, Haitao Mi, Dong Yu
Abstract:
Automated Theorem Proving (ATP) in formal languages is a foundational challenge for AI. While Large Language Models (LLMs) have driven remarkable progress, a significant gap remains between their powerful informal reasoning capabilities and their weak formal proving performance. Recent studies show that the informal accuracy exceeds 80% while formal success remains below 8% on benchmarks like PutnamBench. We argue this gap persists because current state-of-the-art provers, by tightly coupling reasoning and proving, are trained with paradigms that inadvertently punish deep reasoning in favor of shallow, tactic-based strategies. To bridge this fundamental gap, we propose a novel framework that decouples high-level reasoning from low-level proof generation. Our approach utilizes two distinct, specialized models: a powerful, general-purpose Reasoner to generate diverse, strategic subgoal lemmas, and an efficient Prover to rigorously verify them. This modular design liberates the model's full reasoning potential and bypasses the pitfalls of end-to-end training. We evaluate our method on a challenging set of post-2000 IMO problems, a problem set on which no prior open-source prover has reported success. Our decoupled framework successfully solves 5 of these problems, demonstrating a significant step towards automated reasoning on exceptionally difficult mathematical challenges. To foster future research, we release our full dataset of generated and verified lemmas for a wide range of IMO problems, available at https://tencent-imo.github.io/ .
中文摘要:本文提出了一种新颖框架,通过分离推理与证明生成并采用专门化模型,成功解决了五道高难度国际数学奥林匹克问题,并发布了数据集以推动自动定理证明领域的进一步发展。
English Summary: This paper introduces a novel framework that decouples reasoning from proof generation using specialized models to bridge the performance gap in automated theorem proving, successfully solving five challenging IMO problems and releasing a dataset to advance future research.
Authors:Yucheng Shi, Wenhao Yu, Zaitang Li, Yonglin Wang, Hongming Zhang, Ninghao Liu, Haitao Mi, Dong Yu
Abstract:
Recently, there has been a surge of vision-based GUI agents designed to automate everyday mobile and web tasks. These agents interpret raw GUI screenshots and autonomously decide where to click, scroll, or type, which bypasses handcrafted rules and app-specific APIs. However, most existing methods trained GUI agent in the offline environment using pre-collected trajectories. This approach limits scalability, causes overfitting to specific UI templates, and leads to brittle policies when faced with unseen environment. We present MobileGUI-RL, a scalable framework that trains GUI agent in online environment. MobileGUI-RL contains two key components. It (i) synthesizes a curriculum of learnable tasks through self-exploration and filtering, and (ii) adapts GRPO to GUI navigation with trajectory-aware advantages and composite rewards that balance task success and execution efficiency. Experiments on three online mobile-agent benchmarks show consistent gains, validating the effectiveness of our approach.
中文: MobileGUI-RL是一个可扩展框架,通过自主探索的任务课程和优化的强化学习在在线环境中训练GUI智能体,在移动智能体基准测试中实现了稳定的性能提升。
English: MobileGUI-RL is a scalable framework that trains GUI agents in online environments through self-explored task curricula and adapted reinforcement learning, achieving consistent performance gains on mobile-agent benchmarks.
Authors:Zhang Liu, Lianfen Huang, Zhibin Gao, Xianbin Wang, Dusit Niyato, Xuemin, Shen
Abstract:
Low altitude uncrewed aerial vehicles (UAVs) are expected to facilitate the development of aerial-ground integrated intelligent transportation systems and unlocking the potential of the emerging low-altitude economy. However, several critical challenges persist, including the dynamic optimization of network resources and UAV trajectories, limited UAV endurance, and imperfect channel state information (CSI). In this paper, we offer new insights into low-altitude economy networking by exploring intelligent UAV-assisted vehicle-to-everything communication strategies aligned with UAV energy efficiency. Particularly, we formulate an optimization problem of joint channel allocation, power control, and flight altitude adjustment in UAV-assisted vehicular networks. Taking CSI feedback delay into account, our objective is to maximize the vehicle-to-UAV communication sum rate while satisfying the UAV's long-term energy constraint. To this end, we first leverage Lyapunov optimization to decompose the original long-term problem into a series of per-slot deterministic subproblems. We then propose a diffusion-based deep deterministic policy gradient (D3PG) algorithm, which innovatively integrates diffusion models to determine optimal channel allocation, power control, and flight altitude adjustment decisions. Through extensive simulations using real-world vehicle mobility traces, we demonstrate the superior performance of the proposed D3PG algorithm compared to existing benchmark solutions.
中文: 本文针对低空无人机辅助车辆网络中的关键挑战,提出了一种基于扩散模型的深度强化学习算法,在保障无人机长期能耗效率的同时优化通信性能。
English: This paper addresses key challenges in low-altitude UAV-assisted vehicular networks by proposing a diffusion-based deep reinforcement learning algorithm that optimizes communication performance while ensuring long-term UAV energy efficiency.
Authors:Shuo Cao, Nan Ma, Jiayang Li, Xiaohui Li, Lihao Shao, Kaiwen Zhu, Yu Zhou, Yuandong Pu, Jiarui Wu, Jiaquan Wang, Bo Qu, Wenhai Wang, Yu Qiao, Dajuin Yao, Yihao Liu
Abstract:
The rapid advancement of educational applications, artistic creation, and AI-generated content (AIGC) technologies has substantially increased practical requirements for comprehensive Image Aesthetics Assessment (IAA), particularly demanding methods capable of delivering both quantitative scoring and professional understanding. Multimodal Large Language Model (MLLM)-based IAA methods demonstrate stronger perceptual and generalization capabilities compared to traditional approaches, yet they suffer from modality bias (score-only or text-only) and lack fine-grained attribute decomposition, thereby failing to support further aesthetic assessment. In this paper, we present:(1) ArtiMuse, an innovative MLLM-based IAA model with Joint Scoring and Expert-Level Understanding capabilities; (2) ArtiMuse-10K, the first expert-curated image aesthetic dataset comprising 10,000 images spanning 5 main categories and 15 subcategories, each annotated by professional experts with 8-dimensional attributes analysis and a holistic score. Both the model and dataset will be made public to advance the field.
Chinese: 随着AI生成内容和教育应用的快速发展,对能够同时提供定量评分和专业理解的图像美学评估方法需求激增,而当前多模态大语言模型因模态偏差和细粒度分析不足,无法满足这一需求,为此本文提出了具备联合评分与专家级理解能力的创新模型ArtiMuse及首个专家策划数据集ArtiMuse-10K。
English: The rapid growth of AI-generated content and educational applications has heightened the need for advanced Image Aesthetics Assessment (IAA) methods that can provide both quantitative scoring and expert-level understanding, which current multimodal large language models lack due to modality bias and insufficient fine-grained analysis.
Authors:Hang Lv, Sheng Liang, Hao Wang, Hongchao Gu, Yaxiong Wu, Wei Guo, Defu Lian, Yong Liu, Enhong Chen
Abstract:
Personalized text generation has become crucial for adapting language models to diverse and evolving users' personal context across cultural, temporal, and contextual dimensions. While existing methods often rely on centralized fine-tuning or static preference alignment, they struggle to achieve real-time adaptation under resource constraints inherent to personal devices. This limitation creates a dilemma: large cloud-based models lack access to localized user-specific information, while small on-device models cannot match the generation quality of their cloud counterparts. To address this dichotomy, we present CoSteer, a novel collaborative framework that enables decoding-time personalization through localized delta steering. Our key insight lies in leveraging the logits difference between personal context-aware and -agnostic outputs from local small models as steering signals for cloud-based LLMs. Specifically, we formulate token-level optimization as an online learning problem, where local delta vectors dynamically adjust the remote LLM's logits within the on-device environment. This approach preserves privacy by transmitting only the final steered tokens rather than raw data or intermediate vectors, while maintaining cloud-based LLMs' general capabilities without fine-tuning. Through comprehensive experiments on various personalized generation tasks, we demonstrate that CoSteer effectively assists LLMs in generating personalized content by leveraging locally stored user profiles and histories, ensuring privacy preservation through on-device data processing while maintaining acceptable computational overhead.
中文: CoSteer是一种创新协作框架,通过本地小型模型计算用户情境的差异信号来实时引导云端大语言模型生成个性化内容,既保护隐私又无需微调。
English: CoSteer is a collaborative framework that enables real-time personalized text generation by using local small models to compute delta signals from user context, which steer cloud-based large language models' outputs without compromising privacy or requiring fine-tuning.
Authors:Hang Lv, Sheng Liang, Hao Wang, Hongchao Gu, Yaxiong Wu, Wei Guo, Defu Lian, Yong Liu, Enhong Chen
Abstract:
Personalized text generation has become crucial for adapting language models to diverse and evolving users' personal context across cultural, temporal, and contextual dimensions. While existing methods often rely on centralized fine-tuning or static preference alignment, they struggle to achieve real-time adaptation under resource constraints inherent to personal devices. This limitation creates a dilemma: large cloud-based models lack access to localized user-specific information, while small on-device models cannot match the generation quality of their cloud counterparts. To address this dichotomy, we present CoSteer, a novel collaborative framework that enables decoding-time personalization through localized delta steering. Our key insight lies in leveraging the logits difference between personal context-aware and -agnostic outputs from local small models as steering signals for cloud-based LLMs. Specifically, we formulate token-level optimization as an online learning problem, where local delta vectors dynamically adjust the remote LLM's logits within the on-device environment. This approach preserves privacy by transmitting only the final steered tokens rather than raw data or intermediate vectors, while maintaining cloud-based LLMs' general capabilities without fine-tuning. Through comprehensive experiments on various personalized generation tasks, we demonstrate that CoSteer effectively assists LLMs in generating personalized content by leveraging locally stored user profiles and histories, ensuring privacy preservation through on-device data processing while maintaining acceptable computational overhead.
中文: CoSteer是一种创新协作框架,通过本地小型模型计算用户情境的差异信号来实时引导云端大语言模型生成个性化内容,既保护隐私又无需微调。
English: CoSteer is a collaborative framework that enables real-time personalized text generation by using local small models to compute delta signals from user context, which steer cloud-based large language models' outputs without compromising privacy or requiring fine-tuning.
Authors:Gonzalo Mancera, Aythami Morales, Julian Fierrez, Ruben Tolosana, Alejandro Penna, Miguel Lopez-Duran, Francisco Jurado, Alvaro Ortigosa
Abstract:
The use of Natural Language Processing (NLP) in highstakes AI-based applications has increased significantly in recent years, especially since the emergence of Large Language Models (LLMs). However, despite their strong performance, LLMs introduce important legal/ ethical concerns, particularly regarding privacy, data protection, and transparency. Due to these concerns, this work explores the use of Named- Entity Recognition (NER) to facilitate the privacy-preserving training (or adaptation) of LLMs. We propose a framework that uses NER technologies to anonymize sensitive information in text data, such as personal identities or geographic locations. An evaluation of the proposed privacy-preserving learning framework was conducted to measure its impact on user privacy and system performance in a particular high-stakes and sensitive setup: AI-based resume scoring for recruitment processes. The study involved two language models (BERT and RoBERTa) and six anonymization algorithms (based on Presidio, FLAIR, BERT, and different versions of GPT) applied to a database of 24,000 candidate profiles. The findings indicate that the proposed privacy preservation techniques effectively maintain system performance while playing a critical role in safeguarding candidate confidentiality, thus promoting trust in the experimented scenario. On top of the proposed privacy-preserving approach, we also experiment applying an existing approach that reduces the gender bias in LLMs, thus finally obtaining our proposed Privacyand Bias-aware LLMs (PBa-LLMs). Note that the proposed PBa-LLMs have been evaluated in a particular setup (resume scoring), but are generally applicable to any other LLM-based AI application.
中文: 本研究提出了一种基于命名实体识别的隐私保护框架,通过匿名化大型语言模型中的敏感数据,在简历评分等高风险应用中既有效维护系统性能又保障数据机密性。
English: This research introduces a privacy-preserving framework using Named-Entity Recognition to anonymize sensitive data in Large Language Models, effectively maintaining system performance while safeguarding confidentiality in high-stakes applications like AI resume scoring.
Authors:Zuguang Li, Wen Wu, Shaohua Wu, Songge Zhang, Ye Wang, Xuemin, Shen
Abstract:
Split learning (SL) has emerged as a computationally efficient approach for artificial intelligence (AI) model training, which can alleviate device-side computational workloads. However, complex AI model architectures pose high computational complexity to obtain the optimal model splitting. In this paper, we represent an arbitrary AI model as a directed acyclic graph (DAG), and then reformulate the optimal model splitting problem as a minimum s-t cut search problem. To solve the problem, we propose a fast DAG-based model splitting algorithm, which restructures the DAG to enable the optimal model splitting identification via a maximum flow method. Theoretical analysis indicates that the proposed algorithm is optimal. Furthermore, considering AI models with block structures, we propose a block-wise model splitting algorithm to reduce computational complexity. The algorithm abstracts each block, i.e., a component consisting of multiple layers, into a single vertex, thereby obtaining the optimal model splitting via a simplified DAG. Extensive experimental results demonstrate that the proposed algorithms can determine the optimal model splitting within milliseconds, as well as reduce training delay by 24.62%-38.95% in dynamic edge networks as compared to the state-of-the-art benchmarks.
Chinese: 本文提出了一种基于有向无环图的快速模型分割算法,通过最大流方法实现最优模型分割,在边缘网络中显著降低了训练延迟。
English: This paper presents a fast DAG-based model splitting algorithm that optimally identifies split points in AI models using maximum flow methods, significantly reducing training delays in edge networks.
Authors:Chaoqun Yang, Xinyu Lin, Wenjie Wang, Yongqi Li, Teng Sun, Xianjing Han, Tat-Seng Chua
Abstract:
Large Language Model-based generative recommendation (LLMRec) has achieved notable success, but it suffers from high inference latency due to massive computational overhead and memory pressure of KV Cache. Existing KV Cache reduction methods face critical limitations: cache compression offers marginal acceleration given recommendation tasks' short decoding steps, while prompt compression risks discarding vital interaction history. Through systematic analysis of attention patterns in LLMRec, we uncover two pivotal insights: 1) layer-wise attention sparsity inversion where early layers retain dense informative patterns while later layers exhibit high redundancy, and 2) dual attention sinks phenomenon where attention scores concentrate on both head and tail tokens of input sequences. Motivated by these insights, we propose EARN, an efficient inference framework that leverages the early layers to compress information into register tokens placed at the input sequence boundaries, then focuses solely on these tokens in the subsequent layers. Extensive experiments on three datasets, two LLMRec methods and two LLM architectures demonstrate EARN's superiority, achieving up to 3.79x speedup and 80.8% KV Cache reduction with better accuracy than the general finetuning approach. Our work bridges the efficiency-effectiveness gap in LLMRec, offering practical deployment advantages for industrial scenarios.
中文摘要:EARN是一种高效推理框架,通过利用早期层注意力模式将信息压缩至输入序列边界的寄存器标记中,有效解决了基于大语言模型的生成推荐系统因KV缓存导致的高延迟问题,在保持精度的同时实现了显著加速和缓存削减。
English Summary: EARN is an efficient inference framework that addresses high latency in LLM-based recommendation systems by leveraging early-layer attention patterns to compress information into register tokens, achieving significant speedup and KV Cache reduction while maintaining accuracy.
Authors:Jian Yang, Wei Zhang, Shukai Liu, Linzheng Chai, Yingshui Tan, Jiaheng Liu, Ge Zhang, Wangchunshu Zhou, Guanglin Niu, Zhoujun Li, Binyuan Hui, Junyang Lin
Abstract:
Code large language models (Code LLMs) have made significant progress in code generation by translating natural language descriptions into functional code; however, real-world applications often demand stricter adherence to detailed requirements such as coding style, line count, and structural constraints, beyond mere correctness. To address this, the paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs in controlled code generation, ensuring outputs align more closely with human-defined guidelines. The authors further present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages (Python, Java, JavaScript, TypeScript, Shell, C++, and C#), with each sample featuring both Chinese and English queries. Unlike existing benchmarks, IFEvalCode decouples evaluation into two metrics: correctness (Corr.) and instruction-following (Instr.), enabling a more nuanced assessment. Experiments on over 40 LLMs reveal that closed-source models outperform open-source ones in controllable code generation and highlight a significant gap between the models' ability to generate correct code versus code that precisely follows instructions.
中文摘要:本文提出前向和后向约束生成方法以提升代码大模型在可控代码生成中的指令遵循能力,并推出多语言基准IFEvalCode,实验表明闭源模型优于开源模型,且模型生成正确代码与精确遵循指令之间存在显著差距。
English Summary: This paper introduces forward and backward constraints generation to enhance Code LLMs' instruction-following in controlled code generation and presents IFEvalCode, a multilingual benchmark revealing that closed-source models outperform open-source ones while showing a gap between generating correct code and precisely following instructions.
Authors:Yukai Shi, Jiarong Ou, Rui Chen, Haotian Yang, Jiahao Wang, Xin Tao, Pengfei Wan, Di Zhang, Kun Gai
Abstract:
In visual generation tasks, the responses and combinations of complex concepts often lack stability and are error-prone, which remains an under-explored area. In this paper, we attempt to explore the causal factors for poor concept responses through elaborately designed experiments. We also design a concept-wise equalization loss function (IMBA loss) to address this issue. Our proposed method is online, eliminating the need for offline dataset processing, and requires minimal code changes. In our newly proposed complex concept benchmark Inert-CompBench and two other public test sets, our method significantly enhances the concept response capability of baseline models and yields highly competitive results with only a few codes.
中文摘要:本文提出了一种概念均衡损失函数(IMBA损失),通过在线方式有效解决了视觉生成任务中复杂概念响应不稳定的问题,在基准测试中仅需少量代码修改即可显著提升模型性能。
English Summary: This paper introduces a concept-wise equalization loss function (IMBA loss) to address the instability and errors in complex concept responses for visual generation tasks, achieving significant improvements in benchmark tests with minimal code modifications.
Authors:Linzheng Chai, Jian Yang, Shukai Liu, Wei Zhang, Liran Wang, Ke Jin, Tao Sun, Congnan Liu, Chenchen Zhang, Hualei Zhu, Jiaheng Liu, Xianjie Wu, Ge Zhang, Tianyu Liu, Zhoujun Li
Abstract:
The rapid advancement of Large Language Models (LLMs) has significantly improved code generation, yet most models remain text-only, neglecting crucial visual aids like diagrams and flowcharts used in real-world software development. To bridge this gap, we introduce MM-Coder, a Multilingual Multimodal software developer. MM-Coder integrates visual design inputs-Unified Modeling Language (UML) diagrams and flowcharts (termed Visual Workflow)-with textual instructions to enhance code generation accuracy and architectural alignment. To enable this, we developed MMc-Instruct, a diverse multimodal instruction-tuning dataset including visual-workflow-based code generation, allowing MM-Coder to synthesize textual and graphical information like human developers, distinct from prior work on narrow tasks. Furthermore, we introduce MMEval, a new benchmark for evaluating multimodal code generation, addressing existing text-only limitations. Our evaluations using MMEval highlight significant remaining challenges for models in precise visual information capture, instruction following, and advanced programming knowledge. Our work aims to revolutionize industrial programming by enabling LLMs to interpret and implement complex specifications conveyed through both text and visual designs.
中文:MM-Coder作为多语言多模态软件开发工具,通过整合UML图等视觉工作流与文本指令来提升代码生成准确性,并引入新数据集和评估基准以突破纯文本模型的局限。
English: MM-Coder is a multimodal software developer that integrates visual workflows like UML diagrams with text instructions to enhance code generation, addressing limitations of text-only models through a new dataset and benchmark.
Authors:Xiangru Tang, Tianrui Qin, Tianhao Peng, Ziyang Zhou, Daniel Shao, Tingting Du, Xinming Wei, Peng Xia, Fang Wu, He Zhu, Ge Zhang, Jiaheng Liu, Xingyao Wang, Sirui Hong, Chenglin Wu, Hao Cheng, Chi Wang, Wangchunshu Zhou
Abstract:
Current AI agents cannot effectively learn from each other's problem-solving experiences or use past successes to guide self-reflection and error correction in new tasks. We introduce Agent KB, a shared knowledge base that captures both high-level problem-solving strategies and detailed execution lessons, enabling knowledge transfer across agent frameworks. Agent KB implements a novel teacher-student dual-phase retrieval mechanism where student agents retrieve workflow-level patterns for strategic guidance while teacher agents identify execution-level patterns for refinement. This hierarchical approach enables agents to break out of limited reasoning pathways by incorporating diverse strategies from external sources. Evaluations on the GAIA benchmark demonstrate substantial performance gains, with Agent KB improving success rates by up to 6.06 percentage points overall under pass@1. For SWE-bench code repair tasks, our system significantly improved resolution rates, with o3-mini achieving an 8.67 percentage point gain (23 percent to 31.67 percent) in pass@1. Our ablation studies demonstrate that the refinement module proves most critical, with its removal causing a 3.85% drop on challenging Level 3 tasks, highlighting that effective knowledge transfer necessitates both strategic guidance and execution-level refinement.
Chinese: Agent KB通过共享知识库和双阶段检索机制,使AI智能体能够在不同框架间传递问题解决策略与执行经验,结合战略指导和细节优化,在GAIA和SWE-bench基准测试中显著提升了任务成功率。
English: Agent KB introduces a shared knowledge base with a dual-phase retrieval mechanism that enables AI agents to transfer problem-solving strategies and execution lessons across frameworks, significantly improving success rates on benchmarks like GAIA and SWE-bench through combined strategic guidance and refinement.
Authors:Zhicheng Zhang, Wuyou Xia, Chenxi Zhao, Zhou Yan, Xiaoqiang Liu, Yongjie Zhu, Wenyu Qin, Pengfei Wan, Di Zhang, Jufeng Yang
Abstract:
Multimodal large language models (MLLMs) recently showed strong capacity in integrating data among multiple modalities, empowered by a generalizable attention architecture. Advanced methods predominantly focus on language-centric tuning while less exploring multimodal tokens mixed through attention, posing challenges in high-level tasks that require fine-grained cognition and emotion understanding. In this work, we identify the attention deficit disorder problem in multimodal learning, caused by inconsistent cross-modal attention and layer-by-layer decayed attention activation. To address this, we propose a novel attention mechanism, termed MOdular Duplex Attention (MODA), simultaneously conducting the inner-modal refinement and inter-modal interaction. MODA employs a correct-after-align strategy to effectively decouple modality alignment from cross-layer token mixing. In the alignment phase, tokens are mapped to duplex modality spaces based on the basis vectors, enabling the interaction between visual and language modality. Further, the correctness of attention scores is ensured through adaptive masked attention, which enhances the model's flexibility by allowing customizable masking patterns for different modalities. Extensive experiments on 21 benchmark datasets verify the effectiveness of MODA in perception, cognition, and emotion tasks. Source code and demo are available in https://zzcheng.top/MODA.
中文: 本文提出MODA新型注意力机制,通过同时进行模态内优化和模态间交互来解决多模态学习中的注意力缺陷问题,并在21个基准数据集上得到实验验证。
English: This paper introduces MODA, a novel attention mechanism that addresses attention deficit in multimodal learning by enabling simultaneous inner-modal refinement and inter-modal interaction, with experimental validation across 21 benchmark datasets.
Authors:Amit Das, Md. Najib Hasan, Souvika Sarkar, Zheng Zhang, Fatemeh Jamshidi, Tathagata Bhattacharya, Nilanjana Raychawdhury, Dongji Feng, Vinija Jain, Aman Chadha
Abstract:
Large Language Models (LLMs) have demonstrated remarkable proficiency in generating text that closely resemble human writing. However, they often generate factually incorrect statements, a problem typically referred to as 'hallucination'. Addressing hallucination is crucial for enhancing the reliability and effectiveness of LLMs. While much research has focused on hallucinations in English, our study extends this investigation to conversational data in three languages: Hindi, Farsi, and Mandarin. We offer a comprehensive analysis of a dataset to examine both factual and linguistic errors in these languages for GPT-3.5, GPT-4o, Llama-3.1, Gemma-2.0, DeepSeek-R1 and Qwen-3. We found that LLMs produce very few hallucinated responses in Mandarin but generate a significantly higher number of hallucinations in Hindi and Farsi.
中文:多语言研究表明,大型语言模型在印地语和波斯语中的幻觉现象显著多于普通话,该结论基于对六种主流模型的跨语言事实性与语言学错误分析。
English: Large language models exhibit significantly higher hallucination rates in Hindi and Farsi compared to Mandarin, as revealed by a multilingual study analyzing factual and linguistic errors across six major LLMs.
Authors:Zhecheng Li, Guoxian Song, Yiwei Wang, Zhen Xiong, Junsong Yuan, Yujun Cai
Abstract:
Img2LaTeX is a practically significant task that involves converting mathematical expressions or tabular data from images into LaTeX code. In recent years, vision-language models (VLMs) have demonstrated strong performance across a variety of visual understanding tasks, owing to their generalization capabilities. While some studies have explored the use of VLMs for the Img2LaTeX task, their performance often falls short of expectations. Empirically, VLMs sometimes struggle with fine-grained visual elements, leading to inaccurate LaTeX predictions. To address this challenge, we propose $A^2R^2$: Advancing Img2LaTeX Conversion via Visual Reasoning with Attention-Guided Refinement, a framework that effectively integrates attention localization and iterative refinement within a visual reasoning framework, enabling VLMs to perform self-correction and progressively improve prediction quality. For effective evaluation, we introduce a new dataset, Img2LaTex-Hard-1K, consisting of 1,100 carefully curated and challenging examples designed to rigorously evaluate the capabilities of VLMs within this task domain. Extensive experimental results demonstrate that: (1) $A^2R^2$ significantly improves model performance across six evaluation metrics spanning both textual and visual levels, consistently outperforming other baseline methods; (2) Increasing the number of inference rounds yields notable performance gains, underscoring the potential of $A^2R^2$ in test-time scaling scenarios; (3) Ablation studies and human evaluations validate the practical effectiveness of our approach, as well as the strong synergy among its core components during inference.
中文:提出的$A^2R^2$框架通过结合注意力引导的视觉推理和迭代优化,显著提升了图像转LaTeX的转换精度,并在新构建的困难数据集上验证了其有效性。
English: The proposed $A^2R^2$ framework enhances Img2LaTeX conversion by integrating attention-guided visual reasoning and iterative refinement, significantly improving accuracy on a new challenging dataset.
Authors:Zhecheng Li, Guoxian Song, Yiwei Wang, Zhen Xiong, Junsong Yuan, Yujun Cai
Abstract:
Img2LaTeX is a practically important task that involves translating mathematical expressions and structured visual content from images into LaTeX code. In recent years, vision-language models (VLMs) have achieved remarkable progress across a range of visual understanding tasks, largely due to their strong generalization capabilities. However, despite initial efforts to apply VLMs to the Img2LaTeX task, their performance remains suboptimal. Empirical evidence shows that VLMs can be challenged by fine-grained visual elements, such as subscripts and superscripts in mathematical expressions, which results in inaccurate LaTeX generation. To address this challenge, we propose $A^2R^2$: Advancing Img2LaTeX Conversion via Visual Reasoning with Attention-Guided Refinement, a framework that effectively integrates attention localization and iterative refinement within a visual reasoning framework, enabling VLMs to perform self-correction and progressively improve LaTeX generation quality. For effective evaluation, we introduce a new dataset, Img2LaTex-Hard-1K, consisting of 1,100 carefully curated and challenging examples designed to rigorously evaluate the capabilities of VLMs within this task domain. Extensive experimental results demonstrate that: (1) $A^2R^2$ significantly improves model performance across various evaluation metrics spanning both textual and visual levels; (2) Increasing the number of inference rounds yields notable performance gains, underscoring the potential of $A^2R^2$ in test-time scaling scenarios; (3) Ablation studies and further evaluations confirm the effectiveness of our approach and the synergy of its core components during inference.
中文:提出的$A^2R^2$框架通过结合注意力引导的视觉推理和迭代优化,显著提升了图像转LaTeX的转换精度,并在新构建的困难数据集上验证了其有效性。
English: The proposed $A^2R^2$ framework enhances Img2LaTeX conversion by integrating attention-guided visual reasoning and iterative refinement, significantly improving accuracy on a new challenging dataset.
Authors:Shuaiyu Zhou, Zhengran Zeng, Xiaoling Zhou, Rui Xie, Shikun Zhang, Wei Ye
Abstract:
Unit tests play a vital role in the software development lifecycle. Recent advances in Large Language Model (LLM)-based approaches have significantly improved automated test generation, garnering attention from both academia and industry. We revisit LLM-based unit test generation from a novel perspective by decoupling prefix generation and assertion generation. To characterize their respective challenges, we define Initialization Complexity and adopt Cyclomatic Complexity to measure the difficulty of prefix and assertion generation, revealing that the former primarily affects compilation success, while the latter influences test coverage. To address these challenges, we propose Seed&Steer, a two-step approach that combines traditional unit testing techniques with the capabilities of large language models. Seed&Steer leverages conventional unit testing tools (e.g., EvoSuite) to generate method invocations with high compilation success rates, which serve as seeds to guide LLMs in constructing effective test contexts. It then introduces branching cues to help LLMs explore diverse execution paths (e.g., normal, boundary, and exception cases) and generate assertions with high coverage. We evaluate Seed&Steer on five real-world Java projects against state-of-the-art baselines. Results show that Seed&Steer improves the compilation pass rate by approximately 7%, successfully compiling 792 and 887 previously failing cases on two LLMs. It also achieves up to ~73% branch and line coverage across focal methods of varying complexity, with coverage improvements ranging from 1.09* to 1.26*. Our code, dataset, and experimental scripts will be publicly released to support future research and reproducibility.
Chinese: 本文提出Seed&Steer方法,通过将传统单元测试工具与大型语言模型相结合,采用解耦的前缀生成和断言生成两步策略,显著提升了测试编译通过率和代码覆盖率。
English: This paper introduces Seed&Steer, a novel two-step approach that combines traditional unit testing tools with large language models to enhance automated test generation by improving compilation success and test coverage through decoupled prefix and assertion generation.
Authors:Lingrui Mei, Jiayu Yao, Yuyao Ge, Yiwei Wang, Baolong Bi, Yujun Cai, Jiazhi Liu, Mingyu Li, Zhong-Zhi Li, Duzhen Zhang, Chenlin Zhou, Jiayi Mao, Tianze Xia, Jiafeng Guo, Shenghua Liu
Abstract:
The performance of Large Language Models (LLMs) is fundamentally determined by the contextual information provided during inference. This survey introduces Context Engineering, a formal discipline that transcends simple prompt design to encompass the systematic optimization of information payloads for LLMs. We present a comprehensive taxonomy decomposing Context Engineering into its foundational components and the sophisticated implementations that integrate them into intelligent systems. We first examine the foundational components: context retrieval and generation, context processing and context management. We then explore how these components are architecturally integrated to create sophisticated system implementations: retrieval-augmented generation (RAG), memory systems and tool-integrated reasoning, and multi-agent systems. Through this systematic analysis of over 1400 research papers, our survey not only establishes a technical roadmap for the field but also reveals a critical research gap: a fundamental asymmetry exists between model capabilities. While current models, augmented by advanced context engineering, demonstrate remarkable proficiency in understanding complex contexts, they exhibit pronounced limitations in generating equally sophisticated, long-form outputs. Addressing this gap is a defining priority for future research. Ultimately, this survey provides a unified framework for both researchers and engineers advancing context-aware AI.
中文摘要:本综述将语境工程确立为系统优化大语言模型信息输入的新兴学科,揭示了当前模型在复杂语境理解与长文本生成能力之间的显著不对称性,这一研究缺口将成为未来领域的核心攻关方向。
English Summary: This survey establishes Context Engineering as a systematic discipline for optimizing information provided to LLMs, revealing a critical gap between their advanced context understanding and limited long-form generation capabilities that defines future research priorities.
Authors:Xiaowei Yu, Jing Zhang, Tong Chen, Yan Zhuang, Minheng Chen, Chao Cao, Yanjun Lyu, Lu Zhang, Li Su, Tianming Liu, Dajiang Zhu
Abstract:
Lewy Body Disease (LBD) is a common yet understudied form of dementia that imposes a significant burden on public health. It shares clinical similarities with Alzheimer's disease (AD), as both progress through stages of normal cognition, mild cognitive impairment, and dementia. A major obstacle in LBD diagnosis is data scarcity, which limits the effectiveness of deep learning. In contrast, AD datasets are more abundant, offering potential for knowledge transfer. However, LBD and AD data are typically collected from different sites using different machines and protocols, resulting in a distinct domain shift. To effectively leverage AD data while mitigating domain shift, we propose a Transferability Aware Transformer (TAT) that adapts knowledge from AD to enhance LBD diagnosis. Our method utilizes structural connectivity (SC) derived from structural MRI as training data. Built on the attention mechanism, TAT adaptively assigns greater weights to disease-transferable features while suppressing domain-specific ones, thereby reducing domain shift and improving diagnostic accuracy with limited LBD data. The experimental results demonstrate the effectiveness of TAT. To the best of our knowledge, this is the first study to explore domain adaptation from AD to LBD under conditions of data scarcity and domain shift, providing a promising framework for domain-adaptive diagnosis of rare diseases.
中文摘要:本研究提出一种可迁移性感知变压器(TAT),通过抑制领域特异性特征并增强疾病可迁移特征,有效利用阿尔茨海默病数据来提升路易体痴呆的诊断准确性,为解决数据稀缺和领域偏移问题提供了创新框架。
English Summary: This study introduces a Transferability Aware Transformer (TAT) that effectively adapts knowledge from Alzheimer's disease data to improve Lewy Body Disease diagnosis by mitigating domain shift and leveraging transferable features, demonstrating promising results under data scarcity.
Authors:Minghang Zhu, Shen Gao, Zhengliang Shi, Jiabao Fang, Pengjie Ren, Zhaochun Ren, Zhumin Chen, Shuo Shang
Abstract:
A common training approach for language models involves using a large-scale language model to expand a human-provided dataset, which is subsequently used for model training.This method significantly reduces training costs by eliminating the need for extensive human data annotation. However, it still faces challenges such as high carbon emissions during data augmentation and the risk of data leakage when we use closed-source LLMs. To address these issues, we propose a self-evolution method for language models. First, we introduce the Multi-level Principle Generation, which enables a large-scale model to summarize task-completion principles based on a small amount of task data. Then, we propose the Principle-based Instance Generation, in which a smaller-scale language model uses these task principles to generate a large amount of data. This data is then used for model training. Experimental results show that our proposed method significantly improves model performance compared to directly using a smaller-scale language model to generate data. Additionally, since we only use the large-scale language model to generate the task-completion principles, the carbon emissions associated with training the model are greatly reduced.
中文摘要:本文提出一种语言模型自我进化方法,通过大模型生成任务原则、小模型产生训练数据,在显著提升模型性能的同时,相比传统数据增强方法大幅降低了碳排放。
English Summary: This paper introduces a self-evolution method for language models that uses a large model to generate task principles and a smaller model to produce training data, significantly improving performance while reducing carbon emissions compared to traditional data augmentation approaches.
Authors:Zhen Xiong, Yujun Cai, Zhecheng Li, Yiwei Wang
Abstract:
Diffusion models, originally developed for image generation, have emerged as a promising alternative to autoregressive large language models (LLMs). We present a theoretical analysis comparing autoregressive and masked diffusion LLMs, revealing that the intrinsic bidirectional attention mechanism of diffusion LLMs (dLLMs) enables superior context modeling and generation controllability. However, existing dLLM applications face significant challenges in controllable generation: the native multi-step denoising process exhibits high sensitivity to sequence length, elevated hallucination rates, and prohibitive inference costs without specialized optimizations. To address these limitations, we propose \textbf{S}elf-adaptive \textbf{S}chema \textbf{S}caffolding ($S^3$), a novel framework that enables dLLMs to generate structured outputs (e.g., JSON) while maintaining semantic fidelity and accelerating inference. Our approach injects the target schema structure into the output context, reducing unnecessary computation while improving controllability. Extensive experiments demonstrate that $S^3$ achieves substantial improvements: 65\% increase in structural adherence, 48\% enhancement in content fidelity, and 17\% reduction in hallucination rates compared to baseline. These results establish both theoretical foundations and practical pathways for deploying diffusion models in controllable text generation tasks. Code and data will be publicly released.
中文摘要:本研究提出自适应模式支架(S³)框架,利用扩散大语言模型的反向推理能力稳定生成结构化输出,为可控文本生成开辟了新途径。
English Summary: The study introduces Self-adaptive Schema Scaffolding (S³), a novel framework leveraging diffusion-based LLMs' reverse reasoning to reliably generate structured outputs, establishing new pathways for controllable generation.
Authors:Zhen Xiong, Yujun Cai, Zhecheng Li, Yiwei Wang
Abstract:
Controllable generation is a fundamental task in NLP with many applications, providing a basis for function calling to agentic communication. However, even state-of-the-art autoregressive Large Language Models (LLMs) today exhibit unreliability when required to generate structured output. Inspired by the current new diffusion-based large language models (dLLM), we realize that the architectural difference, especially the global information-sharing mechanism for language modeling, may be the key to unlock next-level controllable generation. To explore the possibility, we propose Self-adaptive Schema Scaffolding ($S^3$), a novel framework that enables dLLM to stably generate reliable structured outputs (e.g., JSON) by utilizing its innate reverse reasoning capability and global context awareness. $S^3$ initiates a schematic template directly in the output context as a starting state for dLLM, offering a more robust and general method than intricate prompt optimization. Experiments demonstrate that our method substantially unlocks the dLLM's potential in controllable generation in terms of structure adherence, content fidelity, and faithfulness. These results establish new perspectives and practical pathways for deploying language models in controllable generation tasks.
中文摘要:本研究提出自适应模式支架(S³)框架,利用扩散大语言模型的反向推理能力稳定生成结构化输出,为可控文本生成开辟了新途径。
English Summary: The study introduces Self-adaptive Schema Scaffolding (S³), a novel framework leveraging diffusion-based LLMs' reverse reasoning to reliably generate structured outputs, establishing new pathways for controllable generation.
Authors:Georgios Ioannides, Christos Constantinou, Vinija Jain, Aman Chadha, Aaron Elkins
Abstract:
As Artificial Intelligence systems evolve from monolithic models to ecosystems of specialized agents, the need for standardized communication protocols becomes increasingly critical. This paper introduces MOD-X (Modular Open Decentralized eXchange), a novel architectural framework proposal for agent interoperability that addresses key limitations of existing protocols. Unlike current approaches, MOD-X proposes a layered architecture with a Universal Message Bus, thorough state management, translation capabilities, and blockchain-based security mechanisms. We present MOD-X's architecture, compare it with existing protocols, and demonstrate its application through a worked example how it enables integration between heterogeneous specialist agents (agents with different architectures, vendors, capabilities, and knowledge representations--including rule-based systems, neural networks, symbolic reasoning engines, and legacy software with agent wrappers). MOD-X's key innovations include a publish-subscribe communication model, semantic capability discovery, and dynamic workflow orchestration--providing a framework that bridges theoretical formalism with practical implementation. This architecture addresses the growing need for truly decentralized, interoperable agent ecosystems that can scale effectively without the need for central coordination.
中文摘要:本文提出MOD-X这一新颖的分层架构框架,通过标准化通信协议、去中心化协调机制和集成安全特性,旨在实现异构人工智能代理之间的互操作性。
English Summary: This paper introduces MOD-X, a novel layered architectural framework designed to enable interoperability between heterogeneous AI agents through standardized communication protocols, decentralized coordination, and integrated security features.
Authors:Vipula Rawte, Rajarshi Roy, Gurpreet Singh, Danush Khanna, Yaswanth Narsupalli, Basab Ghosh, Abhay Gupta, Argha Kamal Samanta, Aditya Shingote, Aadi Krishna Vikram, Vinija Jain, Aman Chadha, Amit Sheth, Amitava Das
Abstract:
As Large Language Models (LLMs) continue to advance, Retrieval-Augmented Generation (RAG) has emerged as a vital technique to enhance factual accuracy by integrating external knowledge into the generation process. However, LLMs often fail to faithfully integrate retrieved evidence into their generated responses, leading to factual inconsistencies. To quantify this gap, we introduce Entity-Context Divergence (ECD), a metric that measures the extent to which retrieved information is accurately reflected in model outputs. We systematically evaluate contemporary LLMs on their ability to preserve factual consistency in retrieval-augmented settings, a capability we define as RAG-ability. Our empirical analysis reveals that RAG-ability remains low across most LLMs, highlighting significant challenges in entity retention and context fidelity. This paper introduces Radiant (Retrieval AugmenteD entIty-context AligNmenT), a novel framework that merges RAG with alignment designed to optimize the interplay between retrieved evidence and generated content. Radiant extends Direct Preference Optimization (DPO) to teach LLMs how to integrate provided additional information into subsequent generations. As a behavior correction mechanism, Radiant boosts RAG performance across varied retrieval scenarios, such as noisy web contexts, knowledge conflicts, and hallucination reduction. This enables more reliable, contextually grounded, and factually coherent content generation.
中文: 本文提出实体-上下文差异来衡量大语言模型在检索增强生成中的事实不一致性,并介绍Radiant框架,该框架通过优化对齐技术将检索证据与生成内容相结合,从而提升RAG性能。
English: The paper introduces Entity-Context Divergence to measure factual inconsistencies in LLMs using Retrieval-Augmented Generation and proposes Radiant, a framework that enhances RAG performance by aligning retrieved evidence with generated content through optimized alignment techniques.
Authors:Nifu Dan, Yujun Cai, Yiwei Wang
Abstract:
Navigating the complexities of physics reasoning has long been a difficult task for Large Language Models (LLMs), requiring a synthesis of profound conceptual understanding and adept problem-solving techniques. In this study, we investigate the application of advanced instruction-tuned reasoning models, such as Deepseek-R1, to address a diverse spectrum of physics problems curated from the challenging SciBench benchmark. Our comprehensive experimental evaluation reveals the remarkable capabilities of reasoning models. Not only do they achieve state-of-the-art accuracy in answering intricate physics questions, but they also generate distinctive reasoning patterns that emphasize on symbolic derivation. Furthermore, our findings indicate that even for these highly sophisticated reasoning models, the strategic incorporation of few-shot prompting can still yield measurable improvements in overall accuracy, highlighting the potential for continued performance gains.
中文摘要:先进推理模型在复杂物理问题上展现出卓越的准确性和独特的符号推理模式,少量示例提示策略还能持续提升其解题能力。
English Summary: Advanced reasoning models like Deepseek-R1 demonstrate exceptional accuracy and unique symbolic reasoning patterns on complex physics problems, with few-shot prompting further enhancing their performance.
Authors:Hang Wu, Hongkai Chen, Yujun Cai, Chang Liu, Qingwen Ye, Ming-Hsuan Yang, Yiwei Wang
Abstract:
Grounding natural language queries in graphical user interfaces (GUIs) poses unique challenges due to the diversity of visual elements, spatial clutter, and the ambiguity of language. In this paper, we introduce DiMo-GUI, a training-free framework for GUI grounding that leverages two core strategies: dynamic visual grounding and modality-aware optimization. Instead of treating the GUI as a monolithic image, our method splits the input into textual elements and iconic elements, allowing the model to reason over each modality independently using general-purpose vision-language models. When predictions are ambiguous or incorrect, DiMo-GUI dynamically focuses attention by generating candidate focal regions centered on the model's initial predictions and incrementally zooms into subregions to refine the grounding result. This hierarchical refinement process helps disambiguate visually crowded layouts without the need for additional training or annotations. We evaluate our approach on standard GUI grounding benchmarks and demonstrate consistent improvements over baseline inference pipelines, highlighting the effectiveness of combining modality separation with region-focused reasoning.
中文: DiMo-GUI是一种无需训练的图形界面定位框架,通过动态聚焦视觉区域并优化文本与图标模态处理,有效提升复杂界面中的定位精度且无需额外训练。
English: DiMo-GUI is a training-free framework that enhances GUI grounding by dynamically focusing on visual regions and optimizing across text and icon modalities, improving accuracy in cluttered interfaces without additional training.
Authors:Tobias Rueckert, David Rauber, Raphaela Maerkl, Leonard Klausmann, Suemeyye R. Yildiran, Max Gutbrod, Danilo Weber Nunes, Alvaro Fernandez Moreno, Imanol Luengo, Danail Stoyanov, Nicolas Toussaint, Enki Cho, Hyeon Bae Kim, Oh Sung Choo, Ka Young Kim, Seong Tae Kim, Gonçalo Arantes, Kehan Song, Jianjun Zhu, Junchen Xiong, Tingyi Lin, Shunsuke Kikuchi, Hiroki Matsuzaki, Atsushi Kouno, João Renato Ribeiro Manesco, João Paulo Papa, Tae-Min Choi, Tae Kyeong Jeong, Juyoun Park, Oluwatosin Alabi, Meng Wei, Tom Vercauteren, Runzhi Wu, Mengya Xu, An Wang, Long Bai, Hongliang Ren, Amine Yamlahi, Jakob Hennighausen, Lena Maier-Hein, Satoshi Kondo, Satoshi Kasai, Kousuke Hirasawa, Shu Yang, Yihui Wang, Hao Chen, Santiago RodrÃguez, Nicolás Aparicio, Leonardo Manrique, Juan Camilo Lyons, Olivia Hosie, Nicolás Ayobi, Pablo Arbeláez, Yiping Li, Yasmina Al Khalil, Sahar Nasirihaghighi, Stefanie Speidel, Daniel Rueckert, Hubertus Feussner, Dirk Wilhelm, Christoph Palm
Abstract:
Reliable recognition and localization of surgical instruments in endoscopic video recordings are foundational for a wide range of applications in computer- and robot-assisted minimally invasive surgery (RAMIS), including surgical training, skill assessment, and autonomous assistance. However, robust performance under real-world conditions remains a significant challenge. Incorporating surgical context - such as the current procedural phase - has emerged as a promising strategy to improve robustness and interpretability.
To address these challenges, we organized the Surgical Procedure Phase, Keypoint, and Instrument Recognition (PhaKIR) sub-challenge as part of the Endoscopic Vision (EndoVis) challenge at MICCAI 2024. We introduced a novel, multi-center dataset comprising thirteen full-length laparoscopic cholecystectomy videos collected from three distinct medical institutions, with unified annotations for three interrelated tasks: surgical phase recognition, instrument keypoint estimation, and instrument instance segmentation. Unlike existing datasets, ours enables joint investigation of instrument localization and procedural context within the same data while supporting the integration of temporal information across entire procedures.
We report results and findings in accordance with the BIAS guidelines for biomedical image analysis challenges. The PhaKIR sub-challenge advances the field by providing a unique benchmark for developing temporally aware, context-driven methods in RAMIS and offers a high-quality resource to support future research in surgical scene understanding.
中文:PhaKIR子挑战通过提供包含统一标注的多中心数据集,支持手术阶段识别、器械关键点估计和分割的联合研究,以推动机器人辅助手术中情境感知方法的发展。
English: The PhaKIR sub-challenge introduces a novel multi-center dataset with unified annotations for surgical phase recognition, instrument keypoint estimation, and segmentation to advance context-driven methods in robot-assisted surgery.
Authors:Rui Tang, Haochen Yin, Guankun Wang, Long Bai, An Wang, Huxin Gao, Jiazheng Wang, Hongliang Ren
Abstract:
Surgical phase recognition plays a critical role in developing intelligent assistance systems for minimally invasive procedures such as Endoscopic Submucosal Dissection (ESD). However, the high visual similarity across different phases and the lack of structural cues in RGB images pose significant challenges. Depth information offers valuable geometric cues that can complement appearance features by providing insights into spatial relationships and anatomical structures. In this paper, we pioneer the use of depth information for surgical phase recognition and propose Geo-RepNet, a geometry-aware convolutional framework that integrates RGB image and depth information to enhance recognition performance in complex surgical scenes. Built upon a re-parameterizable RepVGG backbone, Geo-RepNet incorporates the Depth-Guided Geometric Prior Generation (DGPG) module that extracts geometry priors from raw depth maps, and the Geometry-Enhanced Multi-scale Attention (GEMA) to inject spatial guidance through geometry-aware cross-attention and efficient multi-scale aggregation. To evaluate the effectiveness of our approach, we construct a nine-phase ESD dataset with dense frame-level annotations from real-world ESD videos. Extensive experiments on the proposed dataset demonstrate that Geo-RepNet achieves state-of-the-art performance while maintaining robustness and high computational efficiency under complex and low-texture surgical environments.
中文摘要:本文提出Geo-RepNet几何感知框架,通过融合RGB图像与深度信息解决手术阶段识别中的视觉相似性难题,在真实ESD数据集上实现最优性能并保持高效计算。
English Summary: This paper introduces Geo-RepNet, a geometry-aware framework that integrates RGB and depth information to overcome visual similarity challenges in surgical phase recognition, achieving state-of-the-art performance on a real-world ESD dataset while maintaining computational efficiency.
Authors:Jiajia Guo, Chao-Kai Wen, Xiao Li, Shi Jin
Abstract:
To reduce channel acquisition overhead, spatial, time, and frequency-domain channel extrapolation techniques have been widely studied. In this paper, we propose a novel deep learning-based Position-domain Channel Extrapolation framework (named PCEnet) for cell-free massive multiple-input multiple-output (MIMO) systems. The user's position, which contains significant channel characteristic information, can greatly enhance the efficiency of channel acquisition. In cell-free massive MIMO, while the propagation environments between different base stations and a specific user vary and their respective channels are uncorrelated, the user's position remains constant and unique across all channels. Building on this, the proposed PCEnet framework leverages the position as a bridge between channels to establish a mapping between the characteristics of different channels, thereby using one acquired channel to assist in the estimation and feedback of others. Specifically, this approach first utilizes neural networks (NNs) to infer the user's position from the obtained channel. {The estimated position, shared among BSs through a central processing unit (CPU)}, is then fed into an NN to design pilot symbols and concatenated with the feedback information to the channel reconstruction NN to reconstruct other channels, thereby significantly enhancing channel acquisition performance. Additionally, we propose a simplified strategy where only the estimated position is used in the reconstruction process without modifying the pilot design, thereby reducing latency. Furthermore, we introduce a position label-free approach that infers the relative user position instead of the absolute position, eliminating the need for ground truth position labels during the localization NN training. Simulation results demonstrate that the proposed PCEnet framework reduces pilot and feedback overheads by up to 50%.
中文: 本文提出PCEnet深度学习框架,通过利用用户位置作为不同信道间的桥梁进行信道外推,在无蜂窝大规模MIMO系统中采用位置感知的信道重建技术,将导频和反馈开销降低了高达50%。
English: This paper introduces PCEnet, a deep learning framework that leverages user position as a bridge to extrapolate channels in cell-free massive MIMO systems, reducing pilot and feedback overhead by up to 50% through innovative position-aware channel reconstruction techniques.
Authors:Xinjie Li, Xingyu Zhou, Yixiao Cao, Jing Zhang, Chao-Kai Wen, Xiao Li, Shi Jin
Abstract:
The superimposed pilot transmission scheme offers substantial potential for improving spectral efficiency in MIMO-OFDM systems, but it presents significant challenges for receiver design due to pilot contamination and data interference. To address these issues, we propose an advanced iterative receiver based on joint channel estimation, detection, and decoding, which refines the receiver outputs through iterative feedback. The proposed receiver incorporates two adaptive channel estimation strategies to enhance robustness under time-varying and mismatched channel conditions. First, a variational message passing (VMP) method and its low-complexity variant (VMP-L) are introduced to perform inference without relying on time-domain correlation. Second, a deep learning (DL) based estimator is developed, featuring a convolutional neural network with a despreading module and an attention mechanism to extract and fuse relevant channel features. Extensive simulations under multi-stream and high-mobility scenarios demonstrate that the proposed receiver consistently outperforms conventional orthogonal pilot baselines in both throughput and block error rate. Moreover, over-the-air experiments validate the practical effectiveness of the proposed design. Among the methods, the DL based estimator achieves a favorable trade-off between performance and complexity, highlighting its suitability for real-world deployment in dynamic wireless environments.
中文: 本文提出了一种用于MIMO-OFDM系统的先进迭代接收机,采用联合信道估计、检测与解码及自适应策略,包括变分消息传递和基于深度学习的估计器,在仿真和实际测试中均展现出优于传统方法的吞吐量和误码率性能。
English: This paper introduces an advanced iterative receiver for MIMO-OFDM systems that employs joint channel estimation, detection, and decoding with adaptive strategies, including variational message passing and a deep learning-based estimator, demonstrating superior throughput and error rate performance over conventional methods in simulations and real-world tests.
Authors:Hang Que, Jie Yang, Tao Du, Shuqiang Xia, Chao-Kai Wen, Shi Jin
Abstract:
Simultaneous localization and mapping (SLAM) plays a critical role in integrated sensing and communication (ISAC) systems for sixth-generation (6G) millimeter-wave (mmWave) networks, enabling environmental awareness and precise user equipment (UE) positioning. While cooperative multi-user SLAM has demonstrated potential in leveraging distributed sensing, its application within multi-modal ISAC systems remains limited, particularly in terms of theoretical modeling and communication-layer integration. This paper proposes a novel multi-modal SLAM framework that addresses these limitations through three key contributions. First, a Bayesian estimation framework is developed for cooperative multi-user SLAM, along with a two-stage algorithm for robust radio map construction under dynamic and heterogeneous sensing conditions. Second, a multi-modal localization strategy is introduced, fusing SLAM results with camera-based multi-object tracking and inertial measurement unit (IMU) data via an error-aware model, significantly improving UE localization in multi-user scenarios. Third, a sensing-aided beam management scheme is proposed, utilizing global radio maps and localization data to generate UE-specific prior information for beam selection, thereby reducing inter-user interference and enhancing downlink spectral efficiency. Simulation results demonstrate that the proposed system improves radio map accuracy by up to 60%, enhances localization accuracy by 37.5%, and significantly outperforms traditional methods in both indoor and outdoor environments.
中文: 本文提出了一种面向6G毫米波网络的多模态SLAM框架,通过贝叶斯估计、多传感器融合和感知辅助波束管理,显著提升了协同多用户定位的精度和无线电地图构建质量。
English: This paper introduces a multi-modal SLAM framework for 6G mmWave networks that enhances cooperative multi-user positioning through Bayesian estimation, sensor fusion, and sensing-aided beam management, achieving significant improvements in radio map accuracy and localization performance.
Authors:Pengfei Zhou, Jie Xia, Xiaopeng Peng, Wangbo Zhao, Zilong Ye, Zekai Li, Suorong Yang, Jiadong Pan, Yuanxiang Chen, Ziqiao Wang, Kai Wang, Qian Zheng, Xiaojun Chang, Gang Pan, Shurong Dong, Kaipeng Zhang, Yang You
Abstract:
Traditional image editing typically relies on manual prompting, making it labor-intensive and inaccessible to individuals with limited motor control or language abilities. Leveraging recent advances in brain-computer interfaces (BCIs) and generative models, we propose LoongX, a hands-free image editing approach driven by multimodal neurophysiological signals. LoongX utilizes state-of-the-art diffusion models trained on a comprehensive dataset of 23,928 image editing pairs, each paired with synchronized electroencephalography (EEG), functional near-infrared spectroscopy (fNIRS), photoplethysmography (PPG), and head motion signals that capture user intent. To effectively address the heterogeneity of these signals, LoongX integrates two key modules. The cross-scale state space (CS3) module encodes informative modality-specific features. The dynamic gated fusion (DGF) module further aggregates these features into a unified latent space, which is then aligned with edit semantics via fine-tuning on a diffusion transformer (DiT). Additionally, we pre-train the encoders using contrastive learning to align cognitive states with semantic intentions from embedded natural language. Extensive experiments demonstrate that LoongX achieves performance comparable to text-driven methods (CLIP-I: 0.6605 vs. 0.6558; DINO: 0.4812 vs. 0.4636) and outperforms them when neural signals are combined with speech (CLIP-T: 0.2588 vs. 0.2549). These results highlight the promise of neural-driven generative models in enabling accessible, intuitive image editing and open new directions for cognitive-driven creative technologies. Datasets and code will be released to support future work and foster progress in this emerging area.
中文: LoongX利用多模态神经信号实现无需手动操作的图像编辑,通过扩散模型和跨模态融合技术,其性能媲美文本驱动方法,为认知驱动的创意技术开辟了新方向。
English: LoongX introduces a hands-free image editing system using multimodal neurophysiological signals like EEG and fNIRS, achieving performance comparable to text-based methods through advanced diffusion models and cross-modal fusion techniques.
Authors:Jiajia Guo, Peiwen Jiang, Chao-Kai Wen, Shi Jin, Jun Zhang
Abstract:
Accurate channel state information (CSI) is critical to the performance of wireless communication systems, especially with the increasing scale and complexity introduced by 5G and future 6G technologies. While artificial intelligence (AI) offers a promising approach to CSI acquisition and utilization, existing methods largely depend on task-specific neural networks (NNs) that require expert-driven design and large training datasets, limiting their generalizability and practicality. To address these challenges, we propose LVM4CSI, a general and efficient framework that leverages the structural similarity between CSI and computer vision (CV) data to directly apply large vision models (LVMs) pre-trained on extensive CV datasets to wireless tasks without any fine-tuning, in contrast to large language model-based methods that generally necessitate fine-tuning. LVM4CSI maps CSI tasks to analogous CV tasks, transforms complex-valued CSI into visual formats compatible with LVMs, and integrates lightweight trainable layers to adapt extracted features to specific communication objectives. We validate LVM4CSI through three representative case studies, including channel estimation, human activity recognition, and user localization. Results demonstrate that LVM4CSI achieves comparable or superior performance to task-specific NNs, including an improvement exceeding 9.61 dB in channel estimation and approximately 40% reduction in localization error. Furthermore, it significantly reduces the number of trainable parameters and eliminates the need for task-specific NN design.
中文:LVM4CSI是一种创新框架,通过将信道状态信息转换为视觉格式,将预训练的大型视觉模型直接应用于无线通信任务,无需特定任务神经网络或微调即可实现卓越性能。
English: LVM4CSI is a novel framework that applies pre-trained large vision models to wireless communication tasks by converting channel state information into visual formats, achieving superior performance without task-specific neural networks or fine-tuning.
Authors:Jie Yang, Chao-Kai Wen, Xiao Li, Shi Jin
Abstract:
Integrated sensing and communication (ISAC) is a key feature of future cellular systems, enabling applications such as intruder detection, monitoring, and tracking using the same infrastructure. However, its potential for structural health monitoring (SHM), which requires the detection of slow and subtle structural changes, remains largely unexplored due to challenges such as multipath interference and the need for ultra-high sensing precision. This study introduces a novel theoretical framework for SHM via ISAC by leveraging reconfigurable intelligent surfaces (RIS) as reference points in collaboration with base stations and users. By dynamically adjusting RIS phases to generate distinct radio signals that suppress background multipath interference, measurement accuracy at these reference points is enhanced. We theoretically analyze RIS-aided collaborative sensing in three-dimensional cellular networks using Fisher information theory, demonstrating how increasing observation time, incorporating additional receivers (even with self-positioning errors), optimizing RIS phases, and refining collaborative node selection can reduce the position error bound to meet SHM's stringent accuracy requirements. Furthermore, we develop a Bayesian inference model to identify structural states and validate damage detection probabilities. Both theoretical and numerical analyses confirm ISAC's capability for millimeter-level deformation detection, highlighting its potential for high-precision SHM applications.
中文: 本研究提出一种新型RIS辅助的集成感知通信框架,通过抑制多径干扰和提升测量精度,结合优化信号处理与协同感知机制,实现了毫米级形变检测的结构健康监测能力。
English: This study proposes a novel RIS-assisted ISAC framework that enhances structural health monitoring by suppressing multipath interference and improving measurement precision, enabling millimeter-level deformation detection through optimized signal processing and collaborative sensing.
Authors:Weicong Chen, Chao-Kai Wen, Wankai Tang, Xiao Li, Shi Jin
Abstract:
Reconfigurable intelligent surfaces (RISs) offer the unique capability to reshape the radio environment, thereby simplifying transmission schemes traditionally contingent on channel conditions. Joint spatial division and multiplexing (JSDM) emerges as a low-overhead transmission scheme for multi-user equipment (UE) scenarios, typically requiring complex matrix decomposition to achieve block-diagonalization of the effective channel matrix. In this study, we introduce an innovative JSDM design that leverages RISs to customize channels, thereby streamlining the overall procedures. By strategically positioning RISs at the discrete Fourier transform (DFT) directions of the base station (BS), we establish orthogonal line-of-sight links within the BS-RIS channel, enabling a straightforward pre-beamforming design. Based on UE grouping, we devise reflected beams of the RIS with optimized directions to mitigate inter-group interference in the RISs-UEs channel. An approximation of the channel cross-correlation coefficient is derived and serves as a foundation for the RISs-UEs association, further diminishing inter-group interference. Numerical results substantiate the efficacy of our RIS-customized JSDM in not only achieving effective channel block-diagonalization but also in significantly enhancing the sum spectral efficiency for multi-UE transmissions.
中文: 本研究提出了一种创新的联合空间分割与复用设计,利用可重构智能表面定制信道,通过建立正交视距链路和优化波束成形来简化流程并有效提升多用户传输的总频谱效率。
English: This study introduces a novel JSDM design that utilizes reconfigurable intelligent surfaces to customize channels, simplifying procedures and enhancing spectral efficiency by establishing orthogonal links and mitigating interference through optimized beamforming.
Authors:Beitao Chen, Xinyu Lyu, Lianli Gao, Jingkuan Song, Heng Tao Shen
Abstract:
By incorporating visual inputs, Multimodal Large Language Models (MLLMs) extend LLMs to support visual reasoning. However, this integration also introduces new vulnerabilities, making MLLMs susceptible to multimodal jailbreak attacks and hindering their safe deployment.Existing defense methods, including Image-to-Text Translation, Safe Prompting, and Multimodal Safety Tuning, attempt to address this by aligning multimodal inputs with LLMs' built-in safeguards.Yet, they fall short in uncovering root causes of multimodal vulnerabilities, particularly how harmful multimodal tokens trigger jailbreak in MLLMs? Consequently, they remain vulnerable to text-driven multimodal jailbreaks, often exhibiting overdefensive behaviors and imposing heavy training overhead.To bridge this gap, we present an comprehensive analysis of where, how and which harmful multimodal tokens bypass safeguards in MLLMs. Surprisingly, we find that less than 1% tokens in early-middle layers are responsible for inducing unsafe behaviors, highlighting the potential of precisely removing a small subset of harmful tokens, without requiring safety tuning, can still effectively improve safety against jailbreaks. Motivated by this, we propose Safe Prune-then-Restore (SafePTR), an training-free defense framework that selectively prunes harmful tokens at vulnerable layers while restoring benign features at subsequent layers.Without incurring additional computational overhead, SafePTR significantly enhances the safety of MLLMs while preserving efficiency. Extensive evaluations across three MLLMs and five benchmarks demonstrate SafePTR's state-of-the-art performance in mitigating jailbreak risks without compromising utility.
多模态大语言模型(MLLMs)易受有害令牌引发的越狱攻击,而提出的SafePTR框架通过选择性修剪特定层的有害令牌,无需额外训练即可显著提升安全性,同时保持模型效率和实用性。
Multimodal Large Language Models (MLLMs) are vulnerable to jailbreak attacks through harmful tokens, but the proposed SafePTR framework effectively enhances safety by pruning these tokens in specific layers without extra training, maintaining model efficiency and utility.
Authors:Huiqiang Chen, Tianqing Zhu, Xin Yu, Wanlei Zhou
Abstract:
Machine unlearning aims to remove the influence of specific samples from a trained model. A key challenge in this process is over-unlearning, where the model's performance on the remaining data significantly drops due to the change in the model's parameters. Existing unlearning algorithms depend on the remaining data to prevent this issue. As such, these methods are inapplicable in a more practical scenario, where only the unlearning samples are available (i.e., zero-shot unlearning). This paper presents a novel framework, ZS-PAG, to fill this gap. Our approach offers three key innovations: (1) we approximate the inaccessible remaining data by generating adversarial samples; (2) leveraging the generated samples, we pinpoint a specific subspace to perform the unlearning process, therefore preventing over-unlearning in the challenging zero-shot scenario; and (3) we consider the influence of the unlearning process on the remaining samples and design an influence-based pseudo-labeling strategy. As a result, our method further improves the model's performance after unlearning. The proposed method holds a theoretical guarantee, and experiments on various benchmarks validate the effectiveness and superiority of our proposed method over several baselines.
中文摘要:本文提出ZS-PAG框架,通过生成对抗样本来模拟不可获取的剩余数据,在特定子空间执行遗忘操作,并采用基于影响的伪标签策略,有效解决了零样本场景下的过度遗忘问题,显著提升了模型性能。
English summary: This paper introduces ZS-PAG, a novel zero-shot machine unlearning framework that addresses over-unlearning by generating adversarial samples to approximate inaccessible data, identifying a specific subspace for unlearning, and employing influence-based pseudo-labeling to maintain model performance.
Authors:Jiaju Chen, Yuxuan Lu, Xiaojie Wang, Huimin Zeng, Jing Huang, Jiri Gesi, Ying Xu, Bingsheng Yao, Dakuo Wang
Abstract:
Nearly all human work is collaborative; thus, the evaluation of real-world NLP applications often requires multiple dimensions that align with diverse human perspectives. As real human evaluator resources are often scarce and costly, the emerging "LLM-as-a-judge" paradigm sheds light on a promising approach to leverage LLM agents to believably simulate human evaluators. Yet, to date, existing LLM-as-a-judge approaches face two limitations: persona descriptions of agents are often arbitrarily designed, and the frameworks are not generalizable to other tasks. To address these challenges, we propose MAJ-EVAL, a Multi-Agent-as-Judge evaluation framework that can automatically construct multiple evaluator personas with distinct dimensions from relevant text documents (e.g., research papers), instantiate LLM agents with the personas, and engage in-group debates with multi-agents to Generate multi-dimensional feedback. Our evaluation experiments in both the educational and medical domains demonstrate that MAJ-EVAL can generate evaluation results that better align with human experts' ratings compared with conventional automated evaluation metrics and existing LLM-as-a-judge methods.
Chinese: MAJ-EVAL框架通过从文档自动构建多样化评估者角色,并利用多智能体辩论生成多维反馈,解决了现有LLM评估方法的局限,其评估结果与人类专家评分更为一致。
English: The MAJ-EVAL framework addresses limitations in existing LLM-as-a-judge methods by automatically creating diverse evaluator personas from documents and using multi-agent debates to produce multi-dimensional feedback that aligns better with human expert ratings.
Authors:Changxin Tian, Kunlong Chen, Jia Liu, Ziqi Liu, Zhiqiang Zhang, Jun Zhou
Abstract:
Mixture-of-Experts (MoE) has become a dominant architecture for scaling Large Language Models (LLMs) efficiently by decoupling total parameters from computational cost. However, this decoupling creates a critical challenge: predicting the model capacity of a given MoE configurations (e.g., expert activation ratio and granularity) remains an unresolved problem. To address this gap, we introduce Efficiency Leverage (EL), a metric quantifying the computational advantage of an MoE model over a dense equivalent. We conduct a large-scale empirical study, training over 300 models up to 28B parameters, to systematically investigate the relationship between MoE architectural configurations and EL. Our findings reveal that EL is primarily driven by the expert activation ratio and the total compute budget, both following predictable power laws, while expert granularity acts as a non-linear modulator with a clear optimal range. We integrate these discoveries into a unified scaling law that accurately predicts the EL of an MoE architecture based on its configuration. To validate our derived scaling laws, we designed and trained Ling-mini-beta, a pilot model for Ling-2.0 series with only 0.85B active parameters, alongside a 6.1B dense model for comparison. When trained on an identical 1T high-quality token dataset, Ling-mini-beta matched the performance of the 6.1B dense model while consuming over 7x fewer computational resources, thereby confirming the accuracy of our scaling laws. This work provides a principled and empirically-grounded foundation for the scaling of efficient MoE models.
Chinese: 本研究提出效率杠杆(EL)作为衡量混合专家模型相对于密集模型计算优势的指标,揭示了可预测的扩展规律,并通过实验验证了仅用少量计算资源即可达到密集模型性能的可行性。
English: The study introduces Efficiency Leverage (EL) as a metric to quantify the computational benefits of Mixture-of-Experts models over dense equivalents, revealing predictable scaling laws and validating them through a pilot model that matches dense model performance with significantly reduced resources.
Authors:Changxin Tian, Jiapeng Wang, Qian Zhao, Kunlong Chen, Jia Liu, Ziqi Liu, Jiaxin Mao, Wayne Xin Zhao, Zhiqiang Zhang, Jun Zhou
Abstract:
Recent advances in learning rate (LR) scheduling have demonstrated the effectiveness of decay-free approaches that eliminate the traditional decay phase while maintaining competitive performance. Model merging techniques have emerged as particularly promising solutions in this domain. We present Warmup-Stable and Merge (WSM), a general framework that establishes a formal connection between learning rate decay and model merging. WSM provides a unified theoretical foundation for emulating various decay strategies-including cosine decay, linear decay and inverse square root decay-as principled model averaging schemes, while remaining fully compatible with diverse optimization methods. Through extensive experiments, we identify merge duration-the training window for checkpoint aggregation-as the most critical factor influencing model performance, surpassing the importance of both checkpoint interval and merge quantity. Our framework consistently outperforms the widely-adopted Warmup-Stable-Decay (WSD) approach across multiple benchmarks, achieving significant improvements of +3.5% on MATH, +2.9% on HumanEval, and +5.5% on MMLU-Pro. The performance advantages extend to supervised fine-tuning scenarios, highlighting WSM's potential for long-term model refinement.
中文:WSM框架通过将学习率衰减策略转化为原则性模型平均方案,建立了衰减与模型合并的理论桥梁,其中合并时长被确认为关键性能因素,在MATH(+3.5%)、HumanEval(+2.9%)和MMLU-Pro(+5.5%)等基准测试中持续超越WSD方法。
English: The WSM framework bridges learning rate decay and model merging by theoretically emulating decay strategies as principled averaging schemes, with merge duration identified as the key performance factor, consistently outperforming WSD across benchmarks including MATH (+3.5%), HumanEval (+2.9%), and MMLU-Pro (+5.5%).
Authors:Yilun Zhao, Weiyuan Chen, Zhijian Xu, Manasi Patwardhan, Yixin Liu, Chengye Wang, Lovekesh Vig, Arman Cohan
Abstract:
We introduce AbGen, the first benchmark designed to evaluate the capabilities of LLMs in designing ablation studies for scientific research. AbGen consists of 1,500 expert-annotated examples derived from 807 NLP papers. In this benchmark, LLMs are tasked with generating detailed ablation study designs for a specified module or process based on the given research context. Our evaluation of leading LLMs, such as DeepSeek-R1-0528 and o4-mini, highlights a significant performance gap between these models and human experts in terms of the importance, faithfulness, and soundness of the ablation study designs. Moreover, we demonstrate that current automated evaluation methods are not reliable for our task, as they show a significant discrepancy when compared to human assessment. To better investigate this, we develop AbGen-Eval, a meta-evaluation benchmark designed to assess the reliability of commonly used automated evaluation systems in measuring LLM performance on our task. We investigate various LLM-as-Judge systems on AbGen-Eval, providing insights for future research on developing more effective and reliable LLM-based evaluation systems for complex scientific tasks.
中文: AbGen是首个评估大语言模型设计科学研究消融实验能力的基准,揭示了DeepSeek-R1-0528等模型与人类专家间的显著性能差距,同时发现现有自动评估方法不可靠,为此开发了元评估基准AbGen-Eval。
English: AbGen is the first benchmark for evaluating LLMs' ability to design ablation studies in scientific research, revealing a significant performance gap between models like DeepSeek-R1-0528 and human experts while demonstrating the unreliability of current automated evaluation methods.
Authors:Xinyu Tang, Zhihao Lv, Xiaoxue Cheng, Junyi Li, Wayne Xin Zhao, Zujie Wen, Zhiqiang Zhang, Jun Zhou
Abstract:
Large language models (LLMs) have shown impressive abilities in leveraging pretrained knowledge through prompting, but they often struggle with unseen tasks, particularly in data-scarce scenarios. While cross-task in-context learning offers a direct solution for transferring knowledge across tasks, it still faces critical challenges in terms of robustness, scalability, and efficiency. In this paper, we investigate whether cross-task transfer can be achieved via latent space steering without parameter updates or input expansion. Through an analysis of activation patterns in the latent space of LLMs, we observe that the enhanced activations induced by in-context examples have consistent patterns across different tasks. Inspired by these findings, we propose CAST, a novel Cross-task Activation Steering Transfer framework that enables effective transfer by manipulating the model's internal activation states. Our approach first selects influential and diverse samples from high-resource tasks, then utilizes their contrastive representation-enhanced activations to adapt LLMs to low-resource tasks. Extensive experiments across both cross-domain and cross-lingual transfer settings show that our method outperforms competitive baselines and demonstrates superior scalability and lower computational costs.
中文摘要:本文提出CAST框架,通过利用高资源任务的对比表征增强激活来引导大语言模型的内部状态,实现了无需参数更新的跨任务知识迁移,在多种实验场景下展现出优越性能和较低计算成本。
English Summary: The paper introduces CAST, a framework that enables cross-task knowledge transfer in large language models by steering their internal activations using contrastive representations from high-resource tasks, achieving improved performance with greater efficiency and scalability.
Authors:Yilun Zhao, Chengye Wang, Chuhan Li, Arman Cohan
Abstract:
This paper introduces MISS-QA, the first benchmark specifically designed to evaluate the ability of models to interpret schematic diagrams within scientific literature. MISS-QA comprises 1,500 expert-annotated examples over 465 scientific papers. In this benchmark, models are tasked with interpreting schematic diagrams that illustrate research overviews and answering corresponding information-seeking questions based on the broader context of the paper. We assess the performance of 18 frontier multimodal foundation models, including o4-mini, Gemini-2.5-Flash, and Qwen2.5-VL. We reveal a significant performance gap between these models and human experts on MISS-QA. Our analysis of model performance on unanswerable questions and our detailed error analysis further highlight the strengths and limitations of current models, offering key insights to enhance models in comprehending multimodal scientific literature.
中文: 本文推出首个评估科学文献示意图解读能力的基准MISS-QA,通过对18种前沿多模态模型的测试,发现其与人类专家存在显著性能差距,并为提升多模态科学文献理解提供了关键洞见。
English: This paper presents MISS-QA, the first benchmark for evaluating models' interpretation of scientific schematic diagrams, revealing a significant performance gap between 18 tested multimodal foundation models and human experts while providing key insights for improvement.
Authors:Wei Zhang, Jian Yang, Jiaxi Yang, Ya Wang, Zhoujun Li, Zeyu Cui, Binyuan Hui, Junyang Lin
Abstract:
Code large language models (LLMs) enhance programming by understanding and generating code across languages, offering intelligent feedback, bug detection, and code updates through reflection, improving development efficiency and accessibility. While benchmarks (e.g. HumanEval/LiveCodeBench) evaluate code generation and real-world relevance, previous works ignore the scenario of modifying code in repositories. Considering challenges remaining in improving reflection capabilities and avoiding data contamination in dynamic benchmarks, we introduce LiveRepoReflection, a challenging benchmark for evaluating code understanding and generation in multi-file repository contexts, featuring 1,888 rigorously filtered test cases across $6$ programming languages to ensure diversity, correctness, and high difficulty. Further, we create RepoReflection-Instruct, a large-scale, quality-filtered instruction-tuning dataset derived from diverse sources, used to train RepoReflectionCoder through a two-turn dialogue process involving code generation and error-driven repair. The leaderboard evaluates over 40 LLMs to reflect the model performance of repository-based code reflection.
中文: 代码大语言模型通过跨语言理解和生成代码推动编程发展,但现有基准忽略仓库代码修改场景,为此引入LiveRepoReflection评估多文件代码任务,并训练RepoReflectionCoder通过错误驱动修复来提升代码反思能力。
English: Code LLMs advance programming by generating and understanding code across languages, but existing benchmarks overlook repository code modification, prompting the creation of LiveRepoReflection for evaluating multi-file code tasks and RepoReflectionCoder trained via error-driven repair to enhance reflection capabilities.
Authors:Zhijian Xu, Yilun Zhao, Manasi Patwardhan, Lovekesh Vig, Arman Cohan
Abstract:
Peer review is fundamental to scientific research, but the growing volume of publications has intensified the challenges of this expertise-intensive process. While LLMs show promise in various scientific tasks, their potential to assist with peer review, particularly in identifying paper limitations, remains understudied. We first present a comprehensive taxonomy of limitation types in scientific research, with a focus on AI. Guided by this taxonomy, for studying limitations, we present LimitGen, the first comprehensive benchmark for evaluating LLMs' capability to support early-stage feedback and complement human peer review. Our benchmark consists of two subsets: LimitGen-Syn, a synthetic dataset carefully created through controlled perturbations of high-quality papers, and LimitGen-Human, a collection of real human-written limitations. To improve the ability of LLM systems to identify limitations, we augment them with literature retrieval, which is essential for grounding identifying limitations in prior scientific findings. Our approach enhances the capabilities of LLM systems to generate limitations in research papers, enabling them to provide more concrete and constructive feedback.
中文摘要:本研究提出LimitGen基准,通过文献检索和合成数据集评估大语言模型识别科研论文局限性的能力,以增强同行评审的辅助作用。
English Summary: The study introduces LimitGen, a benchmark to evaluate LLMs' ability to identify limitations in scientific papers, enhancing peer review through literature retrieval and synthetic datasets.
Authors:Yilun Zhao, Kaiyan Zhang, Tiansheng Hu, Sihong Wu, Ronan Le Bras, Taira Anderson, Jonathan Bragg, Joseph Chee Chang, Jesse Dodge, Matt Latzke, Yixin Liu, Charles McGrady, Xiangru Tang, Zihang Wang, Chen Zhao, Hannaneh Hajishirzi, Doug Downey, Arman Cohan
Abstract:
We present SciArena, an open and collaborative platform for evaluating foundation models on scientific literature tasks. Unlike traditional benchmarks for scientific literature understanding and synthesis, SciArena engages the research community directly, following the Chatbot Arena evaluation approach of community voting on model comparisons. By leveraging collective intelligence, SciArena offers a community-driven evaluation of model performance on open-ended scientific tasks that demand literature-grounded, long-form responses. The platform currently supports 23 open-source and proprietary foundation models and has collected over 13,000 votes from trusted researchers across diverse scientific domains. We analyze the data collected so far and confirm that the submitted questions are diverse, aligned with real-world literature needs, and that participating researchers demonstrate strong self-consistency and inter-annotator agreement in their evaluations. We discuss the results and insights based on the model ranking leaderboard. To further promote research in building model-based automated evaluation systems for literature tasks, we release SciArena-Eval, a meta-evaluation benchmark based on our collected preference data. The benchmark measures the accuracy of models in judging answer quality by comparing their pairwise assessments with human votes. Our experiments highlight the benchmark's challenges and emphasize the need for more reliable automated evaluation methods.
Chinese: SciArena是一个开放协作的科学文献任务评估平台,通过社区投票对基础模型进行评价,目前支持23个模型并收集了超过1.3万份研究者投票,同时发布了用于自动化评估的基准测试。
English: SciArena is an open, collaborative platform that evaluates foundation models on scientific literature tasks through community voting, currently supporting 23 models with over 13,000 researcher votes and releasing a benchmark for automated evaluation assessment.
Authors:Neagin Neasamoni Santhi, Davide Villa, Michele Polese, Tommaso Melodia
Abstract:
Ultra-dense fifth generation (5G) and beyond networks leverage spectrum sharing and frequency reuse to enhance throughput, but face unpredictable in-band uplink (UL) interference challenges that significantly degrade Signal to Interference plus Noise Ratio (SINR) at affected Next Generation Node Bases (gNBs). This is particularly problematic at cell edges, where overlapping regions force User Equipments (UEs) to increase transmit power, and in directional millimeter wave systems, where beamforming sidelobes can create unexpected interference. The resulting signal degradation disrupts protocol operations, including scheduling and resource allocation, by distorting quality indicators like Reference Signal Received Power (RSRP) and Received Signal Strength Indicator (RSSI), and can compromise critical functions such as channel state reporting and Hybrid Automatic Repeat Request (HARQ) acknowledgments. To address this problem, this article introduces InterfO-RAN, a real-time programmable solution that leverages a Convolutional Neural Network (CNN) to process In-phase and Quadrature (I/Q) samples in the gNB physical layer, detecting in-band interference with accuracy exceeding 91% in under 650 us. InterfO-RAN represents the first O-RAN dApp accelerated on Graphics Processing Unit (GPU), coexisting with the 5G NR physical layer processing of NVIDIA Aerial. Deployed in an end-to-end private 5G network with commercial Radio Units (RUs) and smartphones, our solution was trained and tested on more than 7 million NR UL slots collected from real-world environments, demonstrating robust interference detection capabilities essential for maintaining network performance in dense deployments.
中文: 超密集5G网络面临不可预测的上行链路干扰,会降低信号质量并中断协议操作,而InterfO-RAN提出了基于CNN的实时解决方案,在650微秒内实现超过91%的干扰检测准确率,并在实际部署中得到验证。
English: Ultra-dense 5G networks suffer from unpredictable uplink interference that degrades signal quality and disrupts operations, but InterfO-RAN introduces a real-time CNN-based solution achieving over 91% detection accuracy in under 650 microseconds, validated in real-world deployments.
Authors:Yuan Tian, Shuo Wang, Rongzhao Zhang, Zijian Chen, Yankai Jiang, Chunyi Li, Xiangyang Zhu, Fang Yan, Qiang Hu, XiaoSong Wang, Guangtao Zhai
Abstract:
Medical imaging has significantly advanced computer-aided diagnosis, yet its re-identification (ReID) risks raise critical privacy concerns, calling for de-identification (DeID) techniques. Unfortunately, existing DeID methods neither particularly preserve medical semantics, nor are flexibly adjustable towards different privacy levels. To address these issues, we propose a divide-and-conquer framework comprising two steps: (1) Identity-Blocking, which blocks varying proportions of identity-related regions, to achieve different privacy levels; and (2) Medical-Semantics-Compensation, which leverages pre-trained Medical Foundation Models (MFMs) to extract medical semantic features to compensate the blocked regions. Moreover, recognizing that features from MFMs may still contain residual identity information, we introduce a Minimum Description Length principle-based feature decoupling strategy, to effectively decouple and discard such identity components. Extensive evaluations against existing approaches across seven datasets and three downstream tasks, demonstrates our state-of-the-art performance.
Chinese: 本研究提出了一种新颖的分治框架,通过屏蔽身份相关区域实现可调节的隐私保护级别,同时利用预训练医学基础模型补偿医学语义,并在多个数据集和任务上的广泛评估证明了其卓越性能。
English: This study introduces a novel divide-and-conquer framework for medical image de-identification that blocks identity-related regions to achieve adjustable privacy levels while compensating for medical semantics using pre-trained models, with extensive evaluations demonstrating superior performance across multiple datasets and tasks.
Authors:Longwen Zhang, Qixuan Zhang, Haoran Jiang, Yinuo Bai, Wei Yang, Lan Xu, Jingyi Yu
Abstract:
3D creation has always been a unique human strength, driven by our ability to deconstruct and reassemble objects using our eyes, mind and hand. However, current 3D design tools struggle to replicate this natural process, requiring considerable artistic expertise and manual labor. This paper introduces BANG, a novel generative approach that bridges 3D generation and reasoning, allowing for intuitive and flexible part-level decomposition of 3D objects. At the heart of BANG is "Generative Exploded Dynamics", which creates a smooth sequence of exploded states for an input geometry, progressively separating parts while preserving their geometric and semantic coherence.
BANG utilizes a pre-trained large-scale latent diffusion model, fine-tuned for exploded dynamics with a lightweight exploded view adapter, allowing precise control over the decomposition process. It also incorporates a temporal attention module to ensure smooth transitions and consistency across time. BANG enhances control with spatial prompts, such as bounding boxes and surface regions, enabling users to specify which parts to decompose and how. This interaction can be extended with multimodal models like GPT-4, enabling 2D-to-3D manipulations for more intuitive and creative workflows.
The capabilities of BANG extend to generating detailed part-level geometry, associating parts with functional descriptions, and facilitating component-aware 3D creation and manufacturing workflows. Additionally, BANG offers applications in 3D printing, where separable parts are generated for easy printing and reassembly. In essence, BANG enables seamless transformation from imaginative concepts to detailed 3D assets, offering a new perspective on creation that resonates with human intuition.
中文: BANG提出了一种生成方法,通过“生成式爆炸动态”实现3D物体的直观部件级分解,让用户能够精确控制并将创意转化为精细的3D资源,优化了工作流程。
English: BANG introduces a generative approach that enables intuitive part-level decomposition of 3D objects through "Generative Exploded Dynamics," allowing users to transform concepts into detailed 3D assets with precise control and enhanced workflows.
Authors:Meishan Zhang, Xin Zhang, Xinping Zhao, Shouzheng Huang, Baotian Hu, Min Zhang
Abstract:
Text embeddings have attracted growing interest due to their effectiveness across a wide range of natural language processing (NLP) tasks, such as retrieval, classification, clustering, bitext mining, and summarization. With the emergence of pretrained language models (PLMs), general-purpose text embeddings (GPTE) have gained significant traction for their ability to produce rich, transferable representations. The general architecture of GPTE typically leverages PLMs to derive dense text representations, which are then optimized through contrastive learning on large-scale pairwise datasets. In this survey, we provide a comprehensive overview of GPTE in the era of PLMs, focusing on the roles PLMs play in driving its development. We first examine the fundamental architecture and describe the basic roles of PLMs in GPTE, i.e., embedding extraction, expressivity enhancement, training strategies, learning objectives, and data construction. Then, we describe advanced roles enabled by PLMs, such as multilingual support, multimodal integration, code understanding, and scenario-specific adaptation. Finally, we highlight potential future research directions that move beyond traditional improvement goals, including ranking integration, safety considerations, bias mitigation, structural information incorporation, and the cognitive extension of embeddings. This survey aims to serve as a valuable reference for both newcomers and established researchers seeking to understand the current state and future potential of GPTE.
中文: 本综述全面探讨了预训练语言模型时代下的通用文本嵌入,详细阐述了其基本架构、PLM驱动的开发作用及未来研究方向,旨在为研究人员提供重要参考。
English: This survey comprehensively examines general-purpose text embeddings (GPTE) in the era of pretrained language models, detailing their architecture, PLM-driven roles in development, and future research directions to serve as a key reference for researchers.
Authors:Yubin Chen, Xuyang Guo, Zhenmei Shi, Zhao Song, Jiahao Zhang
Abstract:
Text-to-video (T2V) models have shown remarkable performance in generating visually reasonable scenes, while their capability to leverage world knowledge for ensuring semantic consistency and factual accuracy remains largely understudied. In response to this challenge, we propose T2VWorldBench, the first systematic evaluation framework for evaluating the world knowledge generation abilities of text-to-video models, covering 6 major categories, 60 subcategories, and 1,200 prompts across a wide range of domains, including physics, nature, activity, culture, causality, and object. To address both human preference and scalable evaluation, our benchmark incorporates both human evaluation and automated evaluation using vision-language models (VLMs). We evaluated the 10 most advanced text-to-video models currently available, ranging from open source to commercial models, and found that most models are unable to understand world knowledge and generate truly correct videos. These findings point out a critical gap in the capability of current text-to-video models to leverage world knowledge, providing valuable research opportunities and entry points for constructing models with robust capabilities for commonsense reasoning and factual generation.
中文: T2VWorldBench作为首个系统性评估文本到视频模型世界知识生成能力的框架,发现当前多数模型在跨领域常识推理和事实准确性方面存在显著不足,为提升模型鲁棒性指明了研究方向。
English: T2VWorldBench is introduced as the first comprehensive framework to assess text-to-video models' ability to generate world knowledge, revealing that most current models struggle with semantic consistency and factual accuracy across diverse domains.
Authors:Xinping Zhao, Shouzheng Huang, Yan Zhong, Xinshuo Hu, Meishan Zhang, Baotian Hu, Min Zhang
Abstract:
Retrieval-Augmented Generation (RAG) effectively improves the accuracy of Large Language Models (LLMs). However, retrieval noises significantly impact the quality of LLMs' generation, necessitating the development of denoising mechanisms. Previous methods extract evidence straightforwardly without explicit thinking, which risks filtering out key clues and struggles with generalization. To this end, we propose EviOmni, which learns to extract rational evidence by (1) explicitly reasoning to identify potential cues within retrieval contents first, and then (2) consciously extracting to avoid omitting any key cues helpful for answering questions. Specifically, we frame evidence reasoning and evidence extraction into one unified response for end-to-end training; apply knowledge token masks for disentanglement to derive reasoning-based and extraction-based answers; and devise three types of verifiable reward functions, including answer, length, and format, to update the model via the policy optimization algorithm. Extensive experiments on three benchmark datasets show the effectiveness of EviOmni, providing compact and high-quality evidence, improving the accuracy of downstream tasks, and promoting effective application in online RAG systems.
中文摘要:EviOmni通过构建统一框架,先对检索内容进行显式推理识别关键线索,再通过意识提取保留重要证据,结合可验证奖励机制提升下游任务准确率,有效优化了检索增强生成的性能。
English Summary: EviOmni enhances Retrieval-Augmented Generation by introducing a unified framework that explicitly reasons about and extracts key evidence from retrieved content, improving downstream task accuracy through verifiable rewards and policy optimization.
Authors:Yule Li, Yifeng Lu, Zhen Wang, Zhewei Wei, Yaliang Li, Bolin Ding
Abstract:
In recent years, graph neural networks (GNN) have achieved unprecedented successes in node classification tasks. Although GNNs inherently encode specific inductive biases (e.g., acting as low-pass or high-pass filters), most existing methods implicitly assume conditional independence among node labels in their optimization objectives. While this assumption is suitable for traditional classification tasks such as image recognition, it contradicts the intuitive observation that node labels in graphs remain correlated, even after conditioning on the graph structure. To make structured predictions for node labels, we propose ReDiSC, namely, Reparameterized masked Diffusion model for Structured node Classification. ReDiSC estimates the joint distribution of node labels using a reparameterized masked diffusion model, which is learned through the variational expectation-maximization (EM) framework. Our theoretical analysis shows the efficiency advantage of ReDiSC in the E-step compared to DPM-SNC, a state-of-the-art model that relies on a manifold-constrained diffusion model in continuous domain. Meanwhile, we explicitly link ReDiSC's M-step objective to popular GNN and label propagation hybrid approaches. Extensive experiments demonstrate that ReDiSC achieves superior or highly competitive performance compared to state-of-the-art GNN, label propagation, and diffusion-based baselines across both homophilic and heterophilic graphs of varying sizes. Notably, ReDiSC scales effectively to large-scale datasets on which previous structured diffusion methods fail due to computational constraints, highlighting its significant practical advantage in structured node classification tasks.
中文: 提出的ReDiSC模型通过重参数化掩码扩散模型捕捉节点标签间的关联性,解决了传统图神经网络的条件独立性限制,在不同类型图表中实现了卓越性能和可扩展性。
English: The proposed ReDiSC model addresses the conditional independence limitation in traditional graph neural networks by using a reparameterized masked diffusion model to capture node label correlations, achieving superior performance and scalability across diverse graph types.
Authors:Michele Polese, Niloofar Mohamadi, Salvatore D'Oro, Tommaso Melodia
Abstract:
The proliferation of data-intensive Artificial Intelligence (AI) applications at the network edge demands a fundamental shift in RAN design, from merely consuming AI for network optimization, to actively enabling distributed AI workloads. This paradigm shift presents a significant opportunity for network operators to monetize AI at the edge while leveraging existing infrastructure investments. To realize this vision, this article presents a novel converged O-RAN and AI-RAN architecture that unifies orchestration and management of both telecommunications and AI workloads on shared infrastructure. The proposed architecture extends the Open RAN principles of modularity, disaggregation, and cloud-nativeness to support heterogeneous AI deployments. We introduce two key architectural innovations: (i) the AI-RAN Orchestrator, which extends the O-RAN Service Management and Orchestration (SMO) to enable integrated resource and allocation across RAN and AI workloads; and (ii) AI-RAN sites that provide distributed edge AI platforms with real-time processing capabilities. The proposed system supports flexible deployment options, allowing AI workloads to be orchestrated with specific timing requirements (real-time or batch processing) and geographic targeting. The proposed architecture addresses the orchestration requirements for managing heterogeneous workloads at different time scales while maintaining open, standardized interfaces and multi-vendor interoperability.
中文: 本文提出了一种融合O-RAN与AI-RAN的新型架构,通过在共享基础设施上统一编排电信与AI工作负载,引入AI-RAN编排器和分布式AI-RAN站点,支持具备实时处理能力的灵活部署方案。
English: This article introduces a novel converged O-RAN and AI-RAN architecture that enables unified orchestration of telecommunications and AI workloads on shared infrastructure, featuring an AI-RAN Orchestrator and distributed AI-RAN sites to support flexible deployments with real-time capabilities.
Authors:Yukun Ma, Jiayi Zhang, Ziheng Liu, Guowei Shi, Bo Ai
Abstract:
Cell-free massive multiple-input multiple-output (CF mMIMO) has emerged as a prominent candidate for future networks due to its ability to significantly enhance spectral efficiency by eliminating inter-cell interference. However, its practical deployment faces considerable challenges, such as high computational complexity and the optimization of its complex processing. To address these challenges, this correspondence proposes a framework based on a sparse multi-dimensional graph neural network (SP-MDGNN), which sparsifies the connections between access points (APs) and user equipments (UEs) to significantly reduce computational complexity while maintaining high performance. In addition, the weighted minimum mean square error (WMMSE) algorithm is introduced as a comparative method to further analyze the trade-off between performance and complexity. Simulation results demonstrate that the sparse method achieves an optimal balance between performance and complexity, significantly reducing the computational complexity of the original MDGNN method while incurring only a slight performance degradation, providing insights for the practical deployment of CF mMIMO systems in large-scale network.
中文: 提出的稀疏多维图神经网络框架在保持高性能的同时,显著降低了无蜂窝大规模MIMO系统的计算复杂度,为实际大规模部署提供了可行方案。
English: The proposed sparse multi-dimensional graph neural network (SP-MDGNN) framework effectively reduces computational complexity in cell-free massive MIMO systems while maintaining high performance, offering a viable solution for practical large-scale deployment.
Authors:Zhiyao Ren, Siyuan Liang, Aishan Liu, Dacheng Tao
Abstract:
In-context learning (ICL) has demonstrated remarkable success in large language models (LLMs) due to its adaptability and parameter-free nature. However, it also introduces a critical vulnerability to backdoor attacks, where adversaries can manipulate LLM behaviors by simply poisoning a few ICL demonstrations. In this paper, we propose, for the first time, the dual-learning hypothesis, which posits that LLMs simultaneously learn both the task-relevant latent concepts and backdoor latent concepts within poisoned demonstrations, jointly influencing the probability of model outputs. Through theoretical analysis, we derive an upper bound for ICL backdoor effects, revealing that the vulnerability is dominated by the concept preference ratio between the task and the backdoor. Motivated by these findings, we propose ICLShield, a defense mechanism that dynamically adjusts the concept preference ratio. Our method encourages LLMs to select clean demonstrations during the ICL phase by leveraging confidence and similarity scores, effectively mitigating susceptibility to backdoor attacks. Extensive experiments across multiple LLMs and tasks demonstrate that our method achieves state-of-the-art defense effectiveness, significantly outperforming existing approaches (+26.02% on average). Furthermore, our method exhibits exceptional adaptability and defensive performance even for closed-source models (e.g., GPT-4).
中文: 本文提出双重学习假说,揭示大语言模型会同时学习任务相关概念和后门概念,并提出ICLShield防御机制,通过动态调整概念偏好来有效抵御后门攻击,在多个模型上实现了最先进的防御效果。
English: This paper introduces the dual-learning hypothesis, revealing that large language models learn both task and backdoor concepts from poisoned demonstrations, and proposes ICLShield—a defense mechanism that dynamically adjusts concept preference to mitigate backdoor attacks with state-of-the-art effectiveness across multiple models.
Authors:Siyuan Liang, Tianmeng Fang, Zhe Liu, Aishan Liu, Yan Xiao, Jinyuan He, Ee-Chien Chang, Xiaochun Cao
Abstract:
With the wide application of multimodal foundation models in intelligent agent systems, scenarios such as mobile device control, intelligent assistant interaction, and multimodal task execution are gradually relying on such large model-driven agents. However, the related systems are also increasingly exposed to potential jailbreak risks. Attackers may induce the agents to bypass the original behavioral constraints through specific inputs, and then trigger certain risky and sensitive operations, such as modifying settings, executing unauthorized commands, or impersonating user identities, which brings new challenges to system security. Existing security measures for intelligent agents still have limitations when facing complex interactions, especially in detecting potentially risky behaviors across multiple rounds of conversations or sequences of tasks. In addition, an efficient and consistent automated methodology to assist in assessing and determining the impact of such risks is currently lacking. This work explores the security issues surrounding mobile multimodal agents, attempts to construct a risk discrimination mechanism by incorporating behavioral sequence information, and designs an automated assisted assessment scheme based on a large language model. Through preliminary validation in several representative high-risk tasks, the results show that the method can improve the recognition of risky behaviors to some extent and assist in reducing the probability of agents being jailbroken. We hope that this study can provide some valuable references for the security risk modeling and protection of multimodal intelligent agent systems.
中文摘要:本研究针对多模态智能代理的安全漏洞,通过构建基于行为序列的风险判别机制和自动化辅助评估方案,旨在提高对越狱行为的识别能力并降低系统风险。
English Summary: This study addresses the security vulnerabilities of multimodal intelligent agents by developing a risk discrimination mechanism using behavioral sequences and an automated assessment scheme to enhance detection of jailbreak attempts and reduce system risks.
Authors:Jindong Li, Tenglong Li, Ruiqi Chen, Guobin Shen, Dongcheng Zhao, Qian Zhang, Yi Zeng
Abstract:
Deploying large language models (LLMs) on embedded devices remains a significant research challenge due to the high computational and memory demands of LLMs and the limited hardware resources available in such environments. While embedded FPGAs have demonstrated performance and energy efficiency in traditional deep neural networks, their potential for LLM inference remains largely unexplored. Recent efforts to deploy LLMs on FPGAs have primarily relied on large, expensive cloud-grade hardware and have only shown promising results on relatively small LLMs, limiting their real-world applicability. In this work, we present Hummingbird, a novel FPGA accelerator designed specifically for LLM inference on embedded FPGAs. Hummingbird is smaller, targeting embedded FPGAs such as the KV260 and ZCU104 with 67% LUT, 39% DSP, and 42% power savings over existing research. Hummingbird is stronger, targeting LLaMA3-8B and supporting longer contexts, overcoming the typical 4GB memory constraint of embedded FPGAs through offloading strategies. Finally, Hummingbird is faste, achieving 4.8 tokens/s and 8.6 tokens/s for LLaMA3-8B on the KV260 and ZCU104 respectively, with 93-94% model bandwidth utilization, outperforming the prior 4.9 token/s for LLaMA2-7B with 84% bandwidth utilization baseline. We further demonstrate the viability of industrial applications by deploying Hummingbird on a cost-optimized Spartan UltraScale FPGA, paving the way for affordable LLM solutions at the edge.
中文: 本文提出了Hummingbird,一种专为嵌入式系统设计的紧凑高效FPGA加速器,通过卸载策略克服内存限制,在显著节省资源的同时实现了高性能大语言模型推理和更快的处理速度。
English: This paper introduces Hummingbird, a compact and efficient FPGA accelerator designed for embedded systems that enables high-performance LLM inference with significant resource savings and improved speed, even overcoming memory constraints through offloading strategies.
Authors:Xuchen Wei, Yangxin Wu, Yaoyin Zhang, Henglyu Liu, Kehai Chen, Xuefeng Bai, Min Zhang
Abstract:
This paper presents HITSZ's submission for the IWSLT 2025 Indic track, focusing on speech-to-text translation (ST) for English-to-Indic and Indic-to-English language pairs. To enhance translation quality in this low-resource scenario, we propose an end-to-end system integrating the pre-trained Whisper automated speech recognition (ASR) model with Krutrim, an Indic-specialized large language model (LLM). Experimental results demonstrate that our end-to-end system achieved average BLEU scores of $28.88$ for English-to-Indic directions and $27.86$ for Indic-to-English directions. Furthermore, we investigated the Chain-of-Thought (CoT) method. While this method showed potential for significant translation quality improvements on successfully parsed outputs (e.g. a $13.84$ BLEU increase for Tamil-to-English), we observed challenges in ensuring the model consistently adheres to the required CoT output format.
中文: 本文介绍了HITSZ团队为IWSLT 2025印度语赛道开发的端到端语音翻译系统,通过整合Whisper语音识别与专攻印度语的Krutrim大模型取得优异翻译效果,并探索了具有潜力但输出稳定性有待提升的思维链增强方法。
English: This paper introduces HITSZ's IWSLT 2025 Indic track submission, featuring an end-to-end speech translation system combining Whisper ASR with the Indic-specialized Krutrim LLM, which achieved competitive BLEU scores and explored the promising yet inconsistent Chain-of-Thought method for quality enhancement.
Authors:Ziang Cao, Zhaoxi Chen, Liang Pan, Ziwei Liu
Abstract:
3D modeling is moving from virtual to physical. Existing 3D generation primarily emphasizes geometries and textures while neglecting physical-grounded modeling. Consequently, despite the rapid development of 3D generative models, the synthesized 3D assets often overlook rich and important physical properties, hampering their real-world application in physical domains like simulation and embodied AI. As an initial attempt to address this challenge, we propose \textbf{PhysX-3D}, an end-to-end paradigm for physical-grounded 3D asset generation. 1) To bridge the critical gap in physics-annotated 3D datasets, we present PhysXNet - the first physics-grounded 3D dataset systematically annotated across five foundational dimensions: absolute scale, material, affordance, kinematics, and function description. In particular, we devise a scalable human-in-the-loop annotation pipeline based on vision-language models, which enables efficient creation of physics-first assets from raw 3D assets.2) Furthermore, we propose \textbf{PhysXGen}, a feed-forward framework for physics-grounded image-to-3D asset generation, injecting physical knowledge into the pre-trained 3D structural space. Specifically, PhysXGen employs a dual-branch architecture to explicitly model the latent correlations between 3D structures and physical properties, thereby producing 3D assets with plausible physical predictions while preserving the native geometry quality. Extensive experiments validate the superior performance and promising generalization capability of our framework. All the code, data, and models will be released to facilitate future research in generative physical AI.
中文: 当前三维建模缺乏物理属性,因此PhysX-3D提出了一种基于物理的生成框架,通过新数据集和模型创建适用于现实世界应用的逼真三维资产。
English: Current 3D modeling lacks physical properties, so PhysX-3D introduces a physics-grounded generation framework with a new dataset and model to create realistic 3D assets for real-world applications.
Authors:Shuyang Xu, Zhiyang Dou, Mingyi Shi, Liang Pan, Leo Ho, Jingbo Wang, Yuan Liu, Cheng Lin, Yuexin Ma, Wenping Wang, Taku Komura
Abstract:
Enabling virtual humans to dynamically and realistically respond to diverse auditory stimuli remains a key challenge in character animation, demanding the integration of perceptual modeling and motion synthesis. Despite its significance, this task remains largely unexplored. Most previous works have primarily focused on mapping modalities like speech, audio, and music to generate human motion. As of yet, these models typically overlook the impact of spatial features encoded in spatial audio signals on human motion. To bridge this gap and enable high-quality modeling of human movements in response to spatial audio, we introduce the first comprehensive Spatial Audio-Driven Human Motion (SAM) dataset, which contains diverse and high-quality spatial audio and motion data. For benchmarking, we develop a simple yet effective diffusion-based generative framework for human MOtion generation driven by SPatial Audio, termed MOSPA, which faithfully captures the relationship between body motion and spatial audio through an effective fusion mechanism. Once trained, MOSPA could generate diverse realistic human motions conditioned on varying spatial audio inputs. We perform a thorough investigation of the proposed dataset and conduct extensive experiments for benchmarking, where our method achieves state-of-the-art performance on this task. Our model and dataset will be open-sourced upon acceptance. Please refer to our supplementary video for more details.
中文摘要:本文针对空间音频驱动人体运动这一未充分探索的挑战,首次提出全面的空间音频-人体运动数据集(SAM)和基于扩散生成的MOSPA框架,通过有效融合机制实现了高质量运动生成并达到最先进性能。
English Summary: This paper introduces the first comprehensive Spatial Audio-Driven Human Motion (SAM) dataset and a diffusion-based generative framework called MOSPA to address the underexplored challenge of generating realistic human movements in response to spatial audio, achieving state-of-the-art performance.
Authors:Fan Shi, Bin Li, Xiangyang Xue
Abstract:
Abstract visual reasoning (AVR) enables humans to quickly discover and generalize abstract rules to new scenarios. Designing intelligent systems with human-like AVR abilities has been a long-standing topic in the artificial intelligence community. Deep AVR solvers have recently achieved remarkable success in various AVR tasks. However, they usually use task-specific designs or parameters in different tasks. In such a paradigm, solving new tasks often means retraining the model, and sometimes retuning the model architectures, which increases the cost of solving AVR problems. In contrast to task-specific approaches, this paper proposes a novel Unified Conditional Generative Solver (UCGS), aiming to address multiple AVR tasks in a unified framework. First, we prove that some well-known AVR tasks can be reformulated as the problem of estimating the predictability of target images in problem panels. Then, we illustrate that, under the proposed framework, training one conditional generative model can solve various AVR tasks. The experiments show that with a single round of multi-task training, UCGS demonstrates abstract reasoning ability across various AVR tasks. Especially, UCGS exhibits the ability of zero-shot reasoning, enabling it to perform abstract reasoning on problems from unseen AVR tasks in the testing phase.
Chinese: 本文提出了一种统一条件生成求解器(UCGS),能够在单一框架下解决多种抽象视觉推理任务,并具备零样本推理能力,无需重新训练即可应对未见问题。
English: This paper introduces a Unified Conditional Generative Solver (UCGS) that addresses multiple abstract visual reasoning tasks in a single framework, enabling zero-shot reasoning on unseen problems without retraining.
Authors:Mingqi Wu, Zhihao Zhang, Qiaole Dong, Zhiheng Xi, Jun Zhao, Senjie Jin, Xiaoran Fan, Yuhao Zhou, Huijie Lv, Ming Zhang, Yanwei Fu, Qin Liu, Songyang Zhang, Qi Zhang
Abstract:
Reasoning in large language models has long been a central research focus, and recent studies employing reinforcement learning (RL) have introduced diverse methods that yield substantial performance gains with minimal or even no external supervision. Surprisingly, some studies even suggest that random or incorrect reward signals can enhance performance. However, these breakthroughs are predominantly observed for the mathematically strong Qwen2.5 series on benchmarks such as MATH-500, AMC, and AIME, and seldom transfer to models like Llama, which warrants a more in-depth investigation. In this work, our empirical analysis reveals that pre-training on massive web-scale corpora leaves Qwen2.5 susceptible to data contamination in widely used benchmarks. Consequently, conclusions derived from contaminated benchmarks on Qwen2.5 series may be unreliable. To obtain trustworthy evaluation results, we introduce a generator that creates fully clean arithmetic problems of arbitrary length and difficulty, dubbed RandomCalculation. Using this leakage-free dataset, we show that only accurate reward signals yield steady improvements that surpass the base model's performance boundary in mathematical reasoning, whereas random or incorrect rewards do not. Moreover, we conduct more fine-grained analyses to elucidate the factors underlying the different performance observed on the MATH-500 and RandomCalculation benchmarks. Consequently, we recommend that future studies evaluate models on uncontaminated benchmarks and, when feasible, test various model series to ensure trustworthy conclusions about RL and related methods.
Chinese: 近期研究表明,强化学习能提升如Qwen2.5等大语言模型的推理能力,但基准测试中的数据污染可能导致结论不可靠,因此使用如RandomCalculation的纯净数据集对准确评估至关重要。
English: Recent research shows that reinforcement learning can improve reasoning in large language models like Qwen2.5, but data contamination in benchmarks may lead to unreliable conclusions, so using clean datasets like RandomCalculation is essential for accurate evaluation.
Authors:Zizheng Zhan, Ken Deng, Huaixi Tang, Wen Xiang, Kun Wu, Weihao Li, Wenqiang Zhu, Jingxuan Xu, Lecheng Huang, Zongxian Feng, Shaojie Wang, Shangpeng Yan, Xuxing Chen, Jiaheng Liu, Zhongyuan Peng, Zuchen Gao, Haoyang Huang, Xiaojiang Zhang, Jinghui Wang, Zheng Lin, Mengtong Li, Huiming Wang, Ziqi Zhan, Yanan Wu, Yuanxing Zhang, Jian Yang, Guang Chen, Haotian Zhang, Bin Chen, Bing Yu
Abstract:
We present Kwaipilot-AutoThink (KAT), an open-source 40B large language model developed to address the overthinking problem in reasoning-intensive tasks, where an automatic thinking training paradigm is proposed to dynamically switch between reasoning and non-reasoning modes based on task complexity. Specifically, first, we construct the dual-regime dataset based on a novel tagging pipeline and a multi-agent synthesis strategy, and then we apply Multi-Token Prediction (MTP)-enhanced knowledge distillation, enabling efficient and fine-grained reasoning transfer with minimal pretraining cost. Besides, we implement a cold-start initialization strategy that introduces mode-selection priors using majority-vote signals and intent-aware prompting. Finally, we propose Step-SRPO, a reinforcement learning algorithm that incorporates intermediate supervision into the GRPO framework, offering structured guidance over both reasoning-mode selection and response accuracy. Extensive experiments across multiple benchmarks demonstrate that KAT consistently matches or even outperforms current state-of-the-art models, including DeepSeek-R1-0528 and Qwen3-235B-A22B, across a wide range of reasoning-intensive tasks while reducing token usage. Notably, KAT outperforms all open-source models and even surpasses o3-mini on the leakage-controlled LiveCodeBench Pro. Beyond academic evaluation, KAT has been successfully deployed in Kwaipilot (i.e., Kuaishou's internal coding assistant), where it improves real-world development workflows with high accuracy, efficiency, and controllable reasoning behaviors. Moreover, we are actively training a 200B Mixture-of-Experts (MoE) model with 40B active parameters, and early results already show significant gains, further demonstrating the scalability of the AutoThink paradigm.
中文: Kwaipilot-AutoThink (KAT) 是一个400亿参数的开源模型,通过动态切换推理模式解决复杂任务中的过度思考问题,在减少计算消耗的同时实现了领先性能。
English: Kwaipilot-AutoThink (KAT) is a 40B open-source model that tackles overthinking in reasoning tasks by dynamically switching between reasoning modes, achieving state-of-the-art performance while reducing token usage.
Authors:Liang Heng, Xiaoqi Li, Shangqing Mao, Jiaming Liu, Ruolin Liu, Jingli Wei, Yu-Kai Wang, Yueru Jia, Chenyang Gu, Rui Zhao, Shanghang Zhang, Hao Dong
Abstract:
Recent advancements in imitation learning have shown promising results in robotic manipulation, driven by the availability of high-quality training data. To improve data collection efficiency, some approaches focus on developing specialized teleoperation devices for robot control, while others directly use human hand demonstrations to obtain training data. However, the former requires both a robotic system and a skilled operator, limiting scalability, while the latter faces challenges in aligning the visual gap between human hand demonstrations and the deployed robot observations. To address this, we propose a human hand data collection system combined with our hand-to-gripper generative model, which translates human hand demonstrations into robot gripper demonstrations, effectively bridging the observation gap. Specifically, a GoPro fisheye camera is mounted on the human wrist to capture human hand demonstrations. We then train a generative model on a self-collected dataset of paired human hand and UMI gripper demonstrations, which have been processed using a tailored data pre-processing strategy to ensure alignment in both timestamps and observations. Therefore, given only human hand demonstrations, we are able to automatically extract the corresponding SE(3) actions and integrate them with high-quality generated robot demonstrations through our generation pipeline for training robotic policy model. In experiments, the robust manipulation performance demonstrates not only the quality of the generated robot demonstrations but also the efficiency and practicality of our data collection method. More demonstrations can be found at: https://rwor.github.io/
中文: 本研究提出了一种系统,通过腕戴式GoPro相机捕捉人手动作,并利用生成模型将其转化为机器人夹爪演示,有效弥合了视觉差异,实现了机器人操作策略的高效训练。
English: This study introduces a system that uses a wrist-mounted GoPro camera to capture human hand movements and a generative model to convert them into robot gripper demonstrations, effectively bridging the visual gap and enabling efficient training of robotic manipulation policies.
Authors:Sixiang Chen, Jiaming Liu, Siyuan Qian, Han Jiang, Lily Li, Renrui Zhang, Zhuoyang Liu, Chenyang Gu, Chengkai Hou, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang
Abstract:
Recently, mobile manipulation has attracted increasing attention for enabling language-conditioned robotic control in household tasks. However, existing methods still face challenges in coordinating mobile base and manipulator, primarily due to two limitations. On the one hand, they fail to explicitly model the influence of the mobile base on manipulator control, which easily leads to error accumulation under high degrees of freedom. On the other hand, they treat the entire mobile manipulation process with the same visual observation modality (e.g., either all 2D or all 3D), overlooking the distinct multimodal perception requirements at different stages during mobile manipulation. To address this, we propose the Adaptive Coordination Diffusion Transformer (AC-DiT), which enhances mobile base and manipulator coordination for end-to-end mobile manipulation. First, since the motion of the mobile base directly influences the manipulator's actions, we introduce a mobility-to-body conditioning mechanism that guides the model to first extract base motion representations, which are then used as context prior for predicting whole-body actions. This enables whole-body control that accounts for the potential impact of the mobile base's motion. Second, to meet the perception requirements at different stages of mobile manipulation, we design a perception-aware multimodal conditioning strategy that dynamically adjusts the fusion weights between various 2D visual images and 3D point clouds, yielding visual features tailored to the current perceptual needs. This allows the model to, for example, adaptively rely more on 2D inputs when semantic information is crucial for action prediction, while placing greater emphasis on 3D geometric information when precise spatial understanding is required. We validate AC-DiT through extensive experiments on both simulated and real-world mobile manipulation tasks.
中文摘要:提出的自适应协调扩散变换器(AC-DiT)通过引入移动基座到机械臂的协调机制和感知感知的多模态策略,有效提升了移动操作中基座运动与机械臂控制的协调性,并能根据任务阶段动态调整视觉感知模式。
English Summary: The proposed Adaptive Coordination Diffusion Transformer (AC-DiT) enhances mobile manipulation by introducing a mobility-to-body conditioning mechanism and a perception-aware multimodal strategy to better coordinate base movement with manipulator control while dynamically adapting to different visual perception needs.
Authors:Yi Su, Quantong Qiu, Yuechi Zhou, Juntao Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhefeng Wang, Min Zhang
Abstract:
Large Language Models (LLMs) require substantial computational resources during generation. While the Key-Value (KV) cache significantly accelerates this process by storing attention intermediates, its memory footprint grows linearly with sequence length, batch size, and model size, creating a bottleneck in long-context scenarios. Various KV cache compression techniques, including token eviction, quantization, and low-rank projection, have been proposed to mitigate this bottleneck, often complementing each other. This paper focuses on enhancing token eviction strategies. Token eviction leverages the observation that the attention patterns are often sparse, allowing for the removal of less critical KV entries to save memory. However, this reduction usually comes at the cost of notable accuracy degradation, particularly under high compression ratios. To address this issue, we propose \textbf{CaliDrop}, a novel strategy that enhances token eviction through calibration. Our preliminary experiments show that queries at nearby positions exhibit high similarity. Building on this observation, CaliDrop performs speculative calibration on the discarded tokens to mitigate the accuracy loss caused by token eviction. Extensive experiments demonstrate that CaliDrop significantly improves the accuracy of existing token eviction methods.
中文: 大语言模型中的键值缓存存在内存瓶颈,本文提出的CaliDrop策略通过邻近位置查询相似性进行校准增强的令牌淘汰,在减少内存占用的同时有效保持了模型精度。
English: Large Language Models face memory bottlenecks from their Key-Value cache, which this paper addresses with CaliDrop, a calibration-enhanced token eviction strategy that reduces memory usage while maintaining accuracy by leveraging query similarity at adjacent positions.
Authors:Jiajian Xie, Shengyu Zhang, Zhou Zhao, Fan Wu, Fei Wu
Abstract:
Diffusion Models have shown remarkable proficiency in image and video synthesis. As model size and latency increase limit user experience, hybrid edge-cloud collaborative framework was recently proposed to realize fast inference and high-quality generation, where the cloud model initiates high-quality semantic planning and the edge model expedites later-stage refinement. However, excessive cloud denoising prolongs inference time, while insufficient steps cause semantic ambiguity, leading to inconsistency in edge model output. To address these challenges, we propose EC-Diff that accelerates cloud inference through gradient-based noise estimation while identifying the optimal point for cloud-edge handoff to maintain generation quality. Specifically, we design a K-step noise approximation strategy to reduce cloud inference frequency by using noise gradients between steps and applying cloud inference periodically to adjust errors. Then we design a two-stage greedy search algorithm to efficiently find the optimal parameters for noise approximation and edge model switching. Extensive experiments demonstrate that our method significantly enhances generation quality compared to edge inference, while achieving up to an average $2\times$ speedup in inference compared to cloud inference. Video samples and source code are available at https://ec-diff.github.io/.
中文摘要:EC-Diff框架通过基于梯度的噪声估计和周期性云端推理加速扩散模型处理,同时采用优化切换策略保持生成质量,实现云端推理速度平均提升2倍且超越边缘模型生成效果。
English Summary: The proposed EC-Diff framework optimizes cloud-edge collaboration for diffusion models by using gradient-based noise estimation and periodic cloud inference to accelerate processing while maintaining output quality through an optimized handoff strategy.
Authors:Yilin Lu, Jianghang Lin, Linhuang Xie, Kai Zhao, Yansong Qu, Shengchuan Zhang, Liujuan Cao, Rongrong Ji
Abstract:
Anomaly inspection plays a vital role in industrial manufacturing, but the scarcity of anomaly samples significantly limits the effectiveness of existing methods in tasks such as localization and classification. While several anomaly synthesis approaches have been introduced for data augmentation, they often struggle with low realism, inaccurate mask alignment, and poor generalization. To overcome these limitations, we propose Generate Aligned Anomaly (GAA), a region-guided, few-shot anomaly image-mask pair generation framework. GAA leverages the strong priors of a pretrained latent diffusion model to generate realistic, diverse, and semantically aligned anomalies using only a small number of samples. The framework first employs Localized Concept Decomposition to jointly model the semantic features and spatial information of anomalies, enabling flexible control over the type and location of anomalies. It then utilizes Adaptive Multi-Round Anomaly Clustering to perform fine-grained semantic clustering of anomaly concepts, thereby enhancing the consistency of anomaly representations. Subsequently, a region-guided mask generation strategy ensures precise alignment between anomalies and their corresponding masks, while a low-quality sample filtering module is introduced to further improve the overall quality of the generated samples. Extensive experiments on the MVTec AD and LOCO datasets demonstrate that GAA achieves superior performance in both anomaly synthesis quality and downstream tasks such as localization and classification.
中文: 提出的GAA框架通过区域引导和少样本学习生成逼真且精确对齐的异常-掩码对,有效解决了现有异常合成方法在真实性和对齐精度上的不足,显著提升了工业异常检测的定位与分类性能。
English: The proposed Generate Aligned Anomaly (GAA) framework overcomes the limitations of existing anomaly synthesis methods by generating realistic and precisely aligned anomaly-mask pairs using few-shot learning, significantly improving performance in industrial inspection tasks like localization and classification.
Authors:You Huang, Lichao Chen, Jiayi Ji, Liujuan Cao, Shengchuan Zhang, Rongrong Ji
Abstract:
Interactive segmentation (IS) improves annotation efficiency by segmenting target regions from user prompts, with widespread applications in real-world scenarios. Current approaches face a critical trade-off: dense-token methods achieve superior accuracy and detail preservation but suffer from prohibitively slow processing on CPU devices, while the Segment Anything Model (SAM) advances the field with sparse prompt tokens for fast inference but compromises segmentation quality. In this paper, we propose Inter2Former to address this challenge by optimizing computation allocation in dense-token processing, which introduces four key enhancements. First, we propose Dynamic Prompt Embedding (DPE) that adaptively processes only regions of interest while avoiding additional overhead from background tokens. Second, we introduce Dynamic Hybrid Attention (DHA), which leverages previous segmentation masks to route tokens through either full attention (O(N2)) for boundary regions or our proposed efficient BSQ attention (O(N)) for non-boundary regions. Third, we develop Hybrid Mixture of Experts (HMoE), which applies similar adaptive computation strategies in FFN modules with CPU-optimized parallel processing. Finally, we present Dynamic Local Upsampling (DLU), a reverse operation of DPE, which localizes objects with a lightweight MLP and performs fine-grained upsampling only in detected regions. Experimental results on high-precision IS benchmarks demonstrate that Inter2Former achieves SOTA performance with high efficiency on CPU devices.
中文摘要:Inter2Former通过四项计算优化解决了交互式分割中精度与速度的权衡问题,在CPU设备上实现了高效且高精度的最先进性能。
English Summary: Inter2Former addresses the trade-off between accuracy and speed in interactive segmentation by introducing four computational optimizations that achieve state-of-the-art performance on CPU devices while maintaining high precision.
Authors:Geng Sun, Likun Zhang, Jiahui Li, Jing Wu, Jiacheng Wang, Zemin Sun, Changyuan Zhao, Victor C. M. Leung
Abstract:
The integration of unmanned aerial vehicles (UAVs) with Internet of Things (IoT) networks offers promising solutions for efficient data collection. However, the limited energy capacity of UAVs remains a significant challenge. In this case, laser beam directors (LBDs) have emerged as an effective technology for wireless charging of UAVs during operation, thereby enabling sustained data collection without frequent returns to charging stations (CSs). In this work, we investigate the age of information (AoI) optimization in LBD-powered UAV-assisted IoT networks, where multiple UAVs collect data from distributed IoTs while being recharged by laser beams. We formulate a joint optimization problem that aims to minimize the peak AoI while determining optimal UAV trajectories and laser charging strategies. This problem is particularly challenging due to its non-convex nature, complex temporal dependencies, and the need to balance data collection efficiency with energy consumption constraints. To address these challenges, we propose a novel multi-agent proximal policy optimization with temporal memory and multi-agent coordination (MAPPO-TM) framework. Specifically, MAPPO-TM incorporates temporal memory mechanisms to capture the dynamic nature of UAV operations and facilitates effective coordination among multiple UAVs through decentralized learning while considering global system objectives. Simulation results demonstrate that the proposed MAPPO-TM algorithm outperforms conventional approaches in terms of peak AoI minimization and energy efficiency. Ideally, the proposed algorithm achieves up to 15.1% reduction in peak AoI compared to conventional multi-agent deep reinforcement learning (MADRL) methods.
中文: 本研究提出MAPPO-TM算法,通过协同优化无人机轨迹与激光充电策略,在激光供电的无人机物联网网络中实现了峰值信息年龄的最小化,较传统方法性能提升达15.1%。
English: This study introduces the MAPPO-TM algorithm to optimize peak age of information in laser-powered UAV-assisted IoT networks by jointly managing UAV trajectories and charging strategies, achieving a 15.1% improvement over conventional methods.
Authors:Zecheng Tang, Haitian Wang, Quantong Qiu, Baibei Ji, Ruoxi Sun, Keyan Zhou, Juntao Li, Min Zhang
Abstract:
Long-context processing has become a fundamental capability for large language models~(LLMs). To assess model's long-context performance, numerous long-context evaluation benchmarks have been proposed. However, variations in evaluation settings across these benchmarks lead to inconsistent results, making it difficult to draw reliable comparisons. Besides, the high computational cost of long-context evaluation poses a significant barrier for the community to conduct comprehensive assessments of long-context models. In this paper, we propose LOOM-Scope, a comprehensive and efficient framework for long-context evaluation. LOOM-Scope standardizes evaluation settings across diverse benchmarks, supports deployment of efficient long-context inference acceleration methods, and introduces a holistic yet lightweight benchmark suite to evaluate models comprehensively. Homepage: https://loomscope.github.io
中文: LOOM-Scope 是一个标准化且高效的框架,旨在通过统一评估设置和集成加速方法,解决长上下文模型评估中的结果不一致和高计算成本问题。
English: LOOM-Scope is a standardized and efficient framework designed to address inconsistencies and high computational costs in evaluating large language models' long-context capabilities by unifying evaluation settings and incorporating acceleration methods.
Authors:Giuseppe Cartella, Vittorio Cuculo, Alessandro D'Amelio, Marcella Cornia, Giuseppe Boccignone, Rita Cucchiara
Abstract:
Predicting human gaze scanpaths is crucial for understanding visual attention, with applications in human-computer interaction, autonomous systems, and cognitive robotics. While deep learning models have advanced scanpath prediction, most existing approaches generate averaged behaviors, failing to capture the variability of human visual exploration. In this work, we present ScanDiff, a novel architecture that combines diffusion models with Vision Transformers to generate diverse and realistic scanpaths. Our method explicitly models scanpath variability by leveraging the stochastic nature of diffusion models, producing a wide range of plausible gaze trajectories. Additionally, we introduce textual conditioning to enable task-driven scanpath generation, allowing the model to adapt to different visual search objectives. Experiments on benchmark datasets show that ScanDiff surpasses state-of-the-art methods in both free-viewing and task-driven scenarios, producing more diverse and accurate scanpaths. These results highlight its ability to better capture the complexity of human visual behavior, pushing forward gaze prediction research. Source code and models are publicly available at https://aimagelab.github.io/ScanDiff.
Chinese: ScanDiff提出了一种结合扩散模型与视觉Transformer的新架构,通过文本条件化生成多样且真实的人类注视扫描路径,在自由观看和任务驱动场景下均超越现有最优方法,更好地模拟了人类视觉行为的复杂性。
English: ScanDiff introduces a diffusion-based architecture with Vision Transformers to generate diverse and realistic human gaze scanpaths, outperforming state-of-the-art methods in both free-viewing and task-driven scenarios by capturing variability through textual conditioning.
Authors:Meng Zhou, Bei Li, Jiahao Liu, Xiaowen Shi, Yang Bai, Rongxiang Weng, Jingang Wang, Xunliang Cai
Abstract:
Reinforcement learning (RL) has significantly improved the reasoning ability of large language models. However, current reward models underperform in challenging reasoning scenarios and predominant RL training paradigms rely on rule-based or reference-based rewards, which impose two critical limitations: 1) the dependence on finely annotated reference answer to attain rewards; and 2) the requirement for constrained output format. These limitations fundamentally hinder further RL data scaling and sustained enhancement of model reasoning performance. To address these limitations, we propose a comprehensive framework for evaluating and improving the performance of reward models in complex reasoning scenarios. We first present a reasoning-oriented benchmark (Libra Bench), systematically constructed from a diverse collection of challenging mathematical problems and advanced reasoning models, to address the limitations of existing reward model benchmarks in reasoning scenarios. We further introduce a novel approach for improving the generative reward model via learning-to-think methodologies. Based on the proposed approach, we develop Libra-RM series, a collection of generative reward models with reasoning capabilities that achieve state-of-the-art results on various benchmarks. Comprehensive downstream experiments are conducted and the experimental results demonstrate the correlation between our Libra Bench and downstream application, and the potential of Libra-RM to further improve reasoning models with unlabeled data.
Chinese Summary: 强化学习提升了大型语言模型的推理能力,但现有奖励模型在复杂推理场景中表现不佳,依赖参考答案和受限输出格式,为此我们提出了新基准和生成式奖励模型,实现了最优性能并利用未标注数据持续提升推理能力。
English Summary: Reinforcement learning enhances large language models' reasoning, but current reward models struggle in complex scenarios due to reliance on reference answers and constrained outputs, prompting the development of a new benchmark and generative reward models that achieve state-of-the-art results and enable further improvements with unlabeled data.
Authors:Xiuwei Chen, Wentao Hu, Hanhui Li, Jun Zhou, Zisheng Chen, Meng Cao, Yihan Zeng, Kui Zhang, Yu-Jie Yuan, Jianhua Han, Hang Xu, Xiaodan Liang
Abstract:
Recent advances in multimodal large language models (MLLMs) have shown impressive reasoning capabilities. However, further enhancing existing MLLMs necessitates high-quality vision-language datasets with carefully curated task complexities, which are both costly and challenging to scale. Although recent self-improving models that iteratively refine themselves offer a feasible solution, they still suffer from two core challenges: (i) most existing methods augment visual or textual data separately, resulting in discrepancies in data complexity (e.g., over-simplified diagrams paired with redundant textual descriptions); and (ii) the evolution of data and models is also separated, leading to scenarios where models are exposed to tasks with mismatched difficulty levels. To address these issues, we propose C2-Evo, an automatic, closed-loop self-improving framework that jointly evolves both training data and model capabilities. Specifically, given a base dataset and a base model, C2-Evo enhances them by a cross-modal data evolution loop and a data-model evolution loop. The former loop expands the base dataset by generating complex multimodal problems that combine structured textual sub-problems with iteratively specified geometric diagrams, while the latter loop adaptively selects the generated problems based on the performance of the base model, to conduct supervised fine-tuning and reinforcement learning alternately. Consequently, our method continuously refines its model and training data, and consistently obtains considerable performance gains across multiple mathematical reasoning benchmarks. Our code, models, and datasets will be released.
中文: 提出的C2-Evo框架通过跨模态数据进化与数据-模型协同演化的双循环机制,同步提升训练数据质量与模型能力,在数学推理基准测试中持续获得显著性能提升。
English: The proposed C2-Evo framework addresses multimodal model limitations by jointly evolving training data and model capabilities through cross-modal and data-model feedback loops, achieving significant performance gains in mathematical reasoning benchmarks.
Authors:Mengxi Liu, Lala Shakti Swarup Ray, Sizhen Bian, Ko Watanabe, Ankur Bhatt, Joanna Sorysz, Russel Torah, Bo Zhou, Paul Lukowicz
Abstract:
We present NeckSense, a novel wearable system for head pose tracking that leverages multi-channel bio-impedance sensing with soft, dry electrodes embedded in a lightweight, necklace-style form factor. NeckSense captures dynamic changes in tissue impedance around the neck, which are modulated by head rotations and subtle muscle activations. To robustly estimate head pose, we propose a deep learning framework that integrates anatomical priors, including joint constraints and natural head rotation ranges, into the loss function design. We validate NeckSense on 7 participants using the current SOTA pose estimation model as ground truth. Our system achieves a mean per-vertex error of 25.9 mm across various head movements with a leave-one-person-out cross-validation method, demonstrating that a compact, line-of-sight-free bio-impedance wearable can deliver head-tracking performance comparable to SOTA vision-based methods.
Chinese: NeckSense是一种新型可穿戴项链系统,通过多通道生物阻抗传感结合解剖学先验的深度学习框架,无需视线即可实现精确头部姿态追踪,其性能可与当前最先进的视觉方法相媲美。
English: NeckSense is a wearable necklace system that uses multi-channel bio-impedance sensing and deep learning with anatomical constraints to accurately track head pose without line-of-sight, achieving performance comparable to state-of-the-art vision-based methods.
Authors:Zhengyu Chen, Siqi Wang, Teng Xiao, Yudong Wang, Shiqi Chen, Xunliang Cai, Junxian He, Jingang Wang
Abstract:
Traditional scaling laws in natural language processing suggest that increasing model size and training data enhances performance. However, recent studies reveal deviations, particularly in large language models, where performance improvements decelerate, which is a phenomenon known as sub-scaling. This paper revisits these scaling laws by examining the impact of data quality and training strategies on model performance. Through extensive empirical analysis of over 400 models, we identify high data density and non-optimal resource allocation as key factors contributing to sub-scaling. High data density leads to diminishing returns due to redundant information, while optimal resource allocation is crucial for sustained performance improvements. We propose a sub-optimal scaling law that better predicts performance in sub-scaling regimes, highlighting the importance of data quality and diversity.
中文摘要:本文通过分析数据密度过高和资源分配不当导致大语言模型性能增长放缓的现象,提出了更准确预测次优缩放性能的新规律,强调数据质量与多样性的关键作用。
English Summary: This paper challenges traditional scaling laws by identifying high data density and suboptimal resource allocation as causes of performance deceleration in large language models, proposing a new sub-scaling law that emphasizes data quality and diversity.
Authors:Chien Van Nguyen, Ruiyi Zhang, Hanieh Deilamsalehy, Puneet Mathur, Viet Dac Lai, Haoliang Wang, Jayakumar Subramanian, Ryan A. Rossi, Trung Bui, Nikos Vlassis, Franck Dernoncourt, Thien Huu Nguyen
Abstract:
We propose Lizard, a linearization framework that transforms pretrained Transformer-based Large Language Models (LLMs) into flexible, subquadratic architectures for infinite-context generation. Transformer-based LLMs face significant memory and computational bottlenecks as context lengths increase, due to the quadratic complexity of softmax attention and the growing key-value (KV) cache. Lizard addresses these limitations by introducing a subquadratic attention mechanism that closely approximates softmax attention while preserving the output quality. Unlike previous linearization methods, which are often limited by fixed model structures and therefore exclude gating mechanisms, Lizard incorporates a gating module inspired by recent state-of-the-art linear models. This enables adaptive memory control, supports constant-memory inference, offers strong length generalization, and allows more flexible model design. Lizard combines gated linear attention for global context compression with sliding window attention enhanced by meta memory, forming a hybrid mechanism that captures both long-range dependencies and fine-grained local interactions. Moreover, we introduce a hardware-aware algorithm that accelerates the training speed of our models. Extensive experiments show that Lizard achieves near-lossless recovery of the teacher model's performance across standard language modeling tasks, while significantly outperforming previous linearization methods. On the 5-shot MMLU benchmark, Lizard improves over prior models by 18 points and shows significant improvements on associative recall tasks.
中文: Lizard是一种线性化框架,将预训练的Transformer大语言模型转化为次二次复杂度的无限上下文生成架构,通过门控混合注意力机制在保持输出质量的同时显著提升计算效率。
English: Lizard is a linearization framework that converts pretrained Transformer-based LLMs into subquadratic architectures for infinite-context generation, using a hybrid attention mechanism with gating modules to maintain output quality while overcoming computational bottlenecks.
Authors:Lala Shakti Swarup Ray, Mengxi Liu, Deepika Gurung, Bo Zhou, Sungho Suh, Paul Lukowicz
Abstract:
Human Activity Recognition (HAR) with wearable sensors is essential for applications in healthcare, fitness, and human-computer interaction. Bio-impedance sensing offers unique advantages for fine-grained motion capture but remains underutilized due to the scarcity of labeled data. We introduce SImpHAR, a novel framework addressing this limitation through two core contributions. First, we propose a simulation pipeline that generates realistic bio-impedance signals from 3D human meshes using shortest-path estimation, soft-body physics, and text-to-motion generation serving as a digital twin for data augmentation. Second, we design a two-stage training strategy with decoupled approach that enables broader activity coverage without requiring label-aligned synthetic data. We evaluate SImpHAR on our collected ImpAct dataset and two public benchmarks, showing consistent improvements over state-of-the-art methods, with gains of up to 22.3% and 21.8%, in terms of accuracy and macro F1 score, respectively. Our results highlight the promise of simulation-driven augmentation and modular training for impedance-based HAR.
中文: SImpHAR框架通过生成仿真的生物阻抗信号进行数据增强,并采用两阶段训练策略,显著提升了基于生物阻抗的人类活动识别性能,最高可实现22.3%的准确率提升。
English: The SImpHAR framework enhances Human Activity Recognition using bio-impedance sensing by generating realistic simulated signals for data augmentation and employing a two-stage training strategy, achieving up to 22.3% accuracy improvement over existing methods.
Authors:Lala Shakti Swarup Ray, Bo Zhou, Paul Lukowicz
Abstract:
Inertial measurement units (IMUs) are central to wearable systems for activity recognition and pose estimation, but sensor placement remains largely guided by heuristics and convention. In this work, we introduce Where to Wear (W2W), a simulation-based framework for systematic exploration of IMU placement utility across the body. Using labeled motion capture data, W2W generates realistic synthetic IMU signals at 512 anatomically distributed surface patches, enabling high-resolution, task-specific evaluation of sensor performance. We validate reliability of W2W by comparing spatial performance rankings from synthetic data with real IMU recordings in two multimodal datasets, confirming strong agreement in activity-wise trends. Further analysis reveals consistent spatial trends across activity types and uncovers overlooked high-utility regions that are rarely used in commercial systems. These findings challenge long-standing placement norms and highlight opportunities for more efficient, task-adaptive sensor configurations. Overall, our results demonstrate that simulation with W2W can serve as a powerful design tool for optimizing sensor placement, enabling scalable, data-driven strategies that are impractical to obtain through physical experimentation alone.
Chinese: W2W仿真框架通过运动捕捉数据生成合成信号,系统评估了IMU传感器在身体各部位的最佳放置位置,揭示了传统方案忽略的高效区域,为可穿戴设备设计提供了数据驱动的优化方法。
English: The W2W simulation framework systematically evaluates optimal IMU sensor placements by generating synthetic data from motion capture, revealing overlooked high-utility body regions and challenging conventional placement norms for more efficient wearable designs.
Authors:Mengxi Liu, Daniel GeiÃler, Deepika Gurung, Hymalai Bello, Bo Zhou, Sizhen Bian, Paul Lukowicz, Passant Elagroudy
Abstract:
Breathing is a spontaneous but controllable body function that can be used for hands-free interaction. Our work introduces "iBreath", a novel system to detect breathing gestures similar to clicks using bio-impedance. We evaluated iBreath's accuracy and user experience using two lab studies (n=34). Our results show high detection accuracy (F1-scores > 95.2%). Furthermore, the users found the gestures easy to use and comfortable. Thus, we developed eight practical guidelines for the future development of breathing gestures. For example, designers can train users on new gestures within just 50 seconds (five trials), and achieve robust performance with both user-dependent and user-independent models trained on data from 21 participants, each yielding accuracies above 90%. Users preferred single clicks and disliked triple clicks. The median gesture duration is 3.5-5.3 seconds. Our work provides solid ground for researchers to experiment with creating breathing gestures and interactions.
中文: iBreath是一种基于生物阻抗的系统,能以超过95.2%的F1分数高精度检测呼吸手势,用户反馈舒适易用,研究提出了八项开发指南(如50秒快速训练),为呼吸交互研究奠定了坚实基础。
English: iBreath is a bio-impedance-based system that detects breathing gestures with high accuracy (over 95.2% F1-score) and user comfort, establishing eight practical guidelines for future development, such as rapid training and robust performance across user models.
Authors:Dante D. Sanchez-Gallegos, J. L. Gonzalez-Compean, Maxime Gonthier, Valerie Hayot-Sasson, J. Gregory Pauloski, Haochen Pan, Kyle Chard, Jesus Carretero, Ian Foster
Abstract:
Data distribution across different facilities offers benefits such as enhanced resource utilization, increased resilience through replication, and improved performance by processing data near its source. However, managing such data is challenging due to heterogeneous access protocols, disparate authentication models, and the lack of a unified coordination framework. This paper presents DynoStore, a system that manages data across heterogeneous storage systems. At the core of DynoStore are data containers, an abstraction that provides standardized interfaces for seamless data management, irrespective of the underlying storage systems. Multiple data container connections create a cohesive wide-area storage network, ensuring resilience using erasure coding policies. Furthermore, a load-balancing algorithm ensures equitable and efficient utilization of storage resources. We evaluate DynoStore using benchmarks and real-world case studies, including the management of medical and satellite data across geographically distributed environments. Our results demonstrate a 10\% performance improvement compared to centralized cloud-hosted systems while maintaining competitive performance with state-of-the-art solutions such as Redis and IPFS. DynoStore also exhibits superior fault tolerance, withstanding more failures than traditional systems.
中文摘要:DynoStore系统通过数据容器抽象层统一管理异构存储系统,构建广域存储网络,在提升性能与负载均衡的同时展现出优于传统系统的容错能力。
English Summary: DynoStore is a system that manages data across diverse storage systems using data containers for unified access, offering improved performance, load balancing, and superior fault tolerance compared to centralized and existing distributed solutions.
Authors:Chao Yi, Dian Chen, Gaoyang Guo, Jiakai Tang, Jian Wu, Jing Yu, Mao Zhang, Sunhao Dai, Wen Chen, Wenjun Yang, Yuning Jiang, Zhujin Gao, Bo Zheng, Chi Li, Dimin Wang, Dixuan Wang, Fan Li, Fan Zhang, Haibin Chen, Haozhuang Liu, Jialin Zhu, Jiamang Wang, Jiawei Wu, Jin Cui, Ju Huang, Kai Zhang, Kan Liu, Lang Tian, Liang Rao, Longbin Li, Lulu Zhao, Na He, Peiyang Wang, Qiqi Huang, Tao Luo, Wenbo Su, Xiaoxiao He, Xin Tong, Xu Chen, Xunke Xi, Yang Li, Yaxuan Wu, Yeqiu Yang, Yi Hu, Yinnan Song, Yuchen Li, Yujie Luo, Yujin Yuan, Yuliang Yan, Zhengyang Wang, Zhibo Xiao, Zhixin Ma, Zile Zhou, Ziqi Zhang
Abstract:
Recommender systems are among the most impactful applications of artificial intelligence, serving as critical infrastructure connecting users, merchants, and platforms. However, most current industrial systems remain heavily reliant on historical co-occurrence patterns and log-fitting objectives, i.e., optimizing for past user interactions without explicitly modeling user intent. This log-fitting approach often leads to overfitting to narrow historical preferences, failing to capture users' evolving and latent interests. As a result, it reinforces filter bubbles and long-tail phenomena, ultimately harming user experience and threatening the sustainability of the whole recommendation ecosystem.
To address these challenges, we rethink the overall design paradigm of recommender systems and propose RecGPT, a next-generation framework that places user intent at the center of the recommendation pipeline. By integrating large language models (LLMs) into key stages of user interest mining, item retrieval, and explanation generation, RecGPT transforms log-fitting recommendation into an intent-centric process. To effectively align general-purpose LLMs to the above domain-specific recommendation tasks at scale, RecGPT incorporates a multi-stage training paradigm, which integrates reasoning-enhanced pre-alignment and self-training evolution, guided by a Human-LLM cooperative judge system. Currently, RecGPT has been fully deployed on the Taobao App. Online experiments demonstrate that RecGPT achieves consistent performance gains across stakeholders: users benefit from increased content diversity and satisfaction, merchants and the platform gain greater exposure and conversions. These comprehensive improvement results across all stakeholders validates that LLM-driven, intent-centric design can foster a more sustainable and mutually beneficial recommendation ecosystem.
中文: RecGPT提出了一种以用户意图为核心的推荐框架,通过整合大型语言模型解决传统日志拟合方法的局限,在提升用户内容多样性和满意度的同时,为商家和平台带来更多曝光与转化。
English: RecGPT introduces an intent-centric recommendation framework using large language models to overcome the limitations of traditional log-fitting approaches, enhancing diversity and satisfaction for users while boosting exposure and conversions for merchants and platforms.
Authors:Kaustav Chakraborty, Zeyuan Feng, Sushant Veer, Apoorva Sharma, Wenhao Ding, Sever Topan, Boris Ivanovic, Marco Pavone, Somil Bansal
Abstract:
The advent of end-to-end autonomy stacks - often lacking interpretable intermediate modules - has placed an increased burden on ensuring that the final output, i.e., the motion plan, is safe in order to validate the safety of the entire stack. This requires a safety monitor that is both complete (able to detect all unsafe plans) and sound (does not flag safe plans). In this work, we propose a principled safety monitor that leverages modern multi-modal trajectory predictors to approximate forward reachable sets (FRS) of surrounding agents. By formulating a convex program, we efficiently extract these data-driven FRSs directly from the predicted state distributions, conditioned on scene context such as lane topology and agent history. To ensure completeness, we leverage conformal prediction to calibrate the FRS and guarantee coverage of ground-truth trajectories with high probability. To preserve soundness in out-of-distribution (OOD) scenarios or under predictor failure, we introduce a Bayesian filter that dynamically adjusts the FRS conservativeness based on the predictor's observed performance. We then assess the safety of the ego vehicle's motion plan by checking for intersections with these calibrated FRSs, ensuring the plan remains collision-free under plausible future behaviors of others. Extensive experiments on the nuScenes dataset show our approach significantly improves soundness while maintaining completeness, offering a practical and reliable safety monitor for learned autonomy stacks.
Chinese: 本文提出了一种基于多模态轨迹预测器和保形预测的安全监控方法,通过校准前向可达集来确保自动驾驶系统运动规划的安全性和可靠性,并在nuScenes数据集上验证了其有效性。
English: This paper introduces a principled safety monitor for autonomous vehicles that uses multi-modal trajectory predictors and conformal prediction to ensure both completeness in detecting unsafe plans and soundness in avoiding false alarms, validated through experiments on the nuScenes dataset.
Authors:Sodtavilan Odonchimed, Tatsuya Matsushima, Simon Holk, Yusuke Iwasawa, Yutaka Matsuo
Abstract:
Diffusion Policies (DPs) have attracted attention for their ability to achieve significant accuracy improvements in various imitation learning tasks. However, DPs depend on Diffusion Models, which require multiple noise removal steps to generate a single action, resulting in long generation times. To solve this problem, knowledge distillation-based methods such as Consistency Policy (CP) have been proposed. However, these methods require a significant amount of training time, especially for difficult tasks. In this study, we propose RAGDP (Retrieve-Augmented Generation for Diffusion Policies) as a novel framework that eliminates the need for additional training using a knowledge base to expedite the inference of pre-trained DPs. In concrete, RAGDP encodes observation-action pairs through the DP encoder to construct a vector database of expert demonstrations. During inference, the current observation is embedded, and the most similar expert action is extracted. This extracted action is combined with an intermediate noise removal step to reduce the number of steps required compared to the original diffusion step. We show that by using RAGDP with the base model and existing acceleration methods, we improve the accuracy and speed trade-off with no additional training. Even when accelerating the models 20 times, RAGDP maintains an advantage in accuracy, with a 7% increase over distillation models such as CP.
中文: RAGDP框架通过利用专家演示的知识库来加速预训练扩散策略的推理,无需额外训练即可提升速度和准确性,即使在20倍加速下,其准确性仍比蒸馏方法高出7%。
English: The RAGDP framework enhances pre-trained Diffusion Policies by using a knowledge base of expert demonstrations to expedite inference, improving both speed and accuracy without additional training, even achieving a 7% accuracy boost over distillation methods at 20x acceleration.
Authors:Yuqi Yang, Weiqi Wang, Baixuan Xu, Wei Fan, Qing Zong, Chunkit Chan, Zheye Deng, Xin Liu, Yifan Gao, Changlong Yu, Chen Luo, Yang Li, Zheng Li, Qingyu Yin, Bing Yin, Yangqiu Song
Abstract:
Session history is a common way of recording user interacting behaviors throughout a browsing activity with multiple products. For example, if an user clicks a product webpage and then leaves, it might because there are certain features that don't satisfy the user, which serve as an important indicator of on-the-spot user preferences. However, all prior works fail to capture and model customer intention effectively because insufficient information exploitation and only apparent information like descriptions and titles are used. There is also a lack of data and corresponding benchmark for explicitly modeling intention in E-commerce product purchase sessions. To address these issues, we introduce the concept of an intention tree and propose a dataset curation pipeline. Together, we construct a sibling multimodal benchmark, SessionIntentBench, that evaluates L(V)LMs' capability on understanding inter-session intention shift with four subtasks. With 1,952,177 intention entries, 1,132,145 session intention trajectories, and 13,003,664 available tasks mined using 10,905 sessions, we provide a scalable way to exploit the existing session data for customer intention understanding. We conduct human annotations to collect ground-truth label for a subset of collected data to form an evaluation gold set. Extensive experiments on the annotated data further confirm that current L(V)LMs fail to capture and utilize the intention across the complex session setting. Further analysis show injecting intention enhances LLMs' performances.
中文: 会话历史记录用户浏览行为,但现有方法因信息利用不足而未能有效建模客户意图;为此我们提出意图树概念并构建SessionIntentBench多模态基准,通过海量会话数据评估AI对电商意图转移的理解能力。
English: Session history records user interactions during browsing but prior methods inadequately model customer intention due to limited data exploitation, leading to the creation of SessionIntentBench—a multimodal benchmark with intention trees and curated datasets to evaluate AI's understanding of intention shifts in e-commerce sessions.
Authors:Yusuke Hirota, Boyi Li, Ryo Hachiuma, Yueh-Hua Wu, Boris Ivanovic, Yuta Nakashima, Marco Pavone, Yejin Choi, Yu-Chiang Frank Wang, Chao-Han Huck Yang
Abstract:
Large Vision-Language Models (LVLMs) have transformed image captioning, shifting from concise captions to detailed descriptions. We introduce LOTUS, a leaderboard for evaluating detailed captions, addressing three main gaps in existing evaluations: lack of standardized criteria, bias-aware assessments, and user preference considerations. LOTUS comprehensively evaluates various aspects, including caption quality (e.g., alignment, descriptiveness), risks (\eg, hallucination), and societal biases (e.g., gender bias) while enabling preference-oriented evaluations by tailoring criteria to diverse user preferences. Our analysis of recent LVLMs reveals no single model excels across all criteria, while correlations emerge between caption detail and bias risks. Preference-oriented evaluations demonstrate that optimal model selection depends on user priorities.
中文摘要:LOTUS是一个新型评估平台,旨在全面评估大型视觉语言模型生成的详细图像描述,弥补了现有评估在标准化标准、偏见意识和用户偏好方面的不足,同时发现没有单一模型在所有方面表现最佳,最优选择取决于用户的具体需求。
English Summary: LOTUS is a new leaderboard designed to comprehensively evaluate detailed image captions generated by Large Vision-Language Models, addressing gaps in standardized criteria, bias awareness, and user preferences, while revealing that no single model performs best across all aspects and optimal selection depends on user priorities.
Authors:Qirong Yang, Yucheng Guo, Zicheng Liu, Yujie Yang, Qijin Yin, Siyuan Li, Shaomin Ji, Linlin Chao, Xiaoming Zhang, Stan Z. Li
Abstract:
The modeling of genomic sequences presents unique challenges due to their length and structural complexity. Traditional sequence models struggle to capture long-range dependencies and biological features inherent in DNA. In this work, we propose TrinityDNA, a novel DNA foundational model designed to address these challenges. The model integrates biologically informed components, including Groove Fusion for capturing DNA's structural features and Gated Reverse Complement (GRC) to handle the inherent symmetry of DNA sequences. Additionally, we introduce a multi-scale attention mechanism that allows the model to attend to varying levels of sequence dependencies, and an evolutionary training strategy that progressively adapts the model to both prokaryotic and eukaryotic genomes. TrinityDNA provides a more accurate and efficient approach to genomic sequence modeling, offering significant improvements in gene function prediction, regulatory mechanism discovery, and other genomics applications. Our model bridges the gap between machine learning techniques and biological insights, paving the way for more effective analysis of genomic data. Additionally, we introduced a new DNA long-sequence CDS annotation benchmark to make evaluations more comprehensive and oriented toward practical applications.
中文: TrinityDNA是一种创新的DNA基础模型,它整合了生物信息组件和多尺度注意力机制,能有效捕捉基因组序列的长程依赖和结构特征,显著提升了基因功能预测与调控机制发现的准确性。
English: TrinityDNA is a novel foundational model that integrates biologically informed components and a multi-scale attention mechanism to effectively capture long-range dependencies and structural features in genomic sequences, significantly improving accuracy in gene function prediction and regulatory discovery.
Authors:Shanghai AI Lab, :, Yicheng Bao, Guanxu Chen, Mingkang Chen, Yunhao Chen, Chiyu Chen, Lingjie Chen, Sirui Chen, Xinquan Chen, Jie Cheng, Yu Cheng, Dengke Deng, Yizhuo Ding, Dan Ding, Xiaoshan Ding, Yi Ding, Zhichen Dong, Lingxiao Du, Yuyu Fan, Xinshun Feng, Yanwei Fu, Yuxuan Gao, Ruijun Ge, Tianle Gu, Lujun Gui, Jiaxuan Guo, Qianxi He, Yuenan Hou, Xuhao Hu, Hong Huang, Kaichen Huang, Shiyang Huang, Yuxian Jiang, Shanzhe Lei, Jie Li, Lijun Li, Hao Li, Juncheng Li, Xiangtian Li, Yafu Li, Lingyu Li, Xueyan Li, Haotian Liang, Dongrui Liu, Qihua Liu, Zhixuan Liu, Bangwei Liu, Huacan Liu, Yuexiao Liu, Zongkai Liu, Chaochao Lu, Yudong Lu, Xiaoya Lu, Zhenghao Lu, Qitan Lv, Caoyuan Ma, Jiachen Ma, Xiaoya Ma, Zhongtian Ma, Lingyu Meng, Ziqi Miao, Yazhe Niu, Yuezhang Peng, Yuan Pu, Han Qi, Chen Qian, Xingge Qiao, Jingjing Qu, Jiashu Qu, Wanying Qu, Wenwen Qu, Xiaoye Qu, Qihan Ren, Qingnan Ren, Qingyu Ren, Jing Shao, Wenqi Shao, Shuai Shao, Dongxing Shi, Xin Song, Xinhao Song, Yan Teng, Xuan Tong, Yingchun Wang, Xuhong Wang, Shujie Wang, Xin Wang, Yige Wang, Yixu Wang, Yuanfu Wang, Futing Wang, Ruofan Wang, Wenjie Wang, Yajie Wang, Muhao Wei, Xiaoyu Wen, Fenghua Weng, Yuqi Wu, Yingtong Xiong, Xingcheng Xu, Chao Yang, Yue Yang, Yang Yao, Yulei Ye, Zhenyun Yin, Yi Yu, Bo Zhang, Qiaosheng Zhang, Jinxuan Zhang, Yexin Zhang, Yinqiang Zheng, Hefeng Zhou, Zhanhui Zhou, Pengyu Zhu, Qingzi Zhu, Yubo Zhu, Bowen Zhou
Abstract:
We introduce SafeWork-R1, a cutting-edge multimodal reasoning model that demonstrates the coevolution of capabilities and safety. It is developed by our proposed SafeLadder framework, which incorporates large-scale, progressive, safety-oriented reinforcement learning post-training, supported by a suite of multi-principled verifiers. Unlike previous alignment methods such as RLHF that simply learn human preferences, SafeLadder enables SafeWork-R1 to develop intrinsic safety reasoning and self-reflection abilities, giving rise to safety `aha' moments. Notably, SafeWork-R1 achieves an average improvement of $46.54\%$ over its base model Qwen2.5-VL-72B on safety-related benchmarks without compromising general capabilities, and delivers state-of-the-art safety performance compared to leading proprietary models such as GPT-4.1 and Claude Opus 4. To further bolster its reliability, we implement two distinct inference-time intervention methods and a deliberative search mechanism, enforcing step-level verification. Finally, we further develop SafeWork-R1-InternVL3-78B, SafeWork-R1-DeepSeek-70B, and SafeWork-R1-Qwen2.5VL-7B. All resulting models demonstrate that safety and capability can co-evolve synergistically, highlighting the generalizability of our framework in building robust, reliable, and trustworthy general-purpose AI.
中文: SafeWork-R1是一种多模态推理模型,通过SafeLadder框架实现了能力与安全性的协同共进化,在不牺牲性能的前提下显著提升了安全表现,相比主流模型达到了最先进的安全水平。
English: SafeWork-R1 is a multimodal reasoning model that achieves synergistic coevolution of capabilities and safety through the SafeLadder framework, demonstrating significant safety improvements without compromising performance and establishing state-of-the-art results compared to leading models.
Authors:Fang Wu, Weihao Xuan, Ximing Lu, Zaid Harchaoui, Yejin Choi
Abstract:
Recent advances in large reasoning models highlight Reinforcement Learning with Verifiable Rewards (RLVR) as a promising method for enhancing AI's capabilities, particularly in solving complex logical tasks. However, it remains unclear whether RLVR truly expands a model's reasoning boundary or merely amplifies high-reward outputs that the base model already knows for improved precision. This study presents a theoretical and empirical investigation that provides fresh insights into the potential limits of RLVR. First, we offer a new theoretical perspective that RLVR is constrained by the base model's support-unable to sample solutions with zero initial probability-and operates as a conservative reweighting mechanism that may restrict the discovery of entirely original solutions. We also identify an entropy-reward tradeoff: while RLVR reliably enhances precision, it may progressively narrow exploration and potentially overlook correct yet underrepresented solutions. Extensive empirical experiments validate that while RLVR consistently improves pass@1, the shrinkage of empirical support generally outweighs the expansion of empirical support under larger sampling budgets, failing to recover correct answers that were previously accessible to the base model. Interestingly, we also observe that while RLVR sometimes increases token-level entropy, resulting in greater uncertainty at each generation step, answer-level entropy declines, indicating that these seemingly more uncertain paths ultimately converge onto a smaller set of distinct answers. Taken together, these findings reveal potential limits of RLVR in extending reasoning horizons. Breaking this invisible leash may require future algorithmic innovations such as explicit exploration mechanisms or hybrid strategies that seed probability mass into underrepresented solution regions.
中文摘要:当前RLVR方法虽能提升精确度,却可能因受限于基础模型的初始知识分布而阻碍推理边界的真正拓展,未来需通过算法创新突破这种隐形约束。
English Summary: Current RLVR practices enhance precision but may limit reasoning expansion by constraining solutions to the base model's initial knowledge, requiring future innovations for true boundary extension.
Authors:Fang Wu, Weihao Xuan, Ximing Lu, Mingjie Liu, Yi Dong, Zaid Harchaoui, Yejin Choi
Abstract:
Recent advances in LLMs highlight RLVR as a promising method for enhancing AI's capabilities, particularly in solving complex logical tasks. However, it remains unclear whether the current practice of RLVR truly expands a model's reasoning boundary or mainly amplifies high-reward outputs that the base model already knows for improved precision. This study presents an empirical investigation that provides fresh insights into the potential limits of the common practice of RLVR. We examine how, under current training conditions, RLVR can operate as a support-constrained optimization mechanism that may restrict the discovery of entirely original solutions, remaining constrained by the base model's initial distribution. We also identify an entropy-reward trade-off: while the current RLVR recipe reliably enhances precision, it may progressively narrow exploration and potentially overlook correct yet underrepresented solutions. Extensive empirical experiments validate that while the current RLVR recipe consistently improves pass@1, the shrinkage of empirical support generally outweighs the expansion of empirical support under larger sampling budgets, failing to recover correct answers that were previously accessible to the base model. Interestingly, we also observe that while RLVR sometimes increases token-level entropy - resulting in greater uncertainty at each generation step - answer-level entropy declines, indicating that these seemingly more uncertain paths ultimately converge onto a smaller set of distinct answers. Taken together, these findings reveal potential limits of the current RLVR recipe in extending reasoning horizons. Breaking this invisible leash may require future algorithmic innovations such as explicit exploration mechanisms or hybrid strategies that seed probability mass into underrepresented solution regions.
中文摘要:当前RLVR方法虽能提升精确度,却可能因受限于基础模型的初始知识分布而阻碍推理边界的真正拓展,未来需通过算法创新突破这种隐形约束。
English Summary: Current RLVR practices enhance precision but may limit reasoning expansion by constraining solutions to the base model's initial knowledge, requiring future innovations for true boundary extension.
Authors:Yudong Jin, Sida Peng, Xuan Wang, Tao Xie, Zhen Xu, Yifan Yang, Yujun Shen, Hujun Bao, Xiaowei Zhou
Abstract:
This paper addresses the challenge of high-fidelity view synthesis of humans with sparse-view videos as input. Previous methods solve the issue of insufficient observation by leveraging 4D diffusion models to generate videos at novel viewpoints. However, the generated videos from these models often lack spatio-temporal consistency, thus degrading view synthesis quality. In this paper, we propose a novel sliding iterative denoising process to enhance the spatio-temporal consistency of the 4D diffusion model. Specifically, we define a latent grid in which each latent encodes the image, camera pose, and human pose for a certain viewpoint and timestamp, then alternately denoising the latent grid along spatial and temporal dimensions with a sliding window, and finally decode the videos at target viewpoints from the corresponding denoised latents. Through the iterative sliding, information flows sufficiently across the latent grid, allowing the diffusion model to obtain a large receptive field and thus enhance the 4D consistency of the output, while making the GPU memory consumption affordable. The experiments on the DNA-Rendering and ActorsHQ datasets demonstrate that our method is able to synthesize high-quality and consistent novel-view videos and significantly outperforms the existing approaches. See our project page for interactive demos and video results: https://diffuman4d.github.io/ .
Chinese: 本文提出了一种滑动迭代去噪方法,通过增强4D扩散模型的时空一致性,从稀疏视角视频中合成高质量的人体新视角视频,在多个数据集上显著优于现有方法。
English: This paper introduces a sliding iterative denoising technique to improve the spatio-temporal consistency of 4D diffusion models for high-fidelity human view synthesis from sparse-view videos, achieving superior results on benchmark datasets.
Authors:Jinqiu Jin, Yang Zhang, Fuli Feng, Xiangnan He
Abstract:
Recently, there has been a surge of interest in Multi-Target Cross-Domain Recommendation (MTCDR), which aims to enhance recommendation performance across multiple domains simultaneously. Existing MTCDR methods primarily rely on domain-shared entities (\eg users or items) to fuse and transfer cross-domain knowledge, which may be unavailable in non-overlapped recommendation scenarios. Some studies model user preferences and item features as domain-sharable semantic representations, which can be utilized to tackle the MTCDR task. Nevertheless, they often require extensive auxiliary data for pre-training. Developing more effective solutions for MTCDR remains an important area for further exploration.
Inspired by recent advancements in generative recommendation, this paper introduces GMC, a generative paradigm-based approach for multi-target cross-domain recommendation. The core idea of GMC is to leverage semantically quantized discrete item identifiers as a medium for integrating multi-domain knowledge within a unified generative model. GMC first employs an item tokenizer to generate domain-shared semantic identifiers for each item, and then formulates item recommendation as a next-token generation task by training a domain-unified sequence-to-sequence model. To further leverage the domain information to enhance performance, we incorporate a domain-aware contrastive loss into the semantic identifier learning, and perform domain-specific fine-tuning on the unified recommender. Extensive experiments on five public datasets demonstrate the effectiveness of GMC compared to a range of baseline methods.
中文: 本文提出GMC,一种基于生成范式的多目标跨域推荐方法,通过语义化项目标识和统一序列生成框架,结合领域感知对比学习和微调,有效提升了跨域推荐性能。
English: This paper introduces GMC, a generative model for multi-target cross-domain recommendation that uses semantic item identifiers and a unified sequence-to-sequence framework, enhanced by domain-aware contrastive learning and fine-tuning to improve performance across domains.
Authors:Mingjie Liu, Shizhe Diao, Jian Hu, Ximing Lu, Xin Dong, Hao Zhang, Alexander Bukharin, Shaokun Zhang, Jiaqi Zeng, Makesh Narsimhan Sreedhar, Gerald Shen, David Mosallanezhad, Di Zhang, Jonas Yang, June Yang, Oleksii Kuchaiev, Guilin Liu, Zhiding Yu, Pavlo Molchanov, Yejin Choi, Jan Kautz, Yi Dong
Abstract:
Recent advancements in reasoning-focused language models such as OpenAI's O1 and DeepSeek-R1 have shown that scaling test-time computation-through chain-of-thought reasoning and iterative exploration-can yield substantial improvements on complex tasks like mathematics and code generation. These breakthroughs have been driven by large-scale reinforcement learning (RL), particularly when combined with verifiable reward signals that provide objective and grounded supervision. In this report, we investigate the effects of prolonged reinforcement learning on a small language model across a diverse set of reasoning domains. Our work identifies several key ingredients for effective training, including the use of verifiable reward tasks, enhancements to Group Relative Policy Optimization (GRPO), and practical techniques to improve training stability and generalization. We introduce controlled KL regularization, clipping ratio, and periodic reference policy resets as critical components for unlocking long-term performance gains. Our model achieves significant improvements over strong baselines, including +14.7% on math, +13.9% on coding, and +54.8% on logic puzzle tasks. To facilitate continued research, we release our model publicly.
中文摘要:最新研究表明,通过强化学习结合可验证奖励信号,并优化训练策略,能显著提升小型语言模型在数学、编程和逻辑推理等复杂任务中的表现,其中数学任务提升14.7%,编程提升13.9%,逻辑谜题提升54.8%。
English Summary: Recent reasoning-focused language models demonstrate that scaling test-time computation and using large-scale reinforcement learning with verifiable rewards significantly enhance performance in mathematics, coding, and logic tasks.
Authors:Zhen Xu, Hongyu Zhou, Sida Peng, Haotong Lin, Haoyu Guo, Jiahao Shao, Peishan Yang, Qinglin Yang, Sheng Miao, Xingyi He, Yifan Wang, Yue Wang, Ruizhen Hu, Yiyi Liao, Xiaowei Zhou, Hujun Bao
Abstract:
Depth estimation is a fundamental task in 3D computer vision, crucial for applications such as 3D reconstruction, free-viewpoint rendering, robotics, autonomous driving, and AR/VR technologies. Traditional methods relying on hardware sensors like LiDAR are often limited by high costs, low resolution, and environmental sensitivity, limiting their applicability in real-world scenarios. Recent advances in vision-based methods offer a promising alternative, yet they face challenges in generalization and stability due to either the low-capacity model architectures or the reliance on domain-specific and small-scale datasets. The emergence of scaling laws and foundation models in other domains has inspired the development of "depth foundation models": deep neural networks trained on large datasets with strong zero-shot generalization capabilities. This paper surveys the evolution of deep learning architectures and paradigms for depth estimation across the monocular, stereo, multi-view, and monocular video settings. We explore the potential of these models to address existing challenges and provide a comprehensive overview of large-scale datasets that can facilitate their development. By identifying key architectures and training strategies, we aim to highlight the path towards robust depth foundation models, offering insights into their future research and applications.
深度估计是三维视觉的关键任务,基于大规模数据训练的基础模型有望突破传统硬件限制,实现强大的零样本泛化能力。
Depth estimation is a crucial 3D vision task where foundation models trained on large datasets show promise in overcoming traditional hardware limitations and achieving robust zero-shot generalization.
Authors:Tao Feng, Haozhen Zhang, Zijie Lei, Pengrui Han, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jiaxuan You
Abstract:
The rapid advancement of large language models (LLMs) has created a vibrant ecosystem of diverse architectures, each with unique strengths due to differences in design, training data, and objectives. However, most applications still rely on a single backend model, limiting coverage of capabilities and leading to inefficiencies in performance and token cost when tackling complex tasks. We highlight an underexploited opportunity: LLM routing data, produced when hosting platforms route diverse queries to different models, which can reveal comparative strengths across tasks. To address this, we propose FusionBench, a comprehensive routing benchmark covering 14 tasks across five domains with 20 open-source LLMs (8B to 671B parameters), capturing 103M tokens and summarizing reusable thought templates from top models. Building on this, we introduce FusionFactory, a systematic fusion framework with three levels: (1) query-level fusion, tailoring routers for each query using both direct responses and reasoning-augmented outputs; (2) thought-level fusion, leveraging abstract templates derived from top-performing LLMs' answers to similar queries; and (3) model-level fusion, transferring capabilities between models via distillation, using top responses or highest judge scores as training data. Experiments show FusionFactory consistently outperforms the best individual LLM across all 14 benchmarks, with optimal fusion configurations varying by benchmark, demonstrating the value of systematic LLM fusion in harnessing complementary strengths and improving overall performance.
中文: 该摘要介绍了LLMFusionBench这一融合多种大语言模型(LLM)的综合基准,以及FusionFactory框架——通过查询级、思维级和模型级融合技术利用多LLM日志数据,在各类任务中持续超越单一模型性能。
English: The abstract introduces LLMFusionBench, a comprehensive benchmark for fusing multiple large language models (LLMs), and FusionFactory, a framework that leverages multi-LLM log data through query-, thought-, and model-level fusion to consistently outperform individual models across diverse tasks.
Authors:Tao Feng, Haozhen Zhang, Zijie Lei, Pengrui Han, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jiaxuan You
Abstract:
The rapid advancement of large language models (LLMs) has created a diverse landscape of models, each excelling at different tasks. This diversity drives researchers to employ multiple LLMs in practice, leaving behind valuable multi-LLM log data. This naturally leads to the question of whether such logs can be fully leveraged to fuse LLMs' complementary capabilities. Although prior work has explored various strategies for integrating multiple LLMs, we argue that practical fusion must meet two essential requirements: (1) compatibility with real-world serving scenarios (e.g., local and API-based serving), and (2) flexibility to operate at different stages of the LLM pipeline to meet varied user needs (e.g., fine-tuning and inference stages). To this end, we introduce LLMFusionBench, a large-scale benchmark for LLM fusion that spans 14 tasks across five domains, with responses from 20 open-source LLMs (8B--671B) totaling 103M tokens. Building on LLMFusionBench, we propose FusionFactory, a systematic framework with three elaborated levels: (1) query-level fusion via tailored LLM routers, (2) thought-level fusion leveraging retrieved abstract reasoning templates, and (3) model-level fusion via distillation from top-ranked responses. Experiments show that FusionFactory consistently outperforms the best individual LLM across all 14 benchmarks, with the optimal fusion configuration varying across benchmarks, highlighting the promise of multi-LLM log data as a practical foundation for fusing diverse LLM capabilities.
中文: 该摘要介绍了LLMFusionBench这一融合多种大语言模型(LLM)的综合基准,以及FusionFactory框架——通过查询级、思维级和模型级融合技术利用多LLM日志数据,在各类任务中持续超越单一模型性能。
English: The abstract introduces LLMFusionBench, a comprehensive benchmark for fusing multiple large language models (LLMs), and FusionFactory, a framework that leverages multi-LLM log data through query-, thought-, and model-level fusion to consistently outperform individual models across diverse tasks.
Authors:Junyu Luo, Yuhao Tang, Yiwei Fu, Xiao Luo, Zhizhuo Kou, Zhiping Xiao, Wei Ju, Wentao Zhang, Ming Zhang
Abstract:
Unsupervised Graph Domain Adaptation (UGDA) leverages labeled source domain graphs to achieve effective performance in unlabeled target domains despite distribution shifts. However, existing methods often yield suboptimal results due to the entanglement of causal-spurious features and the failure of global alignment strategies. We propose SLOGAN (Sparse Causal Discovery with Generative Intervention), a novel approach that achieves stable graph representation transfer through sparse causal modeling and dynamic intervention mechanisms. Specifically, SLOGAN first constructs a sparse causal graph structure, leveraging mutual information bottleneck constraints to disentangle sparse, stable causal features while compressing domain-dependent spurious correlations through variational inference. To address residual spurious correlations, we innovatively design a generative intervention mechanism that breaks local spurious couplings through cross-domain feature recombination while maintaining causal feature semantic consistency via covariance constraints. Furthermore, to mitigate error accumulation in target domain pseudo-labels, we introduce a category-adaptive dynamic calibration strategy, ensuring stable discriminative learning. Extensive experiments on multiple real-world datasets demonstrate that SLOGAN significantly outperforms existing baselines.
中文: SLOGAN提出了一种新颖的无监督图域自适应方法,通过稀疏因果建模和生成式干预机制解耦稳定因果特征与伪相关性,在多个真实数据集上显著超越了现有基线。
English: SLOGAN introduces a novel unsupervised graph domain adaptation method that utilizes sparse causal modeling and generative interventions to disentangle stable causal features from spurious correlations, achieving superior performance across multiple datasets.
Authors:Keqin Bao, Nuo Chen, Xiaoyuan Li, Binyuan Hui, Bowen Yu, Fuli Feng, Xiangnan He, Dayiheng Liu
Abstract:
Enhancing reasoning capabilities remains a central focus in the LLM reasearch community. A promising direction involves requiring models to simulate code execution step-by-step to derive outputs for given inputs. However, as code is often designed for large-scale systems, direct application leads to over-reliance on complex data structures and algorithms, even for simple cases, resulting in overfitting to algorithmic patterns rather than core reasoning structures. To address this, we propose TeaR, which aims at teaching LLMs to reason better. TeaR leverages careful data curation and reinforcement learning to guide models in discovering optimal reasoning paths through code-related tasks, thereby improving general reasoning abilities. We conduct extensive experiments using two base models and three long-CoT distillation models, with model sizes ranging from 1.5 billion to 32 billion parameters, and across 17 benchmarks spanning Math, Knowledge, Code, and Logical Reasoning. The results consistently show significant performance improvements. Notably, TeaR achieves a 35.9% improvement on Qwen2.5-7B and 5.9% on R1-Distilled-7B.
中文: TeaR通过精心设计的数据和强化学习引导大模型在代码任务中发现最优推理路径,从而显著提升其在数学、知识、代码和逻辑推理等多领域的性能表现。
English: TeaR enhances LLMs' reasoning by guiding them through code-related tasks with curated data and reinforcement learning, achieving significant performance gains across diverse benchmarks.
Authors:Lan Li, Da-Wei Zhou, Han-Jia Ye, De-Chuan Zhan
Abstract:
Domain-Incremental Learning (DIL) focuses on continual learning in non-stationary environments, requiring models to adjust to evolving domains while preserving historical knowledge. DIL faces two critical challenges in the context of imbalanced data: intra-domain class imbalance and cross-domain class distribution shifts. These challenges significantly hinder model performance, as intra-domain imbalance leads to underfitting of few-shot classes, while cross-domain shifts require maintaining well-learned many-shot classes and transferring knowledge to improve few-shot class performance in old domains. To overcome these challenges, we introduce the Dual-Balance Collaborative Experts (DCE) framework. DCE employs a frequency-aware expert group, where each expert is guided by specialized loss functions to learn features for specific frequency groups, effectively addressing intra-domain class imbalance. Subsequently, a dynamic expert selector is learned by synthesizing pseudo-features through balanced Gaussian sampling from historical class statistics. This mechanism navigates the trade-off between preserving many-shot knowledge of previous domains and leveraging new data to improve few-shot class performance in earlier tasks. Extensive experimental results on four benchmark datasets demonstrate DCE's state-of-the-art performance.
中文摘要:DCE框架通过频率感知专家组解决域内类别不平衡,并利用基于高斯采样的动态专家选择器在保持旧域知识与提升少样本类别性能间取得平衡,在领域增量学习中实现先进性能。
English Summary: The DCE framework tackles domain-incremental learning challenges by using frequency-aware experts to handle intra-domain class imbalance and a dynamic selector that balances knowledge preservation and transfer through Gaussian sampling.
Authors:Ziang Ye, Yang Zhang, Wentao Shi, Xiaoyu You, Fuli Feng, Tat-Seng Chua
Abstract:
Graphical User Interface (GUI) agents powered by Large Vision-Language Models (LVLMs) have emerged as a revolutionary approach to automating human-machine interactions, capable of autonomously operating personal devices (e.g., mobile phones) or applications within the device to perform complex real-world tasks in a human-like manner. However, their close integration with personal devices raises significant security concerns, with many threats, including backdoor attacks, remaining largely unexplored. This work reveals that the visual grounding of GUI agent-mapping textual plans to GUI elements-can introduce vulnerabilities, enabling new types of backdoor attacks. With backdoor attack targeting visual grounding, the agent's behavior can be compromised even when given correct task-solving plans. To validate this vulnerability, we propose VisualTrap, a method that can hijack the grounding by misleading the agent to locate textual plans to trigger locations instead of the intended targets. VisualTrap uses the common method of injecting poisoned data for attacks, and does so during the pre-training of visual grounding to ensure practical feasibility of attacking. Empirical results show that VisualTrap can effectively hijack visual grounding with as little as 5% poisoned data and highly stealthy visual triggers (invisible to the human eye); and the attack can be generalized to downstream tasks, even after clean fine-tuning. Moreover, the injected trigger can remain effective across different GUI environments, e.g., being trained on mobile/web and generalizing to desktop environments. These findings underscore the urgent need for further research on backdoor attack risks in GUI agents.
中文: 基于大型视觉语言模型的图形用户界面代理面临新的安全威胁,其视觉定位功能易受后门攻击,仅需少量污染数据和隐蔽触发即可被劫持操作,且攻击可跨环境持续生效。
English: GUI agents using large vision-language models face new security risks from backdoor attacks that exploit visual grounding vulnerabilities, allowing attackers to hijack their actions with minimal poisoned data and stealthy triggers, even across different environments.
Authors:Yue Tan, Xiaoqian Hu, Hao Xue, Celso De Melo, Flora D. Salim
Abstract:
Frontier vision-language models (VLMs) have made remarkable improvements in video understanding tasks. However, real-world videos typically exist as continuously evolving data streams (e.g., dynamic scenes captured by wearable glasses), necessitating models to continually adapt to shifting data distributions and novel scenarios. Considering the prohibitive computational costs of fine-tuning models on new tasks, usually, a small subset of parameters is updated while the bulk of the model remains frozen. This poses new challenges to existing continual learning frameworks in the context of large multimodal foundation models, i.e., catastrophic forgetting and update conflict. While the foundation models struggle with parameter-efficient continual learning, the hippocampus in the human brain has evolved highly efficient mechanisms for memory formation and consolidation. Inspired by the rapid Binding and pattern separation mechanisms in the hippocampus, in this work, we propose Bisecle for video-language continual learning, where a multi-directional supervision module is used to capture more cross-modal relationships and a contrastive prompt learning scheme is designed to isolate task-specific knowledge to facilitate efficient memory storage. Binding and separation processes further strengthen the ability of VLMs to retain complex experiences, enabling robust and efficient continual learning in video understanding tasks. We perform a thorough evaluation of the proposed Bisecle, demonstrating its ability to mitigate forgetting and enhance cross-task generalization on several VideoQA benchmarks.
中文: 前沿视觉语言模型在适应不断演变的视频流时面临效率挑战,而受海马体记忆机制启发的Bisecle方法通过捕捉跨模态关联和隔离任务特定知识,增强了持续学习能力,有效减轻遗忘并提升泛化性能。
English: Frontier vision-language models face challenges in adapting to evolving video streams efficiently, but the proposed Bisecle method, inspired by hippocampal memory mechanisms, enhances continual learning by capturing cross-modal relationships and isolating task-specific knowledge to reduce forgetting and improve generalization.
Authors:Bo Wang, Qinyuan Cheng, Runyu Peng, Rong Bao, Peiji Li, Qipeng Guo, Linyang Li, Zhiyuan Zeng, Yunhua Zhou, Xipeng Qiu
Abstract:
Post-training processes are essential phases in grounding pre-trained language models to real-world tasks, with learning from demonstrations or preference signals playing a crucial role in this adaptation. We present a unified theoretical framework bridging Supervised Fine-Tuning (SFT) and preference learning in Large Language Model (LLM) post-training. Through rigorous mathematical derivation, we demonstrate that both SFT and preference learning methods like Direct Preference Optimization (DPO) operate within the same optimal policy-reward subspace, with SFT representing a special case of implicit reward learning. Our analysis reveals a critical limitation in conventional SFT: the KL divergence term in distribution matching becomes constant with respect to the policy during optimization, failing to constrain model updates. To address this, we propose a simple yet effective learning rate reduction approach that yields significant performance improvements (up to \textbf{25\%} relative gain and \textbf{6\%} absolute win rate increase in instruction following tasks. Additionally, we derive alternative SFT objectives from various f-divergence functions that preserve the KL term during optimization, further enhancing post-DPO model performance. Finally, we extend the theoretical relationship between LLM logits and Q-functions from preference learning to the SFT context, providing mathematical derivations and experimental validation.
Chinese: 本研究提出了一个统一的理论框架,将大语言模型后训练中的监督微调与偏好学习联系起来,揭示了它们共享的最优策略-奖励子空间,并提出了一种学习率降低方法,使模型性能显著提升,相对增益高达25%。
English: This study introduces a unified theoretical framework that links Supervised Fine-Tuning and preference learning in LLM post-training, revealing their shared optimal policy-reward subspace and proposing a learning rate reduction method that significantly improves model performance by up to 25% relative gain.
Authors:Songsheng Wang, Rucheng Yu, Zhihang Yuan, Chao Yu, Feng Gao, Yu Wang, Derek F. Wong
Abstract:
Vision-Language-Action (VLA) models have made substantial progress by leveraging the robust capabilities of Visual Language Models (VLMs). However, VLMs' significant parameter size and autoregressive (AR) decoding nature impose considerable computational demands on VLA models. While Speculative Decoding (SD) has shown efficacy in accelerating Large Language Models (LLMs) by incorporating efficient drafting and parallel verification, allowing multiple tokens to be generated in one forward pass, its application to VLA models remains unexplored. This work introduces Spec-VLA, an SD framework designed to accelerate VLA models. Due to the difficulty of the action prediction task and the greedy decoding mechanism of the VLA models, the direct application of the advanced SD framework to the VLA prediction task yields a minor speed improvement. To boost the generation speed, we propose an effective mechanism to relax acceptance utilizing the relative distances represented by the action tokens of the VLA model. Empirical results across diverse test scenarios affirm the effectiveness of the Spec-VLA framework, and further analysis substantiates the impact of our proposed strategies, which enhance the acceptance length by 44%, achieving 1.42 times speedup compared with the OpenVLA baseline, without compromising the success rate. The success of the Spec-VLA framework highlights the potential for broader application of speculative execution in VLA prediction scenarios.
中文: 本文提出Spec-VLA推测解码框架,通过放宽动作令牌接受标准来加速视觉-语言-动作模型,在保持成功率的同时实现比基线1.42倍的加速效果。
English: This paper introduces Spec-VLA, a speculative decoding framework that accelerates Vision-Language-Action models by relaxing token acceptance criteria, achieving a 1.42x speedup over baseline without performance loss.
Authors:Zihan Zhao, Bo Chen, Ziping Wan, Lu Chen, Xuanze Lin, Shiyang Yu, Situo Zhang, Da Ma, Zichen Zhu, Danyang Zhang, Huayang Wang, Zhongyang Dai, Liyang Wen, Xin Chen, Kai Yu
Abstract:
While large language models (LLMs) have achieved impressive progress, their application in scientific domains such as chemistry remains hindered by shallow domain understanding and limited reasoning capabilities. In this work, we focus on the specific field of chemistry and develop a Chemical Reasoner LLM, ChemDFM-R. We first construct a comprehensive dataset of atomized knowledge points to enhance the model's understanding of the fundamental principles and logical structure of chemistry. Then, we propose a mix-sourced distillation strategy that integrates expert-curated knowledge with general-domain reasoning skills, followed by domain-specific reinforcement learning to enhance chemical reasoning. Experiments on diverse chemical benchmarks demonstrate that ChemDFM-R achieves cutting-edge performance while providing interpretable, rationale-driven outputs. Further case studies illustrate how explicit reasoning chains significantly improve the reliability, transparency, and practical utility of the model in real-world human-AI collaboration scenarios.
中文:本研究开发了化学推理大模型ChemDFM-R,通过原子化知识构建和混合来源蒸馏策略提升领域理解,在化学基准测试中取得领先性能并提供可解释的输出结果。
English: The study introduces ChemDFM-R, a chemical reasoning LLM that enhances domain understanding through atomized knowledge and a mix-sourced distillation strategy, achieving top performance and interpretable outputs in chemical benchmarks.
Authors:Zhiyuan He, Yike Zhang, Chengruidong Zhang, Huiqiang Jiang, Yuqing Yang, Lili Qiu
Abstract:
Large Language Models (LLMs) rely on attention mechanisms whose time complexity grows quadratically with input sequence length, creating significant computational bottlenecks during the prefilling stage. Existing static sparse attention methods typically degrade accuracy, while dynamic sparsity methods introduce additional computational overhead due to runtime sparse index estimation. To address these limitations, we propose TriangleMix, a novel training-free static attention pattern. TriangleMix employs dense attention in shallow layers and switches to a triangle-shaped sparse pattern in deeper layers. Extensive experiments demonstrate that TriangleMix reduces attention overhead by 3.7x to 15.3x in deep layers, and decreases overall Time-to-First-Token (TTFT) by 12% to 32% for sequence lengths ranging from 32K to 128K, without sacrificing model accuracy. Moreover, TriangleMix can be seamlessly integrated with dynamic sparsity methods to achieve further speedup, e.g. accelerating MInference by 19% at 128K, highlighting its potential to enhance LLM inference efficiency.
中文:TriangleMix是一种无需训练的静态注意力模式,在浅层采用密集注意力,在深层切换为三角形稀疏注意力,可将深层注意力开销降低3.7至15.3倍,并在32K至128K长序列上使首字符生成时间减少12%至32%,且不损失模型精度。
English: TriangleMix is a training-free static attention pattern that uses dense attention in shallow layers and switches to triangle-shaped sparse attention in deeper layers, reducing attention overhead by 3.7x to 15.3x and cutting Time-to-First-Token by 12% to 32% for long sequences without compromising accuracy.
Authors:Sarah Lockfisch, Kristian Schwethelm, Martin Menten, Rickmer Braren, Daniel Rueckert, Alexander Ziller, Georgios Kaissis
Abstract:
Model multiplicity refers to the existence of multiple machine learning models that describe the data equally well but may produce different predictions on individual samples. In medicine, these models can admit conflicting predictions for the same patient -- a risk that is poorly understood and insufficiently addressed.
In this study, we empirically analyze the extent, drivers, and ramifications of predictive multiplicity across diverse medical tasks and model architectures, and show that even small ensembles can mitigate/eliminate predictive multiplicity in practice. Our analysis reveals that (1) standard validation metrics fail to identify a uniquely optimal model and (2) a substantial amount of predictions hinges on arbitrary choices made during model development. Using multiple models instead of a single model reveals instances where predictions differ across equally plausible models -- highlighting patients that would receive arbitrary diagnoses if any single model were used. In contrast, (3) a small ensemble paired with an abstention strategy can effectively mitigate measurable predictive multiplicity in practice; predictions with high inter-model consensus may thus be amenable to automated classification. While accuracy is not a principled antidote to predictive multiplicity, we find that (4) higher accuracy achieved through increased model capacity reduces predictive multiplicity.
Our findings underscore the clinical importance of accounting for model multiplicity and advocate for ensemble-based strategies to improve diagnostic reliability. In cases where models fail to reach sufficient consensus, we recommend deferring decisions to expert review.
中文: 医学中的模型多重性会导致对同一患者产生矛盾预测,但采用小型集成模型与弃权策略可有效缓解此问题并提升诊断可靠性。
English: Model multiplicity in medicine leads to conflicting predictions for patients, but small ensembles with abstention strategies can effectively mitigate this issue and improve diagnostic reliability.
Authors:Ning Liao, Xiaoxing Wang, Zehao Lin, Weiyang Guo, Feng Hong, Shixiang Song, Geng Yu, Zihua Zhao, Sitao Xie, Longxuan Wei, Xiangqi Jin, Xiaohan Qin, Jiale Ma, Kai Chen, Jiangchao Yao, Zhouhan Lin, Junchi Yan, Zhiyu Li, Feiyu Xiong, Yanfeng Wang, Linfeng Zhang
Abstract:
A large language model (LLM) with knowledge in both scientific and general tasks is the foundation of science general intelligence. However, directly continued pretraining an LLM using science data usually leads to catastrophic forgetting, which indicates severe degradation in general ability. In this report, we present Innovator, which solves this problem by upcycling a pre-trained dense LLM into a fine-grained Mixtures-of-Experts model during continued pretraining, where different experts are expected to learn science knowledge in different disciplines, and a shared expert is utilized for general tasks. Innovator introduces a four-stage upcycle training paradigm: (1) Scientific Expert Induction on discipline-specific data, (2) Fine-grained Expert Splitting via FFN dimension decomposition, (3) Science-Aware Routing warmup, and (4) Generalist-Scientist Integration training on hybrid datasets. Such a paradigm enables knowledge in the general domain, and different scientific disciplines can be decoupled, avoiding the negative influence among knowledge in different domains. With 53.3B total parameters and 13.3B activated, Innovator extends Qwen2.5-7B using a shared general expert and 64 specialized scientific experts with 8 activated. Trained on 300B tokens with tri-level quality-controlled data, Innovator achieves 25% average improvement across 30 scientific tasks with a win rate as 70%, while retaining 99% performance in general tasks. Furthermore, Innovator-Reason, which is post-trained from Innovator for reasoning boosting, exhibits excellent reasoning performance in solving complex scientific problems with improvements over 30%.
中文摘要:Innovator通过将稠密大语言模型升级为细粒度专家混合模型,在持续科学预训练中避免了灾难性遗忘,在显著提升科学任务表现的同时保持了通用能力。
English Summary: Innovator is a fine-grained Mixtures-of-Experts model that upcycles a dense LLM to prevent catastrophic forgetting during science-focused continued pretraining, achieving significant improvements in scientific tasks while maintaining general capabilities.
Authors:Rameen Abdal, Or Patashnik, Ekaterina Deyneka, Hao Chen, Aliaksandr Siarohin, Sergey Tulyakov, Daniel Cohen-Or, Kfir Aberman
Abstract:
Recent advances in text-to-video generation have enabled high-quality synthesis from text and image prompts. While the personalization of dynamic concepts, which capture subject-specific appearance and motion from a single video, is now feasible, most existing methods require per-instance fine-tuning, limiting scalability. We introduce a fully zero-shot framework for dynamic concept personalization in text-to-video models. Our method leverages structured 2x2 video grids that spatially organize input and output pairs, enabling the training of lightweight Grid-LoRA adapters for editing and composition within these grids. At inference, a dedicated Grid Fill module completes partially observed layouts, producing temporally coherent and identity preserving outputs. Once trained, the entire system operates in a single forward pass, generalizing to previously unseen dynamic concepts without any test-time optimization. Extensive experiments demonstrate high-quality and consistent results across a wide range of subjects beyond trained concepts and editing scenarios.
中文: 本文提出了一种零样本动态概念个性化框架,通过Grid-LoRA适配器和网格填充模块,无需测试时优化即可实现可扩展的高质量视频生成。
English: The paper introduces a zero-shot framework for dynamic concept personalization in text-to-video generation, using Grid-LoRA adapters and a Grid Fill module to achieve scalable, high-quality video synthesis without test-time optimization.
Authors:Zhangyue Yin, Qiushi Sun, Zhiyuan Zeng, Qinyuan Cheng, Xipeng Qiu, Xuanjing Huang
Abstract:
Process Reward Models (PRMs) are crucial for guiding Large Language Models (LLMs) in complex scenarios by providing dense reward signals. However, existing PRMs primarily rely on heuristic approaches, which struggle with cross-domain generalization. While LLM-as-judge has been proposed to provide generalized rewards, current research has focused mainly on feedback results, overlooking the meaningful guidance embedded within the text. Additionally, static and coarse-grained evaluation criteria struggle to adapt to complex process supervision. To tackle these challenges, we propose Dynamic and Generalizable Process Reward Modeling (DG-PRM), which features a reward tree to capture and store fine-grained, multi-dimensional reward criteria. DG-PRM dynamically selects reward signals for step-wise reward scoring. To handle multifaceted reward signals, we pioneeringly adopt Pareto dominance estimation to identify discriminative positive and negative pairs. Experimental results show that DG-PRM achieves stunning performance on prevailing benchmarks, significantly boosting model performance across tasks with dense rewards. Further analysis reveals that DG-PRM adapts well to out-of-distribution scenarios, demonstrating exceptional generalizability.
中文摘要:提出的DG-PRM模型通过奖励树结构和帕累托优势估计,解决了现有过程奖励模型的局限性,在不同基准测试和任务中实现了卓越性能与泛化能力。
English Summary: The proposed DG-PRM model introduces a reward tree structure and Pareto dominance estimation to overcome limitations in existing process reward models, achieving superior performance and generalization across diverse benchmarks and tasks.
Authors:Boxun Li, Yadong Li, Zhiyuan Li, Congyi Liu, Weilin Liu, Guowei Niu, Zheyue Tan, Haiyang Xu, Zhuyu Yao, Tao Yuan, Dong Zhou, Yueqing Zhuang, Bo Zhao, Guohao Dai, Yu Wang
Abstract:
We present Megrez2, a novel lightweight and high-performance language model architecture optimized for device native deployment. Megrez2 introduces a novel cross-layer expert sharing mechanism, which significantly reduces total parameter count by reusing expert modules across adjacent transformer layers while maintaining most of the model's capacity. It also incorporates pre-gated routing, enabling memory-efficient expert loading and faster inference. As the first instantiation of the Megrez2 architecture, we introduce the Megrez2-Preview model, which is pre-trained on a 5-trillion-token corpus and further enhanced through supervised fine-tuning and reinforcement learning with verifiable rewards. With only 3B activated and 7.5B stored parameters, Megrez2-Preview demonstrates competitive or superior performance compared to larger models on a wide range of tasks, including language understanding, instruction following, mathematical reasoning, and code generation. These results highlight the effectiveness of the Megrez2 architecture to achieve a balance between accuracy, efficiency, and deployability, making it a strong candidate for real-world, resource-constrained applications.
中文: Megrez2是一种轻量级高性能语言模型架构,采用跨层专家共享和预门控路由技术,以极少参数量实现卓越性能,适用于资源受限的实际应用场景。
English: Megrez2 is a lightweight, high-performance language model architecture featuring cross-layer expert sharing and pre-gated routing, achieving competitive results with minimal parameters for efficient deployment in resource-constrained applications.
Authors:Yotam Erel, Olaf Dünkel, Rishabh Dabral, Vladislav Golyanik, Christian Theobalt, Amit H. Bermano
Abstract:
We introduce a new interpretation of the attention matrix as a discrete-time Markov chain. Our interpretation sheds light on common operations involving attention scores such as selection, summation, and averaging in a unified framework. It further extends them by considering indirect attention, propagated through the Markov chain, as opposed to previous studies that only model immediate effects. Our main observation is that tokens corresponding to semantically similar regions form a set of metastable states, where the attention clusters, while noisy attention scores tend to disperse. Metastable states and their prevalence can be easily computed through simple matrix multiplication and eigenanalysis, respectively. Using these lightweight tools, we demonstrate state-of-the-art zero-shot segmentation. Lastly, we define TokenRank -- the steady state vector of the Markov chain, which measures global token importance. We demonstrate that using it brings improvements in unconditional image generation. We believe our framework offers a fresh view of how tokens are being attended in modern visual transformers.
中文摘要:本文将注意力矩阵重新解释为马尔可夫链,揭示了语义相似标记形成亚稳态的机制,并引入衡量全局标记重要性的TokenRank方法,实现了先进的零样本分割并提升了图像生成效果。
English Summary: This paper reinterprets the attention matrix as a Markov chain, revealing how semantically similar tokens form metastable states and introducing TokenRank for measuring global token importance, achieving state-of-the-art zero-shot segmentation and improved image generation.
Authors:Situo Zhang, Hanqi Li, Lu Chen, Zihan Zhao, Xuanze Lin, Zichen Zhu, Bo Chen, Xin Chen, Kai Yu
Abstract:
Retrosynthesis planning, essential in organic synthesis and drug discovery, has greatly benefited from recent AI-driven advancements. Nevertheless, existing methods frequently face limitations in both applicability and explainability. Traditional graph-based and sequence-to-sequence models often lack generalized chemical knowledge, leading to predictions that are neither consistently accurate nor easily explainable. To address these challenges, we introduce RetroDFM-R, a reasoning-based large language model (LLM) designed specifically for chemical retrosynthesis. Leveraging large-scale reinforcement learning guided by chemically verifiable rewards, RetroDFM-R significantly enhances prediction accuracy and explainability. Comprehensive evaluations demonstrate that RetroDFM-R significantly outperforms state-of-the-art methods, achieving a top-1 accuracy of 65.0% on the USPTO-50K benchmark. Double-blind human assessments further validate the chemical plausibility and practical utility of RetroDFM-R's predictions. RetroDFM-R also accurately predicts multistep retrosynthetic routes reported in the literature for both real-world drug molecules and perovskite materials. Crucially, the model's explicit reasoning process provides human-interpretable insights, thereby enhancing trust and practical value in real-world retrosynthesis applications.
Chinese: RetroDFM-R是一种基于推理的大语言模型,通过化学验证奖励的大规模强化学习,显著提升了逆合成预测的准确性和可解释性,在USPTO-50K基准测试中达到65.0%的top-1准确率,并为实际应用提供人类可理解的分析过程。
English: RetroDFM-R, a reasoning-based large language model enhanced by reinforcement learning with chemical rewards, significantly improves retrosynthesis prediction accuracy and explainability, achieving 65.0% top-1 accuracy on USPTO-50K and providing human-interpretable insights for real-world applications.
Authors:Eyal German, Sagiv Antebi, Daniel Samira, Asaf Shabtai, Yuval Elovici
Abstract:
Large language models (LLMs) are increasingly trained on tabular data, which, unlike unstructured text, often contains personally identifiable information (PII) in a highly structured and explicit format. As a result, privacy risks arise, since sensitive records can be inadvertently retained by the model and exposed through data extraction or membership inference attacks (MIAs). While existing MIA methods primarily target textual content, their efficacy and threat implications may differ when applied to structured data, due to its limited content, diverse data types, unique value distributions, and column-level semantics. In this paper, we present Tab-MIA, a benchmark dataset for evaluating MIAs on tabular data in LLMs and demonstrate how it can be used. Tab-MIA comprises five data collections, each represented in six different encoding formats. Using our Tab-MIA benchmark, we conduct the first evaluation of state-of-the-art MIA methods on LLMs finetuned with tabular data across multiple encoding formats. In the evaluation, we analyze the memorization behavior of pretrained LLMs on structured data derived from Wikipedia tables. Our findings show that LLMs memorize tabular data in ways that vary across encoding formats, making them susceptible to extraction via MIAs. Even when fine-tuned for as few as three epochs, models exhibit high vulnerability, with AUROC scores approaching 90% in most cases. Tab-MIA enables systematic evaluation of these risks and provides a foundation for developing privacy-preserving methods for tabular data in LLMs.
中文: 大型语言模型在处理结构化表格数据时容易记忆个人身份信息,Tab-MIA基准测试表明,即使经过少量训练周期,模型仍易遭受成员推理攻击,且不同数据编码格式下的记忆模式存在显著差异。
English: Large language models trained on structured tabular data risk memorizing personally identifiable information, making them vulnerable to membership inference attacks, as demonstrated by the Tab-MIA benchmark which reveals high model susceptibility across different data encoding formats.
Authors:Zixiao Huang, Junhao Hu, Hao Lin, Chunyang Zhu, Yueran Tang, Quanlu Zhang, Zhen Guo, Zhenhua Li, Shengen Yan, Zhenhua Zhu, Guohao Dai, Yu Wang
Abstract:
The rapid scaling of large language models (LLMs) has significantly increased GPU memory pressure, which is further aggravated by training optimization techniques such as virtual pipeline and recomputation that disrupt tensor lifespans and introduce considerable memory fragmentation. Default GPU memory allocators of popular deep learning frameworks like PyTorch use online strategies without knowledge of tensor lifespans, which can waste up to 43\% of memory and cause out-of-memory errors, rendering optimization techniques ineffective or even unusable.
To address this, we introduce STWeaver, a GPU memory allocator for deep learning frameworks that reduces fragmentation by exploiting the spatial and temporal regularity in memory allocation behaviors of training workloads. STWeaver introduces a novel paradigm that combines offline planning with online allocation. The offline planning leverages spatio-temporal regularities to generate a near-optimal allocation plan, while the online allocation handles complex and dynamic models such as Mixture-of-Experts (MoE). Built as a pluggable PyTorch allocator, STWeaver reduces fragmentation ratio on average by 79.2\% (up to 100\%) across both dense and sparse models, with negligible overhead. This enables more efficient, high-throughput training configurations and improves performance by up to 32.5\%.
中文:STWeaver GPU内存分配器通过结合离线规划和在线分配,有效解决了大语言模型训练中的内存碎片问题,显著降低了碎片率并提升了性能,且开销极小。
English: The STWeaver GPU memory allocator addresses memory fragmentation in large language model training by combining offline planning with online allocation, significantly reducing fragmentation and improving performance with minimal overhead.
Authors:Yushu Wu, Yanyu Li, Anil Kag, Ivan Skorokhodov, Willi Menapace, Ke Ma, Arpit Sahni, Ju Hu, Aliaksandr Siarohin, Dhritiman Sagar, Yanzhi Wang, Sergey Tulyakov
Abstract:
Diffusion Transformers (DiT) have shown strong performance in video generation tasks, but their high computational cost makes them impractical for resource-constrained devices like smartphones, and real-time generation is even more challenging. In this work, we propose a series of novel optimizations to significantly accelerate video generation and enable real-time performance on mobile platforms. First, we employ a highly compressed variational autoencoder (VAE) to reduce the dimensionality of the input data without sacrificing visual quality. Second, we introduce a KD-guided, sensitivity-aware tri-level pruning strategy to shrink the model size to suit mobile platform while preserving critical performance characteristics. Third, we develop an adversarial step distillation technique tailored for DiT, which allows us to reduce the number of inference steps to four. Combined, these optimizations enable our model to achieve over 10 frames per second (FPS) generation on an iPhone 16 Pro Max, demonstrating the feasibility of real-time, high-quality video generation on mobile devices.
中文: 本研究提出了压缩VAE、三重剪枝策略和对抗性步数蒸馏三项优化技术,实现了在iPhone 16 Pro Max上约15帧/秒的高效高质量移动端视频生成。
English: This work introduces three optimizations—a compressed VAE, a tri-level pruning strategy, and adversarial step distillation—that enable efficient, high-quality video generation on mobile devices, achieving approximately 15 FPS on an iPhone 16 Pro Max.
Authors:Yushu Wu, Yanyu Li, Anil Kag, Ivan Skorokhodov, Willi Menapace, Ke Ma, Arpit Sahni, Ju Hu, Aliaksandr Siarohin, Dhritiman Sagar, Yanzhi Wang, Sergey Tulyakov
Abstract:
Diffusion Transformers (DiT) have shown strong performance in video generation tasks, but their high computational cost makes them impractical for resource-constrained devices like smartphones, and practical on-device generation is even more challenging. In this work, we propose a series of novel optimizations to significantly accelerate video generation and enable practical deployment on mobile platforms. First, we employ a highly compressed variational autoencoder (VAE) to reduce the dimensionality of the input data without sacrificing visual quality. Second, we introduce a KD-guided, sensitivity-aware tri-level pruning strategy to shrink the model size to suit mobile platforms while preserving critical performance characteristics. Third, we develop an adversarial step distillation technique tailored for DiT, which allows us to reduce the number of inference steps to four. Combined, these optimizations enable our model to achieve approximately 15 frames per second (FPS) generation speed on an iPhone 16 Pro Max, demonstrating the feasibility of efficient, high-quality video generation on mobile devices.
中文: 本研究提出了压缩VAE、三重剪枝策略和对抗性步数蒸馏三项优化技术,实现了在iPhone 16 Pro Max上约15帧/秒的高效高质量移动端视频生成。
English: This work introduces three optimizations—a compressed VAE, a tri-level pruning strategy, and adversarial step distillation—that enable efficient, high-quality video generation on mobile devices, achieving approximately 15 FPS on an iPhone 16 Pro Max.
Authors:Zichen Liu, Guoji Fu, Chao Du, Wee Sun Lee, Min Lin
Abstract:
Continual reinforcement learning (CRL) refers to a naturalistic setting where an agent needs to endlessly evolve, by trial and error, to solve multiple tasks that are presented sequentially. One of the largest obstacles to CRL is that the agent may forget how to solve previous tasks when learning a new task, known as catastrophic forgetting. In this paper, we propose to address this challenge by planning with online world models. Specifically, we learn a Follow-The-Leader shallow model online to capture the world dynamics, in which we plan using model predictive control to solve a set of tasks specified by any reward functions. The online world model is immune to forgetting by construction with a proven regret bound of $\mathcal{O}(\sqrt{K^2D\log(T)})$ under mild assumptions. The planner searches actions solely based on the latest online model, thus forming a FTL Online Agent (OA) that updates incrementally. To assess OA, we further design Continual Bench, a dedicated environment for CRL, and compare with several strong baselines under the same model-planning algorithmic framework. The empirical results show that OA learns continuously to solve new tasks while not forgetting old skills, outperforming agents built on deep world models with various continual learning techniques.
中文摘要:本文提出了一种跟随领导者在线智能体(OA),通过在线世界模型和模型预测控制来解决持续强化学习中的灾难性遗忘问题,并在新设计的基准测试中展现出优于基于深度模型的各类持续学习方法。
English Summary: This paper introduces a Follow-The-Leader Online Agent (OA) that uses online world models and model predictive control to overcome catastrophic forgetting in continual reinforcement learning, demonstrating superior performance over deep model-based approaches on a newly designed benchmark.
Authors:Taolin Zhang, Maosong Cao, Alexander Lam, Songyang Zhang, Kai Chen
Abstract:
Recently, the role of LLM-as-judge in evaluating large language models has gained prominence. However, current judge models suffer from narrow specialization and limited robustness, undermining their capacity for comprehensive evaluations. In this work, we present CompassJudger-2, a novel generalist judge model that overcomes these limitations via a task-driven, multi-domain data curation strategy. Central to our approach is supervising judgment tasks with verifiable rewards, guiding intrinsic critical reasoning through rejection sampling to foster robust, generalizable judgment capabilities. We introduce a refined learning objective with margin policy gradient loss to enhance performance. Empirically, CompassJudger-2 achieves superior results across multiple judge and reward benchmarks, and our 7B model demonstrates competitive judgment accuracy with significantly larger models like DeepSeek-V3 and Qwen3-235B-A22B. Additionally, we propose JudgerBenchV2, a comprehensive benchmark evaluating cross-domain judgment accuracy and rank consistency to standardize judge model evaluation. These contributions advance robust, scalable LLM judgment and establish new performance and evaluation standards.
中文: 本研究提出通用评估模型CompassJudger-2,通过任务驱动的多领域数据构建和可验证奖励监督机制,克服了现有评估模型的专业局限性和鲁棒性不足,在多个基准测试中表现优异,并借助JudgerBenchV2建立了新的评估标准。
English: This work introduces CompassJudger-2, a generalist judge model that overcomes specialization and robustness limitations through task-driven data curation and verifiable reward supervision, achieving competitive performance across benchmarks while establishing new evaluation standards with JudgerBenchV2.
Authors:Taolin Zhang, Zihan Ma, Maosong Cao, Junnan Liu, Songyang Zhang, Kai Chen
Abstract:
Large language models (LLMs) have achieved remarkable progress in code generation, yet their true programming competence remains underexplored. We introduce the Code Triangle framework, which systematically evaluates LLMs across three fundamental dimensions: editorial analysis, code implementation, and test case generation. Through extensive experiments on competitive programming benchmarks, we reveal that while LLMs can form a self-consistent system across these dimensions, their solutions often lack the diversity and robustness of human programmers. We identify a significant distribution shift between model cognition and human expertise, with model errors tending to cluster due to training data biases and limited reasoning transfer. Our study demonstrates that incorporating human-generated editorials, solutions, and diverse test cases, as well as leveraging model mixtures, can substantially enhance both the performance and robustness of LLMs. Furthermore, we reveal both the consistency and inconsistency in the cognition of LLMs that may facilitate self-reflection and self-improvement, providing a potential direction for developing more powerful coding models.
中文摘要:Code Triangle框架研究表明,尽管大语言模型在代码生成中能形成自洽系统,但由于训练偏差和有限推理迁移,其解决方案缺乏人类程序员的多样性和鲁棒性,而融入人类生成内容与模型混合可显著提升模型性能与稳健性。
English Summary: The Code Triangle framework reveals that while large language models can form self-consistent systems in code generation, they lack human-level diversity and robustness due to training biases and limited reasoning transfer, though incorporating human input and model mixtures can significantly enhance their performance.
Authors:Chenyang Ren, Yifan Jia, Huanyi Xie, Zhaobin Xu, Tianxing Wei, Liangyu Wang, Lijie Hu, Di Wang
Abstract:
Sharpness-aware Minimization (SAM) improves generalization in large-scale model training by linking loss landscape geometry to generalization. However, challenges such as mislabeled noisy data and privacy concerns have emerged as significant issues. Data attribution, which identifies the contributions of specific training samples, offers a promising solution. However, directly rendering existing data influence evaluation tools such as influence functions (IF) to SAM will be inapplicable or inaccurate as SAM utilizes an inner loop to find model perturbations that maximize loss, which the outer loop then minimizes, resulting in a doubled computational structure. Additionally, this bilevel structure complicates the modeling of data influence on the parameters. In this paper, based on the IF, we develop two innovative data valuation methods for SAM, each offering unique benefits in different scenarios: the Hessian-based IF and the Gradient Trajectory-based IF. The first one provides a comprehensive estimation of data influence using a closed-form measure that relies only on the trained model weights. In contrast, the other IF for SAM utilizes gradient trajectory information during training for more accurate and efficient data assessment. Extensive experiments demonstrate their effectiveness in data evaluation and parameter tuning, with applications in identifying mislabeled data, model editing, and enhancing interpretability.
Sharpness-aware Minimization (SAM) enhances generalization but faces challenges with noisy data and privacy, which can be addressed through novel data valuation methods based on influence functions that offer efficient assessment and practical applications in model optimization.
English Summary:
Authors:Yiming Yang, Hongbin Lin, Yueru Luo, Suzhong Fu, Chao Zheng, Xinrui Yan, Shuqi Mei, Kun Tang, Shuguang Cui, Zhen Li
Abstract:
Lane segment topology reasoning provides comprehensive bird's-eye view (BEV) road scene understanding, which can serve as a key perception module in planning-oriented end-to-end autonomous driving systems. Existing lane topology reasoning methods often fall short in effectively leveraging temporal information to enhance detection and reasoning performance. Recently, stream-based temporal propagation method has demonstrated promising results by incorporating temporal cues at both the query and BEV levels. However, it remains limited by over-reliance on historical queries, vulnerability to pose estimation failures, and insufficient temporal propagation. To overcome these limitations, we propose FASTopoWM, a novel fast-slow lane segment topology reasoning framework augmented with latent world models. To reduce the impact of pose estimation failures, this unified framework enables parallel supervision of both historical and newly initialized queries, facilitating mutual reinforcement between the fast and slow systems. Furthermore, we introduce latent query and BEV world models conditioned on the action latent to propagate the state representations from past observations to the current timestep. This design substantially improves the performance of temporal perception within the slow pipeline. Extensive experiments on the OpenLane-V2 benchmark demonstrate that FASTopoWM outperforms state-of-the-art methods in both lane segment detection (37.4% v.s. 33.6% on mAP) and centerline perception (46.3% v.s. 41.5% on OLS).
中文:FASTopoWM框架通过结合快慢系统与潜在世界模型,改进了车道段拓扑推理,克服了现有方法的局限,并在OpenLane-V2基准测试中实现了检测和感知性能的显著提升。
English: The proposed FASTopoWM framework enhances lane segment topology reasoning by integrating fast-slow systems with latent world models, overcoming limitations of existing methods and achieving superior performance in detection and perception on the OpenLane-V2 benchmark.
Authors:Yixuan Dang, Qinyang Xu, Yu Zhang, Xiangtong Yao, Liding Zhang, Zhenshan Bing, Florian Roehrbein, Alois Knoll
Abstract:
Perception using whisker-inspired tactile sensors currently faces a major challenge: the lack of active control in robots based on direct contact information from the whisker. To accurately reconstruct object contours, it is crucial for the whisker sensor to continuously follow and maintain an appropriate relative touch pose on the surface. This is especially important for localization based on tip contact, which has a low tolerance for sharp surfaces and must avoid slipping into tangential contact. In this paper, we first construct a magnetically transduced whisker sensor featuring a compact and robust suspension system composed of three flexible spiral arms. We develop a method that leverages a characterized whisker deflection profile to directly extract the tip contact position using gradient descent, with a Bayesian filter applied to reduce fluctuations. We then propose an active motion control policy to maintain the optimal relative pose of the whisker sensor against the object surface. A B-Spline curve is employed to predict the local surface curvature and determine the sensor orientation. Results demonstrate that our algorithm can effectively track objects and reconstruct contours with sub-millimeter accuracy. Finally, we validate the method in simulations and real-world experiments where a robot arm drives the whisker sensor to follow the surfaces of three different objects.
中文摘要:本研究开发了一种采用磁传导系统的触须传感器及主动控制策略,通过保持最佳接触姿态,实现了机器人对物体轮廓的亚毫米级精确跟踪与重建。
English Summary: The study introduces a whisker sensor with a magnetic transduction system and active control policy to maintain optimal contact pose, enabling accurate object contour tracking and reconstruction with sub-millimeter precision in robotic applications.
Authors:Yukai Zhao, Shaohua Wang, Jue Wang, Xing Hu, Xin Xia
Abstract:
Fuzzing is widely used for detecting bugs and vulnerabilities, with various techniques proposed to enhance its effectiveness. To combine the advantages of multiple technologies, researchers proposed ensemble fuzzing, which integrates multiple base fuzzers. Despite promising results, state-of-the-art ensemble fuzzing techniques face limitations in resource scheduling and performance evaluation, leading to unnecessary resource waste. In this paper, we propose Legion, a novel ensemble fuzzing framework that dynamically schedules resources during the ensemble fuzzing campaign. We designed a novel resource scheduling algorithm based on the upper confidence bound algorithm to reduce the resource consumption of ineffective base fuzzers. Additionally, we introduce a multidimensional seed evaluation strategy, which considers multiple metrics to achieve more comprehensive fine-grained performance evaluation. We implemented Legion as a prototype tool and evaluated its effectiveness on Google's fuzzer-test-suite as well as real-world open-source projects. Results show that Legion outperforms existing state-of-the-art base fuzzers and ensemble fuzzing techniques, detecting 20 vulnerabilities in real-world open-source projects-five previously unknown and three classified as CVEs.
Chinese: Legion是一种新颖的集成模糊测试框架,通过动态资源调度和多维种子评估策略,在真实项目中检测出20个漏洞(含5个未知漏洞和3个CVE漏洞),性能优于现有技术。
English: Legion is a novel ensemble fuzzing framework that dynamically schedules resources and employs a multidimensional seed evaluation strategy, outperforming existing techniques by detecting 20 vulnerabilities including five previously unknown and three CVEs in real-world projects.
Authors:Tianlin Li, Yunxiang Wei, Zhiming Li, Aishan Liu, Qing Guo, Xianglong Liu, Dongning Sun, Yang Liu
Abstract:
Recent advances in code large language models (CodeLLMs) have made them indispensable tools in modern software engineering. However, these models occasionally produce outputs that contain proprietary or sensitive code snippets, raising concerns about potential non-compliant use of training data, and posing risks to privacy and intellectual property. To ensure responsible and compliant deployment of CodeLLMs, training data detection (TDD) has become a critical task. While recent TDD methods have shown promise in natural language settings, their effectiveness on code data remains largely underexplored. This gap is particularly important given code's structured syntax and distinct similarity criteria compared to natural language. To address this, we conduct a comprehensive empirical study of seven state-of-the-art TDD methods on source code data, evaluating their performance across eight CodeLLMs. To support this evaluation, we introduce CodeSnitch, a function-level benchmark dataset comprising 9,000 code samples in three programming languages, each explicitly labeled as either included or excluded from CodeLLM training. Beyond evaluation on the original CodeSnitch, we design targeted mutation strategies to test the robustness of TDD methods under three distinct settings. These mutation strategies are grounded in the well-established Type-1 to Type-4 code clone detection taxonomy. Our study provides a systematic assessment of current TDD techniques for code and offers insights to guide the development of more effective and robust detection methods in the future.
中文: 代码大语言模型的最新进展引发了对可能包含专有代码的担忧,因此需要有效的训练数据检测方法;本研究通过新基准CodeSnitch评估了七种先进方法在代码数据上的表现,并测试了它们在不同代码变异场景下的鲁棒性。
English: Recent progress in code large language models has raised concerns about their potential inclusion of proprietary code, prompting a need for effective training data detection methods, which this study evaluates using a new benchmark called CodeSnitch to assess robustness across various code mutations.
Authors:Zhengyu Wu, Xunkai Li, Yinlin Zhu, Zekai Chen, Guochen Yan, Yanyu Yan, Hao Zhang, Yuming Ai, Xinmo Jin, Rong-Hua Li, Guoren Wang
Abstract:
In the era of big data applications, Federated Graph Learning (FGL) has emerged as a prominent solution that reconcile the tradeoff between optimizing the collective intelligence between decentralized datasets holders and preserving sensitive information to maximum. Existing FGL surveys have contributed meaningfully but largely focus on integrating Federated Learning (FL) and Graph Machine Learning (GML), resulting in early stage taxonomies that emphasis on methodology and simulated scenarios. Notably, a data centric perspective, which systematically examines FGL methods through the lens of data properties and usage, remains unadapted to reorganize FGL research, yet it is critical to assess how FGL studies manage to tackle data centric constraints to enhance model performances. This survey propose a two-level data centric taxonomy: Data Characteristics, which categorizes studies based on the structural and distributional properties of datasets used in FGL, and Data Utilization, which analyzes the training procedures and techniques employed to overcome key data centric challenges. Each taxonomy level is defined by three orthogonal criteria, each representing a distinct data centric configuration. Beyond taxonomy, this survey examines FGL integration with Pretrained Large Models, showcases realistic applications, and highlights future direction aligned with emerging trends in GML.
中文: 联邦图学习(FGL)在平衡数据协作与隐私保护中发挥关键作用,本综述创新性地提出基于数据特征与利用的双层分类框架,突破现有方法导向的研究范式,系统重构该领域研究体系。
English: Federated Graph Learning (FGL) addresses the balance between collaborative data optimization and privacy protection, with this survey proposing a novel data-centric taxonomy based on data characteristics and utilization to reorganize research beyond existing method-focused approaches.
Authors:Shuhan Liu, Xing Hu, Kerui Huang, Xiaohu Yang, David Lo, Xin Xia
Abstract:
Large language models (LLMs) have demonstrated impressive capabilities in code generation, where the natural language prompt plays a crucial role in conveying user intent to the model. However, prior studies have shown that LLMs are highly sensitive to prompt perturbations. Minor modifications in wording, syntax, or formatting can significantly reduce the functional correctness of generated code. As perturbations frequently occur in real-world scenarios, improving the robustness of LLMs to prompt perturbations is essential for ensuring reliable performance in practical code generation. In this paper, we introduce CREME (Code Robustness Enhancement via Model Editing), a novel approach that enhances LLM robustness through targeted parameter updates. CREME first identifies robustness-sensitive layers by comparing hidden states between an original prompt and its perturbed variant. Then, it performs lightweight parameter editing at the identified layer to reduce performance degradation. We evaluate CREME on two widely used code generation benchmarks (HumanEval and MBPP) along with their perturbed counterparts. Experimental results show that CREME improves Pass@1 accuracy by 63% on perturbed prompts while maintaining stable performance on clean inputs, with accuracy deviations within 1%. Further analysis reveals that robustness-sensitive layers are primarily concentrated in the middle and deeper layers of the network, and their locations vary across different model architectures. These insights provide a valuable foundation for developing future robustness-oriented editing strategies.
中文: 本文提出CREME方法,通过识别并编辑代码生成中大型语言模型的鲁棒敏感层,在保持原始提示性能稳定的同时,显著提升了扰动提示下的生成准确率。
English: This paper introduces CREME, a model editing approach that enhances the robustness of large language models in code generation by identifying and updating robustness-sensitive layers, achieving significant accuracy improvements on perturbed prompts while maintaining stable performance on clean inputs.
Authors:Fei Tang, Zhangxuan Gu, Zhengxi Lu, Xuyang Liu, Shuheng Shen, Changhua Meng, Wen Wang, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang
Abstract:
Graphical User Interface (GUI) grounding maps natural language instructions to precise interface locations for autonomous interaction. Current reinforcement learning approaches use binary rewards that treat elements as hit-or-miss targets, creating sparse signals that ignore the continuous nature of spatial interactions. Motivated by human clicking behavior that naturally forms Gaussian distributions centered on target elements, we introduce GUI Gaussian Grounding Rewards (GUI-G$^2$), a principled reward framework that models GUI elements as continuous Gaussian distributions across the interface plane. GUI-G$^2$ incorporates two synergistic mechanisms: Gaussian point rewards model precise localization through exponentially decaying distributions centered on element centroids, while coverage rewards assess spatial alignment by measuring the overlap between predicted Gaussian distributions and target regions. To handle diverse element scales, we develop an adaptive variance mechanism that calibrates reward distributions based on element dimensions. This framework transforms GUI grounding from sparse binary classification to dense continuous optimization, where Gaussian distributions generate rich gradient signals that guide models toward optimal interaction positions. Extensive experiments across ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro benchmarks demonstrate that GUI-G$^2$, substantially outperforms state-of-the-art method UI-TARS-72B, with the most significant improvement of 24.7% on ScreenSpot-Pro. Our analysis reveals that continuous modeling provides superior robustness to interface variations and enhanced generalization to unseen layouts, establishing a new paradigm for spatial reasoning in GUI interaction tasks.
中文摘要:本研究提出GUI-G²奖励框架,将图形界面元素建模为高斯分布以实现密集连续优化,相比现有方法性能提升最高达24.7%,并显著增强了对界面变化的鲁棒性。
English Summary: The study introduces GUI-G², a reward framework that models GUI elements as Gaussian distributions to provide dense continuous optimization signals, significantly outperforming existing methods by up to 24.7% and improving robustness to interface variations.
Authors:Nguyen Van Duc, Bui Duc Manh, Quang-Trung Luu, Dinh Thai Hoang, Van-Linh Nguyen, Diep N. Nguyen
Abstract:
This paper aims to propose a novel machine learning (ML) approach incorporating Homomorphic Encryption (HE) to address privacy limitations in Unmanned Aerial Vehicles (UAV)-based face detection. Due to challenges related to distance, altitude, and face orientation, high-resolution imagery and sophisticated neural networks enable accurate face recognition in dynamic environments. However, privacy concerns arise from the extensive surveillance capabilities of UAVs. To resolve this issue, we propose a novel framework that integrates HE with advanced neural networks to secure facial data throughout the inference phase. This method ensures that facial data remains secure with minimal impact on detection accuracy. Specifically, the proposed system leverages the Cheon-Kim-Kim-Song (CKKS) scheme to perform computations directly on encrypted data, optimizing computational efficiency and security. Furthermore, we develop an effective data encoding method specifically designed to preprocess the raw facial data into CKKS form in a Single-Instruction-Multiple-Data (SIMD) manner. Building on this, we design a secure inference algorithm to compute on ciphertext without needing decryption. This approach not only protects data privacy during the processing of facial data but also enhances the efficiency of UAV-based face detection systems. Experimental results demonstrate that our method effectively balances privacy protection and detection performance, making it a viable solution for UAV-based secure face detection. Significantly, our approach (while maintaining data confidentially with HE encryption) can still achieve an accuracy of less than 1% compared to the benchmark without using encryption.
本文提出了一种结合同态加密的机器学习方法,可在无人机人脸检测系统中保护数据隐私并保持高精度,实现安全高效的计算。
This paper introduces a homomorphic encryption-based machine learning approach that enables secure and efficient face detection in UAV systems while preserving privacy and maintaining high accuracy.
Authors:Yichun Yang, Rong-Hua Li, Meihao Liao, Guoren Wang
Abstract:
Effective Resistance (ER) is a fundamental tool in various graph learning tasks. In this paper, we address the problem of efficiently approximating ER on a graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$ with $n$ vertices and $m$ edges. First, we focus on local online-computation algorithms for ER approximation, aiming to improve the dependency on the approximation error parameter $ε$. Specifically, for a given vertex pair $(s,t)$, we propose a local algorithm with a time complexity of $\tilde{O}(\sqrt{d}/ε)$ to compute an $ε$-approximation of the $s,t$-ER value for expander graphs, where $d=\min \{d_s,d_t\}$. This improves upon the previous state-of-the-art, including an $\tilde{O}(1/ε^2)$ time algorithm based on random walk sampling by Andoni et al. (ITCS'19) and Peng et al. (KDD'21). Our method achieves this improvement by combining deterministic search with random walk sampling to reduce variance. Second, we establish a lower bound for ER approximation on expander graphs. We prove that for any $ε\in (0,1)$, there exist an expander graph and a vertex pair $(s,t)$ such that any local algorithm requires at least $Ω(1/ε)$ time to compute the $ε$-approximation of the $s,t$-ER value. Finally, we extend our techniques to index-based algorithms for ER computation. We propose an algorithm with $\tilde{O}(\min \{m+n/ε^{1.5},\sqrt{nm}/ε\})$ processing time, $\tilde{O}(n/ε)$ space complexity and $O(1)$ query complexity, which returns an $ε$-approximation of the $s,t$-ER value for any $s,t\in \mathcal{V}$ for expander graphs. Our approach improves upon the state-of-the-art $\tilde{O}(m/ε)$ processing time by Dwaraknath et al. (NeurIPS'24) and the $\tilde{O}(m+n/ε^2)$ processing time by Li and Sachdeva (SODA'23).
中文: 本文针对扩展图上的有效电阻近似问题,提出了高效的局部和索引算法,分别实现了$\tilde{O}(\sqrt{d}/ε)$时间复杂度的局部计算和$\tilde{O}(\min \{m+n/ε^{1.5},\sqrt{nm}/ε\})$处理时间的索引方法,同时建立了局部算法$Ω(1/ε)$时间复杂度的下界。
English: This paper introduces efficient local and index-based algorithms for approximating Effective Resistance on expander graphs, achieving improved time complexities of $\tilde{O}(\sqrt{d}/ε)$ for local computation and $\tilde{O}(\min \{m+n/ε^{1.5},\sqrt{nm}/ε\})$ for index-based approaches, while also establishing a matching lower bound of $Ω(1/ε)$ for local algorithms.
Authors:Yiming Yang, Yueru Luo, Bingkun He, Hongbin Lin, Suzhong Fu, Chao Zheng, Zhipeng Cao, Erlong Li, Chao Yan, Shuguang Cui, Zhen Li
Abstract:
Lane segment topology reasoning constructs a comprehensive road network by capturing the topological relationships between lane segments and their semantic types. This enables end-to-end autonomous driving systems to perform road-dependent maneuvers such as turning and lane changing. However, the limitations in consistent positional embedding and temporal multiple attribute learning in existing methods hinder accurate roadnet reconstruction. To address these issues, we propose TopoStreamer, an end-to-end temporal perception model for lane segment topology reasoning. Specifically, TopoStreamer introduces three key improvements: streaming attribute constraints, dynamic lane boundary positional encoding, and lane segment denoising. The streaming attribute constraints enforce temporal consistency in both centerline and boundary coordinates, along with their classifications. Meanwhile, dynamic lane boundary positional encoding enhances the learning of up-to-date positional information within queries, while lane segment denoising helps capture diverse lane segment patterns, ultimately improving model performance. Additionally, we assess the accuracy of existing models using a lane boundary classification metric, which serves as a crucial measure for lane-changing scenarios in autonomous driving. On the OpenLane-V2 dataset, TopoStreamer demonstrates significant improvements over state-of-the-art methods, achieving substantial performance gains of +3.0% mAP in lane segment perception and +1.7% OLS in centerline perception tasks.
Chinese Summary: TopoStreamer是一种端到端时序感知模型,通过流式属性约束、动态车道边界位置编码和车道段去噪来改进车道段拓扑关系推理,在OpenLane-V2数据集上实现了显著性能提升。
English Summary: TopoStreamer is an end-to-end temporal perception model that enhances lane segment topology reasoning through streaming attribute constraints, dynamic positional encoding, and lane segment denoising, achieving significant performance improvements on the OpenLane-V2 dataset.
Authors:Jiale Ding, Xiang Zheng, Cong Wang, Wei-Bin Lee, Xingjun Ma, Yu-Gang Jiang
Abstract:
As Large Language Models (LLMs) are increasingly deployed as black-box components in real-world applications, evaluating their safety-especially under adversarial prompting-has become critical. Arguably, effective safety evaluations should be adaptive, evolving with LLM capabilities, and also cover a broad spectrum of harmful topics and real-world scenarios to fully expose potential vulnerabilities. Existing manual safety benchmarks, built on handcrafted adversarial prompts, are limited by their static nature and the intensive labor required to update them, making it difficult to keep pace with rapidly advancing LLMs. In contrast, automated adversarial prompt generation offers a promising path toward adaptive evaluation. However, current methods often suffer from insufficient adversarial topic coverage (topic-level diversity) and weak alignment with real-world contexts. These shortcomings stem from the exploration-exploitation dilemma in black-box optimization and a lack of real-world contextualization, resulting in adversarial prompts that are both topically narrow and scenario-repetitive. To address these issues, we propose Reality-Oriented Safety Evaluation (ROSE), a novel framework that uses multi-objective reinforcement learning to fine-tune an adversarial LLM for generating topically diverse and contextually rich adversarial prompts. Experiments show that ROSE outperforms existing methods in uncovering safety vulnerabilities in state-of-the-art LLMs, with notable improvements in integrated evaluation metrics. We hope ROSE represents a step toward more practical and reality-oriented safety evaluation of LLMs. WARNING: This paper contains examples of potentially harmful text.
中文:ROSE框架采用多目标强化学习生成多样化和情境丰富的对抗性提示,相比现有方法,显著提升了对大型语言模型安全漏洞的检测能力。
English: The ROSE framework employs multi-objective reinforcement learning to generate diverse and contextually rich adversarial prompts, significantly improving the detection of safety vulnerabilities in large language models compared to existing methods.
Authors:Yang Yu, Kai Han, Hang Zhou, Yehui Tang, Kaiqi Huang, Yunhe Wang, Dacheng Tao
Abstract:
While large-scale training data is fundamental for developing capable large language models (LLMs), strategically selecting high-quality data has emerged as a critical approach to enhance training efficiency and reduce computational costs. Current data selection methodologies predominantly rely on static, training-agnostic criteria, failing to account for the dynamic model training and data interactions. In this paper, we propose a new Data Weighting Model (DWM) to adjust the weight of selected data within each batch to achieve a dynamic data utilization during LLM training. Specially, to better capture the dynamic data preference of the trained model, a bi-level optimization framework is implemented to update the weighting model. Our experiments demonstrate that DWM enhances the performance of models trained with randomly-selected data, and the learned weighting model can be transferred to enhance other data selection methods and models of different sizes. Moreover, we further analyze how a model's data preferences evolve throughout training, providing new insights into the data preference of the model during training.
Chinese: 本文提出了一种数据加权模型(DWM),通过双层优化框架在训练中动态调整数据权重,不仅提升了模型性能与效率,还揭示了模型数据偏好的动态演变规律。
English: The paper introduces a Data Weighting Model (DWM) that dynamically adjusts data weights during training using a bi-level optimization framework, improving model performance and efficiency while offering insights into evolving data preferences.
Authors:Mingyuan Sun, Zheng Fang, Jiaxu Wang, Kunyi Zhang, Qiang Zhang, Renjing Xu
Abstract:
We present GravLensX, an innovative method for rendering black holes with gravitational lensing effects using neural networks. The methodology involves training neural networks to fit the spacetime around black holes and then employing these trained models to generate the path of light rays affected by gravitational lensing. This enables efficient and scalable simulations of black holes with optically thin accretion disks, significantly decreasing the time required for rendering compared to traditional methods. We validate our approach through extensive rendering of multiple black hole systems with superposed Kerr metric, demonstrating its capability to produce accurate visualizations with significantly $15\times$ reduced computational time. Our findings suggest that neural networks offer a promising alternative for rendering complex astrophysical phenomena, potentially paving a new path to astronomical visualization.
中文摘要:GravLensX采用神经网络高效模拟黑洞的引力透镜效应,在保证精度的同时,渲染速度比传统方法快15倍。
English Summary: GravLensX utilizes neural networks to efficiently simulate black holes with gravitational lensing, achieving 15 times faster rendering while maintaining accuracy compared to conventional techniques.
Authors:Yanjie Fu, Dongjie Wang
Abstract:
Generative AI, large language models, and agentic AI have emerged separately of urban planning. However, the convergence between AI and urban planning presents an interesting opportunity towards AI urban planners. Existing studies conceptualizes urban planning as a generative AI task, where AI synthesizes land-use configurations under geospatial, social, and human-centric constraints and reshape automated urban design. We further identify critical gaps of existing generative urban planning studies: 1) the generative structure has to be predefined with strong assumption: all of adversarial generator-discriminator, forward and inverse diffusion structures, hierarchical zone-POI generative structure are predefined by humans; 2) ignore the power of domain expert developed tools: domain urban planners have developed various tools in the urban planning process guided by urban theory, while existing pure neural networks based generation ignore the power of the tools developed by urban planner practitioners. To address these limitations, we outline a future research direction agentic urban AI planner, calling for a new synthesis of agentic AI and participatory urbanism.
中文摘要:人工智能与城市规划的融合为AI城市规划师提供了机遇,但现有生成方法依赖预设结构且忽视领域工具,因此需发展融合参与式城市主义的自主AI规划系统。
English Summary: The convergence of AI and urban planning offers potential for AI urban planners, but current generative approaches rely on predefined structures and overlook domain-specific tools, prompting a call for agentic AI that integrates participatory urbanism.
Authors:Rui Liu, Tao Zhe, Zhong-Ren Peng, Necati Catbas, Xinyue Ye, Dongjie Wang, Yanjie Fu
Abstract:
Generative AI, large language models, and agentic AI have emerged separately of urban planning. However, the convergence between AI and urban planning presents an interesting opportunity towards AI urban planners. Existing studies conceptualizes urban planning as a generative AI task, where AI synthesizes land-use configurations under geospatial, social, and human-centric constraints and reshape automated urban design. We further identify critical gaps of existing generative urban planning studies: 1) the generative structure has to be predefined with strong assumption: all of adversarial generator-discriminator, forward and inverse diffusion structures, hierarchical zone-POI generative structure are predefined by humans; 2) ignore the power of domain expert developed tools: domain urban planners have developed various tools in the urban planning process guided by urban theory, while existing pure neural networks based generation ignore the power of the tools developed by urban planner practitioners. To address these limitations, we outline a future research direction agentic urban AI planner, calling for a new synthesis of agentic AI and participatory urbanism.
中文摘要:人工智能与城市规划的融合为AI城市规划师提供了机遇,但现有生成方法依赖预设结构且忽视领域工具,因此需发展融合参与式城市主义的自主AI规划系统。
English Summary: The convergence of AI and urban planning offers potential for AI urban planners, but current generative approaches rely on predefined structures and overlook domain-specific tools, prompting a call for agentic AI that integrates participatory urbanism.
Authors:Arun Vignesh Malarkkan, Dongjie Wang, Haoyue Bai, Yanjie Fu
Abstract:
The escalating threat of cyberattacks on real-time critical infrastructures poses serious risks to public safety, demanding detection methods that effectively capture complex system interdependencies and adapt to evolving attack patterns. Traditional real-time anomaly detection techniques often suffer from excessive false positives due to their statistical sensitivity to high data variance and class imbalance. To address these limitations, recent research has explored modeling causal relationships among system components. However, prior work mainly focuses on offline causal graph-based approaches that require static historical data and fail to generalize to real-time settings. These methods are fundamentally constrained by: (1) their inability to adapt to dynamic shifts in data distribution without retraining, and (2) the risk of catastrophic forgetting when lacking timely supervision in live systems. To overcome these challenges, we propose INCADET, a novel framework for incremental causal graph learning tailored to real-time cyberattack detection. INCADET dynamically captures evolving system behavior by incrementally updating causal graphs across streaming time windows. The framework comprises three modules: 1) Early Symptom Detection: Detects transitions in system status using divergence in edge-weight distributions across sequential causal graphs. 2) Incremental Causal Graph Learning: Leverages experience replay and edge reinforcement to continually refine causal structures while preserving prior knowledge. 3) Causal Graph Classification: Employs Graph Convolutional Networks (GCNs) to classify system status using the learned causal graphs. Extensive experiments on real-world critical infrastructure datasets demonstrate that INCADET achieves superior accuracy, robustness, and adaptability compared to both static causal and deep temporal baselines in evolving attack scenarios.
中文: 提出的INCADET框架通过从流数据中增量学习因果图,克服了传统方法的局限性,能够在关键基础设施系统中实现实时网络攻击检测,并具有更高的准确性和适应性。
English: The proposed INCADET framework overcomes limitations of traditional methods by incrementally learning causal graphs from streaming data, enabling real-time cyberattack detection with enhanced accuracy and adaptability in critical infrastructure systems.
Authors:Longhui Zhang, Bin Wang, Jiahao Wang, Xiaofeng Zhao, Min Zhang, Hao Yang, Meishan Zhang, Yu Li, Jing Li, Jun Yu, Min Zhang
Abstract:
Large language models (LLMs) have made significant strides in code translation tasks. However, ensuring both the correctness and readability of translated code remains a challenge, limiting their effective adoption in real-world software development. In this work, we propose F2STrans, a function-to-style guiding paradigm designed to progressively improve the performance of LLMs in code translation. Our approach comprises two key stages: (1) Functional learning, which optimizes translation correctness using high-quality source-target code pairs mined from online programming platforms, and (2) Style learning, which improves translation readability by incorporating both positive and negative style examples. Additionally, we introduce a novel code translation benchmark that includes up-to-date source code, extensive test cases, and manually annotated ground-truth translations, enabling comprehensive functional and stylistic evaluations. Experiments on both our new benchmark and existing datasets demonstrate that our approach significantly improves code translation performance. Notably, our approach enables Qwen-1.5B to outperform prompt-enhanced Qwen-32B and GPT-4 on average across 20 diverse code translation scenarios.
中文: 本文提出F2STrans双阶段优化框架,通过功能学习确保代码翻译正确性、风格学习提升可读性,使小规模模型在多种代码翻译场景中超越大型模型的性能表现。
English: This paper introduces F2STrans, a two-stage paradigm that enhances code translation by LLMs through functional learning for correctness and style learning for readability, achieving superior performance even with smaller models compared to larger ones in diverse scenarios.
Authors:Arun Vignesh Malarkkan, Haoyue Bai, Xinyuan Wang, Anjali Kaushik, Dongjie Wang, Yanjie Fu
Abstract:
As cyber-physical systems grow increasingly interconnected and spatially distributed, ensuring their resilience against evolving cyberattacks has become a critical priority. Spatio-Temporal Anomaly detection plays an important role in ensuring system security and operational integrity. However, current data-driven approaches, largely driven by black-box deep learning, face challenges in interpretability, adaptability to distribution shifts, and robustness under evolving system dynamics. In this paper, we advocate for a causal learning perspective to advance anomaly detection in spatially distributed infrastructures that grounds detection in structural cause-effect relationships. We identify and formalize three key directions: causal graph profiling, multi-view fusion, and continual causal graph learning, each offering distinct advantages in uncovering dynamic cause-effect structures across time and space. Drawing on real-world insights from systems such as water treatment infrastructures, we illustrate how causal models provide early warning signals and root cause attribution, addressing the limitations of black-box detectors. Looking ahead, we outline the future research agenda centered on multi-modality, generative AI-driven, and scalable adaptive causal frameworks. Our objective is to lay a new research trajectory toward scalable, adaptive, explainable, and spatially grounded anomaly detection systems. We hope to inspire a paradigm shift in cybersecurity research, promoting causality-driven approaches to address evolving threats in interconnected infrastructures.
中文摘要:本文主张采用因果学习方法改进网络物理系统中的时空异常检测,通过三个关键方向解决黑盒模型的局限性,并为构建可解释的自适应安全框架规划了未来研究路径。
English Summary: This paper advocates for causal learning to enhance spatio-temporal anomaly detection in cyber-physical systems, addressing limitations of black-box models through three key approaches and proposing future research directions for explainable and adaptive security frameworks.
Authors:Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan-Jiang Jiang, Krishna Haridasan, Ahmed Omran, Nikunj Saunshi, Dara Bahri, Gaurav Mishra, Eric Chu, Toby Boyd, Brad Hekman, Aaron Parisi, Chaoyi Zhang, Kornraphop Kawintiranon, Tania Bedrax-Weiss, Oliver Wang, Ya Xu, Ollie Purkiss, Uri Mendlovic, Ilaï Deutel, Nam Nguyen, Adam Langley, Flip Korn, Lucia Rossazza, Alexandre Ramé, Sagar Waghmare, Helen Miller, Nathan Byrd, Ashrith Sheshan, Raia Hadsell Sangnie Bhardwaj, Pawel Janus, Tero Rissa, Dan Horgan, Sharon Silver, Ayzaan Wahid, Sergey Brin, Yves Raimond, Klemen Kloboves, Cindy Wang, Nitesh Bharadwaj Gundavarapu, Ilia Shumailov, Bo Wang, Mantas Pajarskas, Joe Heyward, Martin Nikoltchev, Maciej Kula, Hao Zhou, Zachary Garrett, Sushant Kafle, Sercan Arik, Ankita Goel, Mingyao Yang, Jiho Park, Koji Kojima, Parsa Mahmoudieh, Koray Kavukcuoglu, Grace Chen, Doug Fritz, Anton Bulyenov, Sudeshna Roy, Dimitris Paparas, Hadar Shemtov, Bo-Juen Chen, Robin Strudel, David Reitter, Aurko Roy, Andrey Vlasov, Changwan Ryu, Chas Leichner, Haichuan Yang, Zelda Mariet, Denis Vnukov, Tim Sohn, Amy Stuart, Wei Liang, Minmin Chen, Praynaa Rawlani, Christy Koh, JD Co-Reyes, Guangda Lai, Praseem Banzal, Dimitrios Vytiniotis, Jieru Mei, Mu Cai, Mohammed Badawi, Corey Fry, Ale Hartman, Daniel Zheng, Eric Jia, James Keeling, Annie Louis, Ying Chen, Efren Robles, Wei-Chih Hung, Howard Zhou, Nikita Saxena, Sonam Goenka, Olivia Ma, Zach Fisher, Mor Hazan Taege, Emily Graves, David Steiner, Yujia Li, Sarah Nguyen, Rahul Sukthankar, Joe Stanton, Ali Eslami, Gloria Shen, Berkin Akin, Alexey Guseynov, Yiqian Zhou, Jean-Baptiste Alayrac, Armand Joulin, Efrat Farkash, Ashish Thapliyal, Stephen Roller, Noam Shazeer, Todor Davchev, Terry Koo, Hannah Forbes-Pollard, Kartik Audhkhasi, Greg Farquhar, Adi Mayrav Gilady, Maggie Song, John Aslanides, Piermaria Mendolicchio, Alicia Parrish, John Blitzer, Pramod Gupta, Xiaoen Ju, Xiaochen Yang, Puranjay Datta, Andrea Tacchetti, Sanket Vaibhav Mehta, Gregory Dibb, Shubham Gupta, Federico Piccinini, Raia Hadsell, Sujee Rajayogam, Jiepu Jiang, Patrick Griffin, Patrik Sundberg, Jamie Hayes, Alexey Frolov, Tian Xie, Adam Zhang, Kingshuk Dasgupta, Uday Kalra, Lior Shani, Klaus Macherey, Tzu-Kuo Huang, Liam MacDermed, Karthik Duddu, Paulo Zacchello, Zi Yang, Jessica Lo, Kai Hui, Matej Kastelic, Derek Gasaway, Qijun Tan, Summer Yue, Pablo Barrio, John Wieting, Weel Yang, Andrew Nystrom, Solomon Demmessie, Anselm Levskaya, Fabio Viola, Chetan Tekur, Greg Billock, George Necula, Mandar Joshi, Rylan Schaeffer, Swachhand Lokhande, Christina Sorokin, Pradeep Shenoy, Mia Chen, Mark Collier, Hongji Li, Taylor Bos, Nevan Wichers, Sun Jae Lee, Angéline Pouget, Santhosh Thangaraj, Kyriakos Axiotis, Phil Crone, Rachel Sterneck, Nikolai Chinaev, Victoria Krakovna, Oleksandr Ferludin, Ian Gemp, Stephanie Winkler, Dan Goldberg, Ivan Korotkov, Kefan Xiao, Malika Mehrotra, Sandeep Mariserla, Vihari Piratla, Terry Thurk, Khiem Pham, Hongxu Ma, Alexandre Senges, Ravi Kumar, Clemens Meyer, Ellie Talius, Nuo Wang Pierse, Ballie Sandhu, Horia Toma, Kuo Lin, Swaroop Nath, Tom Stone, Dorsa Sadigh, Nikita Gupta, Arthur Guez, Avi Singh, Matt Thomas, Tom Duerig, Yuan Gong, Richard Tanburn, Lydia Lihui Zhang, Phuong Dao, Mohamed Hammad, Sirui Xie, Shruti Rijhwani, Ben Murdoch, Duhyeon Kim, Will Thompson, Heng-Tze Cheng, Daniel Sohn, Pablo Sprechmann, Qiantong Xu, Srinivas Tadepalli, Peter Young, Ye Zhang, Hansa Srinivasan, Miranda Aperghis, Aditya Ayyar, Hen Fitoussi, Ryan Burnell, David Madras, Mike Dusenberry, Xi Xiong, Tayo Oguntebi, Ben Albrecht, Jörg Bornschein, Jovana MitroviÄ, Mason Dimarco, Bhargav Kanagal Shamanna, Premal Shah, Eren Sezener, Shyam Upadhyay, Dave Lacey, Craig Schiff, Sebastien Baur, Sanjay Ganapathy, Eva Schnider, Mateo Wirth, Connor Schenck, Andrey Simanovsky, Yi-Xuan Tan, Philipp Fränken, Dennis Duan, Bharath Mankalale, Nikhil Dhawan, Kevin Sequeira, Zichuan Wei, Shivanker Goel, Caglar Unlu, Yukun Zhu, Haitian Sun, Ananth Balashankar, Kurt Shuster, Megh Umekar, Mahmoud Alnahlawi, Aäron van den Oord, Kelly Chen, Yuexiang Zhai, Zihang Dai, Kuang-Huei Lee, Eric Doi, Lukas Zilka, Rohith Vallu, Disha Shrivastava, Jason Lee, Hisham Husain, Honglei Zhuang, Vincent Cohen-Addad, Jarred Barber, James Atwood, Adam Sadovsky, Quentin Wellens, Steven Hand, Arunkumar Rajendran, Aybuke Turker, CJ Carey, Yuanzhong Xu, Hagen Soltau, Zefei Li, Xinying Song, Conglong Li, Iurii Kemaev, Sasha Brown, Andrea Burns, Viorica Patraucean, Piotr Stanczyk, Renga Aravamudhan, Mathieu Blondel, Hila Noga, Lorenzo Blanco, Will Song, Michael Isard, Mandar Sharma, Reid Hayes, Dalia El Badawy, Avery Lamp, Itay Laish, Olga Kozlova, Kelvin Chan, Sahil Singla, Srinivas Sunkara, Mayank Upadhyay, Chang Liu, Aijun Bai, Jarek Wilkiewicz, Martin Zlocha, Jeremiah Liu, Zhuowan Li, Haiguang Li, Omer Barak, Ganna Raboshchuk, Jiho Choi, Fangyu Liu, Erik Jue, Mohit Sharma, Andreea Marzoca, Robert Busa-Fekete, Anna Korsun, Andre Elisseeff, Zhe Shen, Sara Mc Carthy, Kay Lamerigts, Anahita Hosseini, Hanzhao Lin, Charlie Chen, Fan Yang, Kushal Chauhan, Mark Omernick, Dawei Jia, Karina Zainullina, Demis Hassabis, Danny Vainstein, Ehsan Amid, Xiang Zhou, Ronny Votel, Eszter Vértes, Xinjian Li, Zongwei Zhou, Angeliki Lazaridou, Brendan McMahan, Arjun Narayanan, Hubert Soyer, Sujoy Basu, Kayi Lee, Bryan Perozzi, Qin Cao, Leonard Berrada, Rahul Arya, Ke Chen, Katrina, Xu, Matthias Lochbrunner, Alex Hofer, Sahand Sharifzadeh, Renjie Wu, Sally Goldman, Pranjal Awasthi, Xuezhi Wang, Yan Wu, Claire Sha, Biao Zhang, Maciej MikuÅa, Filippo Graziano, Siobhan Mcloughlin, Irene Giannoumis, Youhei Namiki, Chase Malik, Carey Radebaugh, Jamie Hall, Ramiro Leal-Cavazos, Jianmin Chen, Vikas Sindhwani, David Kao, David Greene, Jordan Griffith, Chris Welty, Ceslee Montgomery, Toshihiro Yoshino, Liangzhe Yuan, Noah Goodman, Assaf Hurwitz Michaely, Kevin Lee, KP Sawhney, Wei Chen, Zheng Zheng, Megan Shum, Nikolay Savinov, Etienne Pot, Alex Pak, Morteza Zadimoghaddam, Sijal Bhatnagar, Yoad Lewenberg, Blair Kutzman, Ji Liu, Lesley Katzen, Jeremy Selier, Josip Djolonga, Dmitry Lepikhin, Kelvin Xu, Jacky Liang, Jiewen Tan, Benoit Schillings, Muge Ersoy, Pete Blois, Bernd Bandemer, Abhimanyu Singh, Sergei Lebedev, Pankaj Joshi, Adam R. Brown, Evan Palmer, Shreya Pathak, Komal Jalan, Fedir Zubach, Shuba Lall, Randall Parker, Alok Gunjan, Sergey Rogulenko, Sumit Sanghai, Zhaoqi Leng, Zoltan Egyed, Shixin Li, Maria Ivanova, Kostas Andriopoulos, Jin Xie, Elan Rosenfeld, Auriel Wright, Ankur Sharma, Xinyang Geng, Yicheng Wang, Sam Kwei, Renke Pan, Yujing Zhang, Gabby Wang, Xi Liu, Chak Yeung, Elizabeth Cole, Aviv Rosenberg, Zhen Yang, Phil Chen, George Polovets, Pranav Nair, Rohun Saxena, Josh Smith, Shuo-yiin Chang, Aroma Mahendru, Svetlana Grant, Anand Iyer, Irene Cai, Jed McGiffin, Jiaming Shen, Alanna Walton, Antonious Girgis, Oliver Woodman, Rosemary Ke, Mike Kwong, Louis Rouillard, Jinmeng Rao, Zhihao Li, Yuntao Xu, Flavien Prost, Chi Zou, Ziwei Ji, Alberto Magni, Tyler Liechty, Dan A. Calian, Deepak Ramachandran, Igor Krivokon, Hui Huang, Terry Chen, Anja Hauth, Anastasija IliÄ, Weijuan Xi, Hyeontaek Lim, Vlad-Doru Ion, Pooya Moradi, Metin Toksoz-Exley, Kalesha Bullard, Miltos Allamanis, Xiaomeng Yang, Sophie Wang, Zhi Hong, Anita Gergely, Cheng Li, Bhavishya Mittal, Vitaly Kovalev, Victor Ungureanu, Jane Labanowski, Jan Wassenberg, Nicolas Lacasse, Geoffrey Cideron, Petar DeviÄ, Annie Marsden, Lynn Nguyen, Michael Fink, Yin Zhong, Tatsuya Kiyono, Desi Ivanov, Sally Ma, Max Bain, Kiran Yalasangi, Jennifer She, Anastasia Petrushkina, Mayank Lunayach, Carla Bromberg, Sarah Hodkinson, Vilobh Meshram, Daniel Vlasic, Austin Kyker, Steve Xu, Jeff Stanway, Zuguang Yang, Kai Zhao, Matthew Tung, Seth Odoom, Yasuhisa Fujii, Justin Gilmer, Eunyoung Kim, Felix Halim, Quoc Le, Bernd Bohnet, Seliem El-Sayed, Behnam Neyshabur, Malcolm Reynolds, Dean Reich, Yang Xu, Erica Moreira, Anuj Sharma, Zeyu Liu, Mohammad Javad Hosseini, Naina Raisinghani, Yi Su, Ni Lao, Daniel Formoso, Marco Gelmi, Almog Gueta, Tapomay Dey, Elena Gribovskaya, Domagoj Äevid, Sidharth Mudgal, Garrett Bingham, Jianling Wang, Anurag Kumar, Alex Cullum, Feng Han, Konstantinos Bousmalis, Diego Cedillo, Grace Chu, Vladimir Magay, Paul Michel, Ester Hlavnova, Daniele Calandriello, Setareh Ariafar, Kaisheng Yao, Vikash Sehwag, Arpi Vezer, Agustin Dal Lago, Zhenkai Zhu, Paul Kishan Rubenstein, Allen Porter, Anirudh Baddepudi, Oriana Riva, Mihai Dorin Istin, Chih-Kuan Yeh, Zhi Li, Andrew Howard, Nilpa Jha, Jeremy Chen, Raoul de Liedekerke, Zafarali Ahmed, Mikel Rodriguez, Tanuj Bhatia, Bangju Wang, Ali Elqursh, David Klinghoffer, Peter Chen, Pushmeet Kohli, Te I, Weiyang Zhang, Zack Nado, Jilin Chen, Maxwell Chen, George Zhang, Aayush Singh, Adam Hillier, Federico Lebron, Yiqing Tao, Ting Liu, Gabriel Dulac-Arnold, Jingwei Zhang, Shashi Narayan, Buhuang Liu, Orhan Firat, Abhishek Bhowmick, Bingyuan Liu, Hao Zhang, Zizhao Zhang, Georges Rotival, Nathan Howard, Anu Sinha, Alexander Grushetsky, Benjamin Beyret, Keerthana Gopalakrishnan, James Zhao, Kyle He, Szabolcs Payrits, Zaid Nabulsi, Zhaoyi Zhang, Weijie Chen, Edward Lee, Nova Fallen, Sreenivas Gollapudi, Aurick Zhou, Filip PavetiÄ, Thomas Köppe, Shiyu Huang, Rama Pasumarthi, Nick Fernando, Felix Fischer, Daria Äurko, Yang Gao, James Svensson, Austin Stone, Haroon Qureshi, Abhishek Sinha, Apoorv Kulshreshtha, Martin Matysiak, Jieming Mao, Carl Saroufim, Aleksandra Faust, Qingnan Duan, Gil Fidel, Kaan Katircioglu, Raphaël Lopez Kaufman, Dhruv Shah, Weize Kong, Abhishek Bapna, Gellért Weisz, Emma Dunleavy, Praneet Dutta, Tianqi Liu, Rahma Chaabouni, Carolina Parada, Marcus Wu, Alexandra Belias, Alessandro Bissacco, Stanislav Fort, Li Xiao, Fantine Huot, Chris Knutsen, Yochai Blau, Gang Li, Jennifer Prendki, Juliette Love, Yinlam Chow, Pichi Charoenpanit, Hidetoshi Shimokawa, Vincent Coriou, Karol Gregor, Tomas Izo, Arjun Akula, Mario Pinto, Chris Hahn, Dominik Paulus, Jiaxian Guo, Neha Sharma, Cho-Jui Hsieh, Adaeze Chukwuka, Kazuma Hashimoto, Nathalie Rauschmayr, Ling Wu, Christof Angermueller, Yulong Wang, Sebastian Gerlach, Michael Pliskin, Daniil Mirylenka, Min Ma, Lexi Baugher, Bryan Gale, Shaan Bijwadia, Nemanja RakiÄeviÄ, David Wood, Jane Park, Chung-Ching Chang, Babi Seal, Chris Tar, Kacper Krasowiak, Yiwen Song, Georgi Stephanov, Gary Wang, Marcello Maggioni, Stein Xudong Lin, Felix Wu, Shachi Paul, Zixuan Jiang, Shubham Agrawal, Bilal Piot, Alex Feng, Cheolmin Kim, Tulsee Doshi, Jonathan Lai, Chuqiao, Xu, Sharad Vikram, Ciprian Chelba, Sebastian Krause, Vincent Zhuang, Jack Rae, Timo Denk, Adrian Collister, Lotte Weerts, Xianghong Luo, Yifeng Lu, HÃ¥vard Garnes, Nitish Gupta, Terry Spitz, Avinatan Hassidim, Lihao Liang, Izhak Shafran, Peter Humphreys, Kenny Vassigh, Phil Wallis, Virat Shejwalkar, Nicolas Perez-Nieves, Rachel Hornung, Melissa Tan, Beka Westberg, Andy Ly, Richard Zhang, Brian Farris, Jongbin Park, Alec Kosik, Zeynep Cankara, Andrii Maksai, Yunhan Xu, Albin Cassirer, Sergi Caelles, Abbas Abdolmaleki, Mencher Chiang, Alex Fabrikant, Shravya Shetty, Luheng He, Mai Giménez, Hadi Hashemi, Sheena Panthaplackel, Yana Kulizhskaya, Salil Deshmukh, Daniele Pighin, Robin Alazard, Disha Jindal, Seb Noury, Pradeep Kumar S, Siyang Qin, Xerxes Dotiwalla, Stephen Spencer, Mohammad Babaeizadeh, Blake JianHang Chen, Vaibhav Mehta, Jennie Lees, Andrew Leach, Penporn Koanantakool, Ilia Akolzin, Ramona Comanescu, Junwhan Ahn, Alexey Svyatkovskiy, Basil Mustafa, David D'Ambrosio, Shiva Mohan Reddy Garlapati, Pascal Lamblin, Alekh Agarwal, Shuang Song, Pier Giuseppe Sessa, Pauline Coquinot, John Maggs, Hussain Masoom, Divya Pitta, Yaqing Wang, Patrick Morris-Suzuki, Billy Porter, Johnson Jia, Jeffrey Dudek, Raghavender R, Cosmin Paduraru, Alan Ansell, Tolga Bolukbasi, Tony Lu, Ramya Ganeshan, Zi Wang, Henry Griffiths, Rodrigo Benenson, Yifan He, James Swirhun, George Papamakarios, Aditya Chawla, Kuntal Sengupta, Yan Wang, Vedrana Milutinovic, Igor Mordatch, Zhipeng Jia, Jamie Smith, Will Ng, Shitij Nigam, Matt Young, Eugen VuÅ¡ak, Blake Hechtman, Sheela Goenka, Avital Zipori, Kareem Ayoub, Ashok Popat, Trilok Acharya, Luo Yu, Dawn Bloxwich, Hugo Song, Paul Roit, Haiqiong Li, Aviel Boag, Nigamaa Nayakanti, Bilva Chandra, Tianli Ding, Aahil Mehta, Cath Hope, Jiageng Zhang, Idan Heimlich Shtacher, Kartikeya Badola, Ryo Nakashima, Andrei Sozanschi, Iulia ComÅa, Ante Žužul, Emily Caveness, Julian Odell, Matthew Watson, Dario de Cesare, Phillip Lippe, Derek Lockhart, Siddharth Verma, Huizhong Chen, Sean Sun, Lin Zhuo, Aditya Shah, Prakhar Gupta, Alex Muzio, Ning Niu, Amir Zait, Abhinav Singh, Meenu Gaba, Fan Ye, Prajit Ramachandran, Mohammad Saleh, Raluca Ada Popa, Ayush Dubey, Frederick Liu, Sara Javanmardi, Mark Epstein, Ross Hemsley, Richard Green, Nishant Ranka, Eden Cohen, Chuyuan Kelly Fu, Sanjay Ghemawat, Jed Borovik, James Martens, Anthony Chen, Pranav Shyam, André Susano Pinto, Ming-Hsuan Yang, Alexandru Å¢ifrea, David Du, Boqing Gong, Ayushi Agarwal, Seungyeon Kim, Christian Frank, Saloni Shah, Xiaodan Song, Zhiwei Deng, Ales Mikhalap, Kleopatra Chatziprimou, Timothy Chung, Toni Creswell, Susan Zhang, Yennie Jun, Carl Lebsack, Will Truong, Slavica AndaÄiÄ, Itay Yona, Marco Fornoni, Rong Rong, Serge Toropov, Afzal Shama Soudagar, Andrew Audibert, Salah Zaiem, Zaheer Abbas, Andrei Rusu, Sahitya Potluri, Shitao Weng, Anastasios Kementsietsidis, Anton Tsitsulin, Daiyi Peng, Natalie Ha, Sanil Jain, Tejasi Latkar, Simeon Ivanov, Cory McLean, Anirudh GP, Rajesh Venkataraman, Canoee Liu, Dilip Krishnan, Joel D'sa, Roey Yogev, Paul Collins, Benjamin Lee, Lewis Ho, Carl Doersch, Gal Yona, Shawn Gao, Felipe Tiengo Ferreira, Adnan Ozturel, Hannah Muckenhirn, Ce Zheng, Gargi Balasubramaniam, Mudit Bansal, George van den Driessche, Sivan Eiger, Salem Haykal, Vedant Misra, Abhimanyu Goyal, Danilo Martins, Gary Leung, Jonas Valfridsson, Four Flynn, Will Bishop, Chenxi Pang, Yoni Halpern, Honglin Yu, Lawrence Moore, Yuvein, Zhu, Sridhar Thiagarajan, Yoel Drori, Zhisheng Xiao, Lucio Dery, Rolf Jagerman, Jing Lu, Eric Ge, Vaibhav Aggarwal, Arjun Khare, Vinh Tran, Oded Elyada, Ferran Alet, James Rubin, Ian Chou, David Tian, Libin Bai, Lawrence Chan, Lukasz Lew, Karolis Misiunas, Taylan Bilal, Aniket Ray, Sindhu Raghuram, Alex Castro-Ros, Viral Carpenter, CJ Zheng, Michael Kilgore, Josef Broder, Emily Xue, Praveen Kallakuri, Dheeru Dua, Nancy Yuen, Steve Chien, John Schultz, Saurabh Agrawal, Reut Tsarfaty, Jingcao Hu, Ajay Kannan, Dror Marcus, Nisarg Kothari, Baochen Sun, Ben Horn, Matko BoÅ¡njak, Ferjad Naeem, Dean Hirsch, Lewis Chiang, Boya Fang, Jie Han, Qifei Wang, Ben Hora, Antoine He, Mario LuÄiÄ, Beer Changpinyo, Anshuman Tripathi, John Youssef, Chester Kwak, Philippe Schlattner, Cat Graves, Rémi Leblond, Wenjun Zeng, Anders Andreassen, Gabriel Rasskin, Yue Song, Eddie Cao, Junhyuk Oh, Matt Hoffman, Wojtek Skut, Yichi Zhang, Jon Stritar, Xingyu Cai, Saarthak Khanna, Kathie Wang, Shriya Sharma, Christian Reisswig, Younghoon Jun, Aman Prasad, Tatiana Sholokhova, Preeti Singh, Adi Gerzi Rosenthal, Anian Ruoss, Françoise Beaufays, Sean Kirmani, Dongkai Chen, Johan Schalkwyk, Jonathan Herzig, Been Kim, Josh Jacob, Damien Vincent, Adrian N Reyes, Ivana Balazevic, Léonard Hussenot, Jon Schneider, Parker Barnes, Luis Castro, Spandana Raj Babbula, Simon Green, Serkan Cabi, Nico Duduta, Danny Driess, Rich Galt, Noam Velan, Junjie Wang, Hongyang Jiao, Matthew Mauger, Du Phan, Miteyan Patel, Vlado GaliÄ, Jerry Chang, Eyal Marcus, Matt Harvey, Julian Salazar, Elahe Dabir, Suraj Satishkumar Sheth, Amol Mandhane, Hanie Sedghi, Jeremiah Willcock, Amir Zandieh, Shruthi Prabhakara, Aida Amini, Antoine Miech, Victor Stone, Massimo Nicosia, Paul Niemczyk, Ying Xiao, Lucy Kim, SÅawek Kwasiborski, Vikas Verma, Ada Maksutaj Oflazer, Christoph Hirnschall, Peter Sung, Lu Liu, Richard Everett, Michiel Bakker, Ãgoston Weisz, Yufei Wang, Vivek Sampathkumar, Uri Shaham, Bibo Xu, Yasemin Altun, Mingqiu Wang, Takaaki Saeki, Guanjie Chen, Emanuel Taropa, Shanthal Vasanth, Sophia Austin, Lu Huang, Goran Petrovic, Qingyun Dou, Daniel Golovin, Grigory Rozhdestvenskiy, Allie Culp, Will Wu, Motoki Sano, Divya Jain, Julia Proskurnia, Sébastien Cevey, Alejandro Cruzado Ruiz, Piyush Patil, Mahdi Mirzazadeh, Eric Ni, Javier Snaider, Lijie Fan, Alexandre Fréchette, AJ Pierigiovanni, Shariq Iqbal, Kenton Lee, Claudio Fantacci, Jinwei Xing, Lisa Wang, Alex Irpan, David Raposo, Yi Luan, Zhuoyuan Chen, Harish Ganapathy, Kevin Hui, Jiazhong Nie, Isabelle Guyon, Heming Ge, Roopali Vij, Hui Zheng, Dayeong Lee, Alfonso Castaño, Khuslen Baatarsukh, Gabriel Ibagon, Alexandra Chronopoulou, Nicholas FitzGerald, Shashank Viswanadha, Safeen Huda, Rivka Moroshko, Georgi Stoyanov, Prateek Kolhar, Alain Vaucher, Ishaan Watts, Adhi Kuncoro, Henryk Michalewski, Satish Kambala, Bat-Orgil Batsaikhan, Alek Andreev, Irina Jurenka, Maigo Le, Qihang Chen, Wael Al Jishi, Sarah Chakera, Zhe Chen, Aditya Kini, Vikas Yadav, Aditya Siddhant, Ilia Labzovsky, Balaji Lakshminarayanan, Carrie Grimes Bostock, Pankil Botadra, Ankesh Anand, Colton Bishop, Sam Conway-Rahman, Mohit Agarwal, Yani Donchev, Achintya Singhal, Félix de Chaumont Quitry, Natalia Ponomareva, Nishant Agrawal, Bin Ni, Kalpesh Krishna, Masha Samsikova, John Karro, Yilun Du, Tamara von Glehn, Caden Lu, Christopher A. Choquette-Choo, Zhen Qin, Tingnan Zhang, Sicheng Li, Divya Tyam, Swaroop Mishra, Wing Lowe, Colin Ji, Weiyi Wang, Manaal Faruqui, Ambrose Slone, Valentin Dalibard, Arunachalam Narayanaswamy, John Lambert, Pierre-Antoine Manzagol, Dan Karliner, Andrew Bolt, Ivan Lobov, Aditya Kusupati, Chang Ye, Xuan Yang, Heiga Zen, Nelson George, Mukul Bhutani, Olivier Lacombe, Robert Riachi, Gagan Bansal, Rachel Soh, Yue Gao, Yang Yu, Adams Yu, Emily Nottage, Tania Rojas-Esponda, James Noraky, Manish Gupta, Ragha Kotikalapudi, Jichuan Chang, Sanja Deur, Dan Graur, Alex Mossin, Erin Farnese, Ricardo Figueira, Alexandre Moufarek, Austin Huang, Patrik Zochbauer, Ben Ingram, Tongzhou Chen, Zelin Wu, Adrià Puigdomènech, Leland Rechis, Da Yu, Sri Gayatri Sundara Padmanabhan, Rui Zhu, Chu-ling Ko, Andrea Banino, Samira Daruki, Aarush Selvan, Dhruva Bhaswar, Daniel Hernandez Diaz, Chen Su, Salvatore Scellato, Jennifer Brennan, Woohyun Han, Grace Chung, Priyanka Agrawal, Urvashi Khandelwal, Khe Chai Sim, Morgane Lustman, Sam Ritter, Kelvin Guu, Jiawei Xia, Prateek Jain, Emma Wang, Tyrone Hill, Mirko Rossini, Marija Kostelac, Tautvydas Misiunas, Amit Sabne, Kyuyeun Kim, Ahmet Iscen, Congchao Wang, José Leal, Ashwin Sreevatsa, Utku Evci, Manfred Warmuth, Saket Joshi, Daniel Suo, James Lottes, Garrett Honke, Brendan Jou, Stefani Karp, Jieru Hu, Himanshu Sahni, Adrien Ali Taïga, William Kong, Samrat Ghosh, Renshen Wang, Jay Pavagadhi, Natalie Axelsson, Nikolai Grigorev, Patrick Siegler, Rebecca Lin, Guohui Wang, Emilio Parisotto, Sharath Maddineni, Krishan Subudhi, Eyal Ben-David, Elena Pochernina, Orgad Keller, Thi Avrahami, Zhe Yuan, Pulkit Mehta, Jialu Liu, Sherry Yang, Wendy Kan, Katherine Lee, Tom Funkhouser, Derek Cheng, Hongzhi Shi, Archit Sharma, Joe Kelley, Matan Eyal, Yury Malkov, Corentin Tallec, Yuval Bahat, Shen Yan, Xintian, Wu, David Lindner, Chengda Wu, Avi Caciularu, Xiyang Luo, Rodolphe Jenatton, Tim Zaman, Yingying Bi, Ilya Kornakov, Ganesh Mallya, Daisuke Ikeda, Itay Karo, Anima Singh, Colin Evans, Praneeth Netrapalli, Vincent Nallatamby, Isaac Tian, Yannis Assael, Vikas Raunak, Victor Carbune, Ioana Bica, Lior Madmoni, Dee Cattle, Snchit Grover, Krishna Somandepalli, Sid Lall, Amelio Vázquez-Reina, Riccardo Patana, Jiaqi Mu, Pranav Talluri, Maggie Tran, Rajeev Aggarwal, RJ Skerry-Ryan, Jun Xu, Mike Burrows, Xiaoyue Pan, Edouard Yvinec, Di Lu, Zhiying Zhang, Duc Dung Nguyen, Hairong Mu, Gabriel Barcik, Helen Ran, Lauren Beltrone, Krzysztof Choromanski, Dia Kharrat, Samuel Albanie, Sean Purser-haskell, David Bieber, Carrie Zhang, Jing Wang, Tom Hudson, Zhiyuan Zhang, Han Fu, Johannes Mauerer, Mohammad Hossein Bateni, AJ Maschinot, Bing Wang, Muye Zhu, Arjun Pillai, Tobias Weyand, Shuang Liu, Oscar Akerlund, Fred Bertsch, Vittal Premachandran, Alicia Jin, Vincent Roulet, Peter de Boursac, Shubham Mittal, Ndaba Ndebele, Georgi Karadzhov, Sahra Ghalebikesabi, Ricky Liang, Allen Wu, Yale Cong, Nimesh Ghelani, Sumeet Singh, Bahar Fatemi, Warren, Chen, Charles Kwong, Alexey Kolganov, Steve Li, Richard Song, Chenkai Kuang, Sobhan Miryoosefi, Dale Webster, James Wendt, Arkadiusz Socala, Guolong Su, Artur Mendonça, Abhinav Gupta, Xiaowei Li, Tomy Tsai, Qiong, Hu, Kai Kang, Angie Chen, Sertan Girgin, Yongqin Xian, Andrew Lee, Nolan Ramsden, Leslie Baker, Madeleine Clare Elish, Varvara Krayvanova, Rishabh Joshi, Jiri Simsa, Yao-Yuan Yang, Piotr Ambroszczyk, Dipankar Ghosh, Arjun Kar, Yuan Shangguan, Yumeya Yamamori, Yaroslav Akulov, Andy Brock, Haotian Tang, Siddharth Vashishtha, Rich Munoz, Andreas Steiner, Kalyan Andra, Daniel Eppens, Qixuan Feng, Hayato Kobayashi, Sasha Goldshtein, Mona El Mahdy, Xin Wang, Jilei, Wang, Richard Killam, Tom Kwiatkowski, Kavya Kopparapu, Serena Zhan, Chao Jia, Alexei Bendebury, Sheryl Luo, Adrià Recasens, Timothy Knight, Jing Chen, Mohak Patel, YaGuang Li, Ben Withbroe, Dean Weesner, Kush Bhatia, Jie Ren, Danielle Eisenbud, Ebrahim Songhori, Yanhua Sun, Travis Choma, Tasos Kementsietsidis, Lucas Manning, Brian Roark, Wael Farhan, Jie Feng, Susheel Tatineni, James Cobon-Kerr, Yunjie Li, Lisa Anne Hendricks, Isaac Noble, Chris Breaux, Nate Kushman, Liqian Peng, Fuzhao Xue, Taylor Tobin, Jamie Rogers, Josh Lipschultz, Chris Alberti, Alexey Vlaskin, Mostafa Dehghani, Roshan Sharma, Tris Warkentin, Chen-Yu Lee, Benigno Uria, Da-Cheng Juan, Angad Chandorkar, Hila Sheftel, Ruibo Liu, Elnaz Davoodi, Borja De Balle Pigem, Kedar Dhamdhere, David Ross, Jonathan Hoech, Mahdis Mahdieh, Li Liu, Qiujia Li, Liam McCafferty, Chenxi Liu, Markus Mircea, Yunting Song, Omkar Savant, Alaa Saade, Colin Cherry, Vincent Hellendoorn, Siddharth Goyal, Paul Pucciarelli, David Vilar Torres, Zohar Yahav, Hyo Lee, Lars Lowe Sjoesund, Christo Kirov, Bo Chang, Deepanway Ghoshal, Lu Li, Gilles Baechler, Sébastien Pereira, Tara Sainath, Anudhyan Boral, Dominik Grewe, Afief Halumi, Nguyet Minh Phu, Tianxiao Shen, Marco Tulio Ribeiro, Dhriti Varma, Alex Kaskasoli, Vlad Feinberg, Navneet Potti, Jarrod Kahn, Matheus Wisniewski, Shakir Mohamed, Arnar Mar Hrafnkelsson, Bobak Shahriari, Jean-Baptiste Lespiau, Lisa Patel, Legg Yeung, Tom Paine, Lantao Mei, Alex Ramirez, Rakesh Shivanna, Li Zhong, Josh Woodward, Guilherme Tubone, Samira Khan, Heng Chen, Elizabeth Nielsen, Catalin Ionescu, Utsav Prabhu, Mingcen Gao, Qingze Wang, Sean Augenstein, Neesha Subramaniam, Jason Chang, Fotis Iliopoulos, Jiaming Luo, Myriam Khan, Weicheng Kuo, Denis Teplyashin, Florence Perot, Logan Kilpatrick, Amir Globerson, Hongkun Yu, Anfal Siddiqui, Nick Sukhanov, Arun Kandoor, Umang Gupta, Marco Andreetto, Moran Ambar, Donnie Kim, PaweÅ WesoÅowski, Sarah Perrin, Ben Limonchik, Wei Fan, Jim Stephan, Ian Stewart-Binks, Ryan Kappedal, Tong He, Sarah Cogan, Romina Datta, Tong Zhou, Jiayu Ye, Leandro Kieliger, Ana Ramalho, Kyle Kastner, Fabian Mentzer, Wei-Jen Ko, Arun Suggala, Tianhao Zhou, Shiraz Butt, Hana StrejÄek, Lior Belenki, Subhashini Venugopalan, Mingyang Ling, Evgenii Eltyshev, Yunxiao Deng, Geza Kovacs, Mukund Raghavachari, Hanjun Dai, Tal Schuster, Steven Schwarcz, Richard Nguyen, Arthur Nguyen, Gavin Buttimore, Shrestha Basu Mallick, Sudeep Gandhe, Seth Benjamin, Michal Jastrzebski, Le Yan, Sugato Basu, Chris Apps, Isabel Edkins, James Allingham, Immanuel Odisho, Tomas Kocisky, Jewel Zhao, Linting Xue, Apoorv Reddy, Chrysovalantis Anastasiou, Aviel Atias, Sam Redmond, Kieran Milan, Nicolas Heess, Herman Schmit, Allan Dafoe, Daniel Andor, Tynan Gangwani, Anca Dragan, Sheng Zhang, Ashyana Kachra, Gang Wu, Siyang Xue, Kevin Aydin, Siqi Liu, Yuxiang Zhou, Mahan Malihi, Austin Wu, Siddharth Gopal, Candice Schumann, Peter Stys, Alek Wang, Mirek Olšák, Dangyi Liu, Christian Schallhart, Yiran Mao, Demetra Brady, Hao Xu, Tomas Mery, Chawin Sitawarin, Siva Velusamy, Tom Cobley, Alex Zhai, Christian Walder, Nitzan Katz, Ganesh Jawahar, Chinmay Kulkarni, Antoine Yang, Adam Paszke, Yinan Wang, Bogdan Damoc, Zalán Borsos, Ray Smith, Jinning Li, Mansi Gupta, Andrei Kapishnikov, Sushant Prakash, Florian Luisier, Rishabh Agarwal, Will Grathwohl, Kuangyuan Chen, Kehang Han, Nikhil Mehta, Andrew Over, Shekoofeh Azizi, Lei Meng, Niccolò Dal Santo, Kelvin Zheng, Jane Shapiro, Igor Petrovski, Jeffrey Hui, Amin Ghafouri, Jasper Snoek, James Qin, Mandy Jordan, Caitlin Sikora, Jonathan Malmaud, Yuheng Kuang, Aga Åwietlik, Ruoxin Sang, Chongyang Shi, Leon Li, Andrew Rosenberg, Shubin Zhao, Andy Crawford, Jan-Thorsten Peter, Yun Lei, Xavier Garcia, Long Le, Todd Wang, Julien Amelot, Dave Orr, Praneeth Kacham, Dana Alon, Gladys Tyen, Abhinav Arora, James Lyon, Alex Kurakin, Mimi Ly, Theo Guidroz, Zhipeng Yan, Rina Panigrahy, Pingmei Xu, Thais Kagohara, Yong Cheng, Eric Noland, Jinhyuk Lee, Jonathan Lee, Cathy Yip, Maria Wang, Efrat Nehoran, Alexander Bykovsky, Zhihao Shan, Ankit Bhagatwala, Chaochao Yan, Jie Tan, Guillermo Garrido, Dan Ethier, Nate Hurley, Grace Vesom, Xu Chen, Siyuan Qiao, Abhishek Nayyar, Julian Walker, Paramjit Sandhu, Mihaela Rosca, Danny Swisher, Mikhail Dektiarev, Josh Dillon, George-Cristian Muraru, Manuel Tragut, Artiom Myaskovsky, David Reid, Marko Velic, Owen Xiao, Jasmine George, Mark Brand, Jing Li, Wenhao Yu, Shane Gu, Xiang Deng, François-Xavier Aubet, Soheil Hassas Yeganeh, Fred Alcober, Celine Smith, Trevor Cohn, Kay McKinney, Michael Tschannen, Ramesh Sampath, Gowoon Cheon, Liangchen Luo, Luyang Liu, Jordi Orbay, Hui Peng, Gabriela Botea, Xiaofan Zhang, Charles Yoon, Cesar Magalhaes, PaweÅ Stradomski, Ian Mackinnon, Steven Hemingray, Kumaran Venkatesan, Rhys May, Jaeyoun Kim, Alex Druinsky, Jingchen Ye, Zheng Xu, Terry Huang, Jad Al Abdallah, Adil Dostmohamed, Rachana Fellinger, Tsendsuren Munkhdalai, Akanksha Maurya, Peter Garst, Yin Zhang, Maxim Krikun, Simon Bucher, Aditya Srikanth Veerubhotla, Yaxin Liu, Sheng Li, Nishesh Gupta, Jakub Adamek, Hanwen Chen, Bernett Orlando, Aleksandr Zaks, Joost van Amersfoort, Josh Camp, Hui Wan, HyunJeong Choe, Zhichun Wu, Kate Olszewska, Weiren Yu, Archita Vadali, Martin Scholz, Daniel De Freitas, Jason Lin, Amy Hua, Xin Liu, Frank Ding, Yichao Zhou, Boone Severson, Katerina Tsihlas, Samuel Yang, Tammo Spalink, Varun Yerram, Helena Pankov, Rory Blevins, Ben Vargas, Sarthak Jauhari, Matt Miecnikowski, Ming Zhang, Sandeep Kumar, Clement Farabet, Charline Le Lan, Sebastian Flennerhag, Yonatan Bitton, Ada Ma, Arthur Bražinskas, Eli Collins, Niharika Ahuja, Sneha Kudugunta, Anna Bortsova, Minh Giang, Wanzheng Zhu, Ed Chi, Scott Lundberg, Alexey Stern, Subha Puttagunta, Jing Xiong, Xiao Wu, Yash Pande, Amit Jhindal, Daniel Murphy, Jon Clark, Marc Brockschmidt, Maxine Deines, Kevin R. McKee, Dan Bahir, Jiajun Shen, Minh Truong, Daniel McDuff, Andrea Gesmundo, Edouard Rosseel, Bowen Liang, Ken Caluwaerts, Jessica Hamrick, Joseph Kready, Mary Cassin, Rishikesh Ingale, Li Lao, Scott Pollom, Yifan Ding, Wei He, Lizzetth Bellot, Joana Iljazi, Ramya Sree Boppana, Shan Han, Tara Thompson, Amr Khalifa, Anna Bulanova, Blagoj Mitrevski, Bo Pang, Emma Cooney, Tian Shi, Rey Coaguila, Tamar Yakar, Marc'aurelio Ranzato, Nikola Momchev, Chris Rawles, Zachary Charles, Young Maeng, Yuan Zhang, Rishabh Bansal, Xiaokai Zhao, Brian Albert, Yuan Yuan, Sudheendra Vijayanarasimhan, Roy Hirsch, Vinay Ramasesh, Kiran Vodrahalli, Xingyu Wang, Arushi Gupta, DJ Strouse, Jianmo Ni, Roma Patel, Gabe Taubman, Zhouyuan Huo, Dero Gharibian, Marianne Monteiro, Hoi Lam, Shobha Vasudevan, Aditi Chaudhary, Isabela Albuquerque, Kilol Gupta, Sebastian Riedel, Chaitra Hegde, Avraham Ruderman, András György, Marcus Wainwright, Ashwin Chaugule, Burcu Karagol Ayan, Tomer Levinboim, Sam Shleifer, Yogesh Kalley, Vahab Mirrokni, Abhishek Rao, Prabakar Radhakrishnan, Jay Hartford, Jialin Wu, Zhenhai Zhu, Francesco Bertolini, Hao Xiong, Nicolas Serrano, Hamish Tomlinson, Myle Ott, Yifan Chang, Mark Graham, Jian Li, Marco Liang, Xiangzhu Long, Sebastian Borgeaud, Yanif Ahmad, Alex Grills, Diana Mincu, Martin Izzard, Yuan Liu, Jinyu Xie, Louis O'Bryan, Sameera Ponda, Simon Tong, Michelle Liu, Dan Malkin, Khalid Salama, Yuankai Chen, Rohan Anil, Anand Rao, Rigel Swavely, Misha Bilenko, Nina Anderson, Tat Tan, Jing Xie, Xing Wu, Lijun Yu, Oriol Vinyals, Andrey Ryabtsev, Rumen Dangovski, Kate Baumli, Daniel Keysers, Christian Wright, Zoe Ashwood, Betty Chan, Artem Shtefan, Yaohui Guo, Ankur Bapna, Radu Soricut, Steven Pecht, Sabela Ramos, Rui Wang, Jiahao Cai, Trieu Trinh, Paul Barham, Linda Friso, Eli Stickgold, Xiangzhuo Ding, Siamak Shakeri, Diego Ardila, Eleftheria Briakou, Phil Culliton, Adam Raveret, Jingyu Cui, David Saxton, Subhrajit Roy, Javad Azizi, Pengcheng Yin, Lucia Loher, Andrew Bunner, Min Choi, Faruk Ahmed, Eric Li, Yin Li, Shengyang Dai, Michael Elabd, Sriram Ganapathy, Shivani Agrawal, Yiqing Hua, Paige Kunkle, Sujeevan Rajayogam, Arun Ahuja, Arthur Conmy, Alex Vasiloff, Parker Beak, Christopher Yew, Jayaram Mudigonda, Bartek Wydrowski, Jon Blanton, Zhengdong Wang, Yann Dauphin, Zhuo Xu, Martin Polacek, Xi Chen, Hexiang Hu, Pauline Sho, Markus Kunesch, Mehdi Hafezi Manshadi, Eliza Rutherford, Bo Li, Sissie Hsiao, Iain Barr, Alex Tudor, Matija Kecman, Arsha Nagrani, Vladimir Pchelin, Martin Sundermeyer, Aishwarya P S, Abhijit Karmarkar, Yi Gao, Grishma Chole, Olivier Bachem, Isabel Gao, Arturo BC, Matt Dibb, Mauro Verzetti, Felix Hernandez-Campos, Yana Lunts, Matthew Johnson, Julia Di Trapani, Raphael Koster, Idan Brusilovsky, Binbin Xiong, Megha Mohabey, Han Ke, Joe Zou, Tea SaboliÄ, VÃctor Campos, John Palowitch, Alex Morris, Linhai Qiu, Pranavaraj Ponnuramu, Fangtao Li, Vivek Sharma, Kiranbir Sodhia, Kaan Tekelioglu, Aleksandr Chuklin, Madhavi Yenugula, Erika Gemzer, Theofilos Strinopoulos, Sam El-Husseini, Huiyu Wang, Yan Zhong, Edouard Leurent, Paul Natsev, Weijun Wang, Dre Mahaarachchi, Tao Zhu, Songyou Peng, Sami Alabed, Cheng-Chun Lee, Anthony Brohan, Arthur Szlam, GS Oh, Anton Kovsharov, Jenny Lee, Renee Wong, Megan Barnes, Gregory Thornton, Felix Gimeno, Omer Levy, Martin Sevenich, Melvin Johnson, Jonathan Mallinson, Robert Dadashi, Ziyue Wang, Qingchun Ren, Preethi Lahoti, Arka Dhar, Josh Feldman, Dan Zheng, Thatcher Ulrich, Liviu Panait, Michiel Blokzijl, Cip Baetu, Josip Matak, Jitendra Harlalka, Maulik Shah, Tal Marian, Daniel von Dincklage, Cosmo Du, Ruy Ley-Wild, Bethanie Brownfield, Max Schumacher, Yury Stuken, Shadi Noghabi, Sonal Gupta, Xiaoqi Ren, Eric Malmi, Felix Weissenberger, Blanca Huergo, Maria Bauza, Thomas Lampe, Arthur Douillard, Mojtaba Seyedhosseini, Roy Frostig, Zoubin Ghahramani, Kelvin Nguyen, Kashyap Krishnakumar, Chengxi Ye, Rahul Gupta, Alireza Nazari, Robert Geirhos, Pete Shaw, Ahmed Eleryan, Dima Damen, Jennimaria Palomaki, Ted Xiao, Qiyin Wu, Quan Yuan, Phoenix Meadowlark, Matthew Bilotti, Raymond Lin, Mukund Sridhar, Yannick Schroecker, Da-Woon Chung, Jincheng Luo, Trevor Strohman, Tianlin Liu, Anne Zheng, Jesse Emond, Wei Wang, Andrew Lampinen, Toshiyuki Fukuzawa, Folawiyo Campbell-Ajala, Monica Roy, James Lee-Thorp, Lily Wang, Iftekhar Naim, Tony, Nguy\~ên, Guy Bensky, Aditya Gupta, Dominika RogoziÅska, Justin Fu, Thanumalayan Sankaranarayana Pillai, Petar VeliÄkoviÄ, Shahar Drath, Philipp Neubeck, Vaibhav Tulsyan, Arseniy Klimovskiy, Don Metzler, Sage Stevens, Angel Yeh, Junwei Yuan, Tianhe Yu, Kelvin Zhang, Alec Go, Vincent Tsang, Ying Xu, Andy Wan, Isaac Galatzer-Levy, Sam Sobell, Abodunrinwa Toki, Elizabeth Salesky, Wenlei Zhou, Diego Antognini, Sholto Douglas, Shimu Wu, Adam Lelkes, Frank Kim, Paul Cavallaro, Ana Salazar, Yuchi Liu, James Besley, Tiziana Refice, Yiling Jia, Zhang Li, Michal Sokolik, Arvind Kannan, Jon Simon, Jo Chick, Avia Aharon, Meet Gandhi, Mayank Daswani, Keyvan Amiri, Vighnesh Birodkar, Abe Ittycheriah, Peter Grabowski, Oscar Chang, Charles Sutton, Zhixin, Lai, Umesh Telang, Susie Sargsyan, Tao Jiang, Raphael Hoffmann, Nicole Brichtova, Matteo Hessel, Jonathan Halcrow, Sammy Jerome, Geoff Brown, Alex Tomala, Elena Buchatskaya, Dian Yu, Sachit Menon, Pol Moreno, Yuguo Liao, Vicky Zayats, Luming Tang, SQ Mah, Ashish Shenoy, Alex Siegman, Majid Hadian, Okwan Kwon, Tao Tu, Nima Khajehnouri, Ryan Foley, Parisa Haghani, Zhongru Wu, Vaishakh Keshava, Khyatti Gupta, Tony Bruguier, Rui Yao, Danny Karmon, Luisa Zintgraf, Zhicheng Wang, Enrique Piqueras, Junehyuk Jung, Jenny Brennan, Diego Machado, Marissa Giustina, MH Tessler, Kamyu Lee, Qiao Zhang, Joss Moore, Kaspar Daugaard, Alexander Frömmgen, Jennifer Beattie, Fred Zhang, Daniel Kasenberg, Ty Geri, Danfeng Qin, Gaurav Singh Tomar, Tom Ouyang, Tianli Yu, Luowei Zhou, Rajiv Mathews, Andy Davis, Yaoyiran Li, Jai Gupta, Damion Yates, Linda Deng, Elizabeth Kemp, Ga-Young Joung, Sergei Vassilvitskii, Mandy Guo, Pallavi LV, Dave Dopson, Sami Lachgar, Lara McConnaughey, Himadri Choudhury, Dragos Dena, Aaron Cohen, Joshua Ainslie, Sergey Levi, Parthasarathy Gopavarapu, Polina Zablotskaia, Hugo Vallet, Sanaz Bahargam, Xiaodan Tang, Nenad Tomasev, Ethan Dyer, Daniel Balle, Hongrae Lee, William Bono, Jorge Gonzalez Mendez, Vadim Zubov, Shentao Yang, Ivor Rendulic, Yanyan Zheng, Andrew Hogue, Golan Pundak, Ralph Leith, Avishkar Bhoopchand, Michael Han, Mislav ŽaniÄ, Tom Schaul, Manolis Delakis, Tejas Iyer, Guanyu Wang, Harman Singh, Abdelrahman Abdelhamed, Tara Thomas, Siddhartha Brahma, Hilal Dib, Naveen Kumar, Wenxuan Zhou, Liang Bai, Pushkar Mishra, Jiao Sun, Valentin Anklin, Roykrong Sukkerd, Lauren Agubuzu, Anton Briukhov, Anmol Gulati, Maximilian Sieb, Fabio Pardo, Sara Nasso, Junquan Chen, Kexin Zhu, Tiberiu Sosea, Alex Goldin, Keith Rush, Spurthi Amba Hombaiah, Andreas Noever, Allan Zhou, Sam Haves, Mary Phuong, Jake Ades, Yi-ting Chen, Lin Yang, Joseph Pagadora, Stan Bileschi, Victor Cotruta, Rachel Saputro, Arijit Pramanik, Sean Ammirati, Dan Garrette, Kevin Villela, Tim Blyth, Canfer Akbulut, Neha Jha, Alban Rrustemi, Arissa Wongpanich, Chirag Nagpal, Yonghui Wu, Morgane Rivière, Sergey Kishchenko, Pranesh Srinivasan, Alice Chen, Animesh Sinha, Trang Pham, Bill Jia, Tom Hennigan, Anton Bakalov, Nithya Attaluri, Drew Garmon, Daniel Rodriguez, Dawid Wegner, Wenhao Jia, Evan Senter, Noah Fiedel, Denis Petek, Yuchuan Liu, Cassidy Hardin, Harshal Tushar Lehri, Joao Carreira, Sara Smoot, Marcel Prasetya, Nami Akazawa, Anca Stefanoiu, Chia-Hua Ho, Anelia Angelova, Kate Lin, Min Kim, Charles Chen, Marcin Sieniek, Alice Li, Tongfei Guo, Sorin Baltateanu, Pouya Tafti, Michael Wunder, Nadav Olmert, Divyansh Shukla, Jingwei Shen, Neel Kovelamudi, Balaji Venkatraman, Seth Neel, Romal Thoppilan, Jerome Connor, Frederik Benzing, Axel Stjerngren, Golnaz Ghiasi, Alex Polozov, Joshua Howland, Theophane Weber, Justin Chiu, Ganesh Poomal Girirajan, Andreas Terzis, Pidong Wang, Fangda Li, Yoav Ben Shalom, Dinesh Tewari, Matthew Denton, Roee Aharoni, Norbert Kalb, Heri Zhao, Junlin Zhang, Angelos Filos, Matthew Rahtz, Lalit Jain, Connie Fan, Vitor Rodrigues, Ruth Wang, Richard Shin, Jacob Austin, Roman Ring, Mariella Sanchez-Vargas, Mehadi Hassen, Ido Kessler, Uri Alon, Gufeng Zhang, Wenhu Chen, Yenai Ma, Xiance Si, Le Hou, Azalia Mirhoseini, Marc Wilson, Geoff Bacon, Becca Roelofs, Lei Shu, Gautam Vasudevan, Jonas Adler, Artur Dwornik, Tayfun Terzi, Matt Lawlor, Harry Askham, Mike Bernico, Xuanyi Dong, Chris Hidey, Kevin Kilgour, Gaël Liu, Surya Bhupatiraju, Luke Leonhard, Siqi Zuo, Partha Talukdar, Qing Wei, Aliaksei Severyn, VÃt ListÃk, Jong Lee, Aditya Tripathi, SK Park, Yossi Matias, Hao Liu, Alex Ruiz, Rajesh Jayaram, Jackson Tolins, Pierre Marcenac, Yiming Wang, Bryan Seybold, Henry Prior, Deepak Sharma, Jack Weber, Mikhail Sirotenko, Yunhsuan Sung, Dayou Du, Ellie Pavlick, Stefan Zinke, Markus Freitag, Max Dylla, Montse Gonzalez Arenas, Natan Potikha, Omer Goldman, Connie Tao, Rachita Chhaparia, Maria Voitovich, Pawan Dogra, Andrija RažnatoviÄ, Zak Tsai, Chong You, Oleaser Johnson, George Tucker, Chenjie Gu, Jae Yoo, Maryam Majzoubi, Valentin Gabeur, Bahram Raad, Rocky Rhodes, Kashyap Kolipaka, Heidi Howard, Geta Sampemane, Benny Li, Chulayuth Asawaroengchai, Duy Nguyen, Chiyuan Zhang, Timothee Cour, Xinxin Yu, Zhao Fu, Joe Jiang, Po-Sen Huang, Gabriela Surita, Iñaki Iturrate, Yael Karov, Michael Collins, Martin Baeuml, Fabian Fuchs, Shilpa Shetty, Swaroop Ramaswamy, Sayna Ebrahimi, Qiuchen Guo, Jeremy Shar, Gabe Barth-Maron, Sravanti Addepalli, Bryan Richter, Chin-Yi Cheng, Eugénie Rives, Fei Zheng, Johannes Griesser, Nishanth Dikkala, Yoel Zeldes, Ilkin Safarli, Dipanjan Das, Himanshu Srivastava, Sadh MNM Khan, Xin Li, Aditya Pandey, Larisa Markeeva, Dan Belov, Qiqi Yan, MikoÅaj RybiÅski, Tao Chen, Megha Nawhal, Michael Quinn, Vineetha Govindaraj, Sarah York, Reed Roberts, Roopal Garg, Namrata Godbole, Jake Abernethy, Anil Das, Lam Nguyen Thiet, Jonathan Tompson, John Nham, Neera Vats, Ben Caine, Wesley Helmholz, Francesco Pongetti, Yeongil Ko, James An, Clara Huiyi Hu, Yu-Cheng Ling, Julia Pawar, Robert Leland, Keisuke Kinoshita, Waleed Khawaja, Marco Selvi, Eugene Ie, Danila Sinopalnikov, Lev Proleev, Nilesh Tripuraneni, Michele Bevilacqua, Seungji Lee, Clayton Sanford, Dan Suh, Dustin Tran, Jeff Dean, Simon Baumgartner, Jens Heitkaemper, Sagar Gubbi, Kristina Toutanova, Yichong Xu, Chandu Thekkath, Keran Rong, Palak Jain, Annie Xie, Yan Virin, Yang Li, Lubo Litchev, Richard Powell, Tarun Bharti, Adam Kraft, Nan Hua, Marissa Ikonomidis, Ayal Hitron, Sanjiv Kumar, Loic Matthey, Sophie Bridgers, Lauren Lax, Ishaan Malhi, Ondrej Skopek, Ashish Gupta, Jiawei Cao, Mitchelle Rasquinha, Siim Põder, Wojciech Stokowiec, Nicholas Roth, Guowang Li, Michaël Sander, Joshua Kessinger, Vihan Jain, Edward Loper, Wonpyo Park, Michal Yarom, Liqun Cheng, Guru Guruganesh, Kanishka Rao, Yan Li, Catarina Barros, Mikhail Sushkov, Chun-Sung Ferng, Rohin Shah, Ophir Aharoni, Ravin Kumar, Tim McConnell, Peiran Li, Chen Wang, Fernando Pereira, Craig Swanson, Fayaz Jamil, Yan Xiong, Anitha Vijayakumar, Prakash Shroff, Kedar Soparkar, Jindong Gu, Livio Baldini Soares, Eric Wang, Kushal Majmundar, Aurora Wei, Kai Bailey, Nora Kassner, Chizu Kawamoto, Goran ŽužiÄ, Victor Gomes, Abhirut Gupta, Michael Guzman, Ishita Dasgupta, Xinyi Bai, Zhufeng Pan, Francesco Piccinno, Hadas Natalie Vogel, Octavio Ponce, Adrian Hutter, Paul Chang, Pan-Pan Jiang, Ionel Gog, Vlad Ionescu, James Manyika, Fabian Pedregosa, Harry Ragan, Zach Behrman, Ryan Mullins, Coline Devin, Aroonalok Pyne, Swapnil Gawde, Martin Chadwick, Yiming Gu, Sasan Tavakkol, Andy Twigg, Naman Goyal, Ndidi Elue, Anna Goldie, Srinivasan Venkatachary, Hongliang Fei, Ziqiang Feng, Marvin Ritter, Isabel Leal, Sudeep Dasari, Pei Sun, Alif Raditya Rochman, Brendan O'Donoghue, Yuchen Liu, Jim Sproch, Kai Chen, Natalie Clay, Slav Petrov, Sailesh Sidhwani, Ioana Mihailescu, Alex Panagopoulos, AJ Piergiovanni, Yunfei Bai, George Powell, Deep Karkhanis, Trevor Yacovone, Petr Mitrichev, Joe Kovac, Dave Uthus, Amir Yazdanbakhsh, David Amos, Steven Zheng, Bing Zhang, Jin Miao, Bhuvana Ramabhadran, Soroush Radpour, Shantanu Thakoor, Josh Newlan, Oran Lang, Orion Jankowski, Shikhar Bharadwaj, Jean-Michel Sarr, Shereen Ashraf, Sneha Mondal, Jun Yan, Ankit Singh Rawat, Sarmishta Velury, Greg Kochanski, Tom Eccles, Franz Och, Abhanshu Sharma, Ethan Mahintorabi, Alex Gurney, Carrie Muir, Vered Cohen, Saksham Thakur, Adam Bloniarz, Asier Mujika, Alexander Pritzel, Paul Caron, Altaf Rahman, Fiona Lang, Yasumasa Onoe, Petar Sirkovic, Jay Hoover, Ying Jian, Pablo Duque, Arun Narayanan, David Soergel, Alex Haig, Loren Maggiore, Shyamal Buch, Josef Dean, Ilya Figotin, Igor Karpov, Shaleen Gupta, Denny Zhou, Muhuan Huang, Ashwin Vaswani, Christopher Semturs, Kaushik Shivakumar, Yu Watanabe, Vinodh Kumar Rajendran, Eva Lu, Yanhan Hou, Wenting Ye, Shikhar Vashishth, Nana Nti, Vytenis Sakenas, Darren Ni, Doug DeCarlo, Michael Bendersky, Sumit Bagri, Nacho Cano, Elijah Peake, Simon Tokumine, Varun Godbole, Carlos GuÃa, Tanya Lando, Vittorio Selo, Seher Ellis, Danny Tarlow, Daniel Gillick, Alessandro Epasto, Siddhartha Reddy Jonnalagadda, Meng Wei, Meiyan Xie, Ankur Taly, Michela Paganini, Mukund Sundararajan, Daniel Toyama, Ting Yu, Dessie Petrova, Aneesh Pappu, Rohan Agrawal, Senaka Buthpitiya, Justin Frye, Thomas Buschmann, Remi Crocker, Marco Tagliasacchi, Mengchao Wang, Da Huang, Sagi Perel, Brian Wieder, Hideto Kazawa, Weiyue Wang, Jeremy Cole, Himanshu Gupta, Ben Golan, Seojin Bang, Nitish Kulkarni, Ken Franko, Casper Liu, Doug Reid, Sid Dalmia, Jay Whang, Kevin Cen, Prasha Sundaram, Johan Ferret, Berivan Isik, Lucian Ionita, Guan Sun, Anna Shekhawat, Muqthar Mohammad, Philip Pham, Ronny Huang, Karthik Raman, Xingyi Zhou, Ross Mcilroy, Austin Myers, Sheng Peng, Jacob Scott, Paul Covington, Sofia Erell, Pratik Joshi, João Gabriel Oliveira, Natasha Noy, Tajwar Nasir, Jake Walker, Vera Axelrod, Tim Dozat, Pu Han, Chun-Te Chu, Eugene Weinstein, Anand Shukla, Shreyas Chandrakaladharan, Petra Poklukar, Bonnie Li, Ye Jin, Prem Eruvbetine, Steven Hansen, Avigail Dabush, Alon Jacovi, Samrat Phatale, Chen Zhu, Steven Baker, Mo Shomrat, Yang Xiao, Jean Pouget-Abadie, Mingyang Zhang, Fanny Wei, Yang Song, Helen King, Yiling Huang, Yun Zhu, Ruoxi Sun, Juliana Vicente Franco, Chu-Cheng Lin, Sho Arora, Hui, Li, Vivian Xia, Luke Vilnis, Mariano Schain, Kaiz Alarakyia, Laurel Prince, Aaron Phillips, Caleb Habtegebriel, Luyao Xu, Huan Gui, Santiago Ontanon, Lora Aroyo, Karan Gill, Peggy Lu, Yash Katariya, Dhruv Madeka, Shankar Krishnan, Shubha Srinivas Raghvendra, James Freedman, Yi Tay, Gaurav Menghani, Peter Choy, Nishita Shetty, Dan Abolafia, Doron Kukliansky, Edward Chou, Jared Lichtarge, Ken Burke, Ben Coleman, Dee Guo, Larry Jin, Indro Bhattacharya, Victoria Langston, Yiming Li, Suyog Kotecha, Alex Yakubovich, Xinyun Chen, Petre Petrov, Tolly Powell, Yanzhang He, Corbin Quick, Kanav Garg, Dawsen Hwang, Yang Lu, Srinadh Bhojanapalli, Kristian Kjems, Ramin Mehran, Aaron Archer, Hado van Hasselt, Ashwin Balakrishna, JK Kearns, Meiqi Guo, Jason Riesa, Mikita Sazanovich, Xu Gao, Chris Sauer, Chengrun Yang, XiangHai Sheng, Thomas Jimma, Wouter Van Gansbeke, Vitaly Nikolaev, Wei Wei, Katie Millican, Ruizhe Zhao, Justin Snyder, Levent Bolelli, Maura O'Brien, Shawn Xu, Fei Xia, Wentao Yuan, Arvind Neelakantan, David Barker, Sachin Yadav, Hannah Kirkwood, Farooq Ahmad, Joel Wee, Jordan Grimstad, Boyu Wang, Matthew Wiethoff, Shane Settle, Miaosen Wang, Charles Blundell, Jingjing Chen, Chris Duvarney, Grace Hu, Olaf Ronneberger, Alex Lee, Yuanzhen Li, Abhishek Chakladar, Alena Butryna, Georgios Evangelopoulos, Guillaume Desjardins, Jonni Kanerva, Henry Wang, Averi Nowak, Nick Li, Alyssa Loo, Art Khurshudov, Laurent El Shafey, Nagabhushan Baddi, Karel Lenc, Yasaman Razeghi, Tom Lieber, Amer Sinha, Xiao Ma, Yao Su, James Huang, Asahi Ushio, Hanna Klimczak-PluciÅska, Kareem Mohamed, JD Chen, Simon Osindero, Stav Ginzburg, Lampros Lamprou, Vasilisa Bashlovkina, Duc-Hieu Tran, Ali Khodaei, Ankit Anand, Yixian Di, Ramy Eskander, Manish Reddy Vuyyuru, Jasmine Liu, Aishwarya Kamath, Roman Goldenberg, Mathias Bellaiche, Juliette Pluto, Bill Rosgen, Hassan Mansoor, William Wong, Suhas Ganesh, Eric Bailey, Scott Baird, Dan Deutsch, Jinoo Baek, Xuhui Jia, Chansoo Lee, Abe Friesen, Nathaniel Braun, Kate Lee, Amayika Panda, Steven M. Hernandez, Duncan Williams, Jianqiao Liu, Ethan Liang, Arnaud Autef, Emily Pitler, Deepali Jain, Phoebe Kirk, Oskar Bunyan, Jaume Sanchez Elias, Tongxin Yin, Machel Reid, Aedan Pope, Nikita Putikhin, Bidisha Samanta, Sergio Guadarrama, Dahun Kim, Simon Rowe, Marcella Valentine, Geng Yan, Alex Salcianu, David Silver, Gan Song, Richa Singh, Shuai Ye, Hannah DeBalsi, Majd Al Merey, Eran Ofek, Albert Webson, Shibl Mourad, Ashwin Kakarla, Silvio Lattanzi, Nick Roy, Evgeny Sluzhaev, Christina Butterfield, Alessio Tonioni, Nathan Waters, Sudhindra Kopalle, Jason Chase, James Cohan, Girish Ramchandra Rao, Robert Berry, Michael Voznesensky, Shuguang Hu, Kristen Chiafullo, Sharat Chikkerur, George Scrivener, Ivy Zheng, Jeremy Wiesner, Wolfgang Macherey, Timothy Lillicrap, Fei Liu, Brian Walker, David Welling, Elinor Davies, Yangsibo Huang, Lijie Ren, Nir Shabat, Alessandro Agostini, Mariko Iinuma, Dustin Zelle, Rohit Sathyanarayana, Andrea D'olimpio, Morgan Redshaw, Matt Ginsberg, Ashwin Murthy, Mark Geller, Tatiana Matejovicova, Ayan Chakrabarti, Ryan Julian, Christine Chan, Qiong Hu, Daniel Jarrett, Manu Agarwal, Jeshwanth Challagundla, Tao Li, Sandeep Tata, Wen Ding, Maya Meng, Zhuyun Dai, Giulia Vezzani, Shefali Garg, Jannis Bulian, Mary Jasarevic, Honglong Cai, Harish Rajamani, Adam Santoro, Florian Hartmann, Chen Liang, Bartek Perz, Apoorv Jindal, Fan Bu, Sungyong Seo, Ryan Poplin, Adrian Goedeckemeyer, Badih Ghazi, Nikhil Khadke, Leon Liu, Kevin Mather, Mingda Zhang, Ali Shah, Alex Chen, Jinliang Wei, Keshav Shivam, Yuan Cao, Donghyun Cho, Angelo Scorza Scarpati, Michael Moffitt, Clara Barbu, Ivan Jurin, Ming-Wei Chang, Hongbin Liu, Hao Zheng, Shachi Dave, Christine Kaeser-Chen, Xiaobin Yu, Alvin Abdagic, Lucas Gonzalez, Yanping Huang, Peilin Zhong, Cordelia Schmid, Bryce Petrini, Alex Wertheim, Jifan Zhu, Hoang Nguyen, Kaiyang Ji, Yanqi Zhou, Tao Zhou, Fangxiaoyu Feng, Regev Cohen, David Rim, Shubham Milind Phal, Petko Georgiev, Ariel Brand, Yue Ma, Wei Li, Somit Gupta, Chao Wang, Pavel Dubov, Jean Tarbouriech, Kingshuk Majumder, Huijian Li, Norman Rink, Apurv Suman, Yang Guo, Yinghao Sun, Arun Nair, Xiaowei Xu, Mohamed Elhawaty, Rodrigo Cabrera, Guangxing Han, Julian Eisenschlos, Junwen Bai, Yuqi Li, Yamini Bansal, Thibault Sellam, Mina Khan, Hung Nguyen, Justin Mao-Jones, Nikos Parotsidis, Jake Marcus, Cindy Fan, Roland Zimmermann, Yony Kochinski, Laura Graesser, Feryal Behbahani, Alvaro Caceres, Michael Riley, Patrick Kane, Sandra Lefdal, Rob Willoughby, Paul Vicol, Lun Wang, Shujian Zhang, Ashleah Gill, Yu Liang, Gautam Prasad, Soroosh Mariooryad, Mehran Kazemi, Zifeng Wang, Kritika Muralidharan, Paul Voigtlaender, Jeffrey Zhao, Huanjie Zhou, Nina D'Souza, Aditi Mavalankar, Séb Arnold, Nick Young, Obaid Sarvana, Chace Lee, Milad Nasr, Tingting Zou, Seokhwan Kim, Lukas Haas, Kaushal Patel, Neslihan Bulut, David Parkinson, Courtney Biles, Dmitry Kalashnikov, Chi Ming To, Aviral Kumar, Jessica Austin, Alex Greve, Lei Zhang, Megha Goel, Yeqing Li, Sergey Yaroshenko, Max Chang, Abhishek Jindal, Geoff Clark, Hagai Taitelbaum, Dale Johnson, Ofir Roval, Jeongwoo Ko, Anhad Mohananey, Christian Schuler, Shenil Dodhia, Ruichao Li, Kazuki Osawa, Claire Cui, Peng Xu, Rushin Shah, Tao Huang, Ela Gruzewska, Nathan Clement, Mudit Verma, Olcan Sercinoglu, Hai Qian, Viral Shah, Masa Yamaguchi, Abhinit Modi, Takahiro Kosakai, Thomas Strohmann, Junhao Zeng, Beliz Gunel, Jun Qian, Austin Tarango, Krzysztof JastrzÄbski, Robert David, Jyn Shan, Parker Schuh, Kunal Lad, Willi Gierke, Mukundan Madhavan, Xinyi Chen, Mark Kurzeja, Rebeca Santamaria-Fernandez, Dawn Chen, Alexandra Cordell, Yuri Chervonyi, Frankie Garcia, Nithish Kannen, Vincent Perot, Nan Ding, Shlomi Cohen-Ganor, Victor Lavrenko, Junru Wu, Georgie Evans, Cicero Nogueira dos Santos, Madhavi Sewak, Ashley Brown, Andrew Hard, Joan Puigcerver, Zeyu Zheng, Yizhong Liang, Evgeny Gladchenko, Reeve Ingle, Uri First, Pierre Sermanet, Charlotte Magister, Mihajlo VelimiroviÄ, Sashank Reddi, Susanna Ricco, Eirikur Agustsson, Hartwig Adam, Nir Levine, David Gaddy, Dan Holtmann-Rice, Xuanhui Wang, Ashutosh Sathe, Abhijit Guha Roy, Blaž BrataniÄ, Alen Carin, Harsh Mehta, Silvano Bonacina, Nicola De Cao, Mara Finkelstein, Verena Rieser, Xinyi Wu, Florent Altché, Dylan Scandinaro, Li Li, Nino Vieillard, Nikhil Sethi, Garrett Tanzer, Zhi Xing, Shibo Wang, Parul Bhatia, Gui Citovsky, Thomas Anthony, Sharon Lin, Tianze Shi, Shoshana Jakobovits, Gena Gibson, Raj Apte, Lisa Lee, Mingqing Chen, Arunkumar Byravan, Petros Maniatis, Kellie Webster, Andrew Dai, Pu-Chin Chen, Jiaqi Pan, Asya Fadeeva, Zach Gleicher, Thang Luong, Niket Kumar Bhumihar
Abstract:
In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.
中文: Gemini 2.X模型系列推出了具备顶尖编程、推理和多模态理解能力的Gemini 2.5 Pro,以及高效节能的Gemini 2.5 Flash等型号,全面覆盖从高性能到低成本的应用需求,为复杂智能体工作流提供完整解决方案。
English: The Gemini 2.X model family introduces Gemini 2.5 Pro, a state-of-the-art multimodal model with exceptional coding, reasoning, and long-context video processing capabilities, alongside the efficient Gemini 2.5 Flash and earlier cost-effective models, collectively covering the full spectrum of performance and cost for advanced agentic workflows.
Authors:Sana Kang, Myeongseok Gwon, Su Young Kwon, Jaewook Lee, Andrew Lan, Bhiksha Raj, Rita Singh
Abstract:
Vocabulary acquisition poses a significant challenge for second-language (L2) learners, especially when learning typologically distant languages such as English and Korean, where phonological and structural mismatches complicate vocabulary learning. Recently, large language models (LLMs) have been used to generate keyword mnemonics by leveraging similar keywords from a learner's first language (L1) to aid in acquiring L2 vocabulary. However, most of this research has focused on native English speakers learning other languages, rather than the reverse. In this paper, we present PhoniTale, a novel cross-lingual mnemonic generation system that retrieves L1 keyword sequence based on phonological similarity and uses LLMs to generate mnemonics. We evaluate PhoniTale using both automated metrics and human evaluations, comparing its output to mnemonics created by humans and by previous automated approaches. To assess practical effectiveness, we also conduct a short-term recall test measuring mnemonic helpfulness. Our findings show that PhoniTale performs comparably to human-authored mnemonics. We also highlight key areas for future improvement in mnemonic quality and methodology.
中文:PhoniTale是一种基于语音相似性利用大语言模型生成跨语言记忆辅助工具的新系统,研究表明其在帮助二语学习者克服语言结构差异掌握词汇方面,与人工编写的记忆法效果相当。
English: PhoniTale is a novel system that uses large language models to generate cross-lingual mnemonic aids based on phonological similarity, showing comparable effectiveness to human-authored mnemonics in helping second-language learners acquire vocabulary despite structural differences between languages.
Authors:Kaiqi Yang, Hang Li, Yucheng Chu, Ahreum Han, Yasemin Copur-Gencturk, Jiliang Tang, Hui Liu
Abstract:
Professional development (PD) serves as the cornerstone for teacher tutors to grasp content knowledge. However, providing equitable and timely PD opportunities for teachers poses significant challenges. To address this issue, we introduce I-VIP (Intelligent Virtual Interactive Program), an intelligent tutoring platform for teacher professional development, driven by large language models (LLMs) and supported by multi-agent frameworks. This platform offers a user-friendly conversational interface and allows users to employ a variety of interactive tools to facilitate question answering, knowledge comprehension, and reflective summarization while engaging in dialogue. To underpin the functionality of this platform, including knowledge expectation analysis, response scoring and classification, and feedback generation, the multi-agent frameworks are leveraged to enhance the accuracy of judgments and mitigate the issue of missing key points.
中文摘要:I-VIP是一个基于大语言模型和多智能体框架的智能辅导平台,通过对话界面和交互工具为教师提供公平的专业发展机会。
English Summary: I-VIP is an intelligent tutoring platform that uses large language models and multi-agent frameworks to provide equitable professional development for teachers through interactive tools and conversational interfaces.
Authors:Bohao Wang, Zitao Shuai, Fenghao Zhu, Chongwen Huang, Yongliang Shen, Zhaoyang Zhang, Qianqian Yang, Sami Muhaidat, Merouane Debbah
Abstract:
Multimodal fingerprinting is a crucial technique to sub-meter 6G integrated sensing and communications (ISAC) localization, but two hurdles block deployment: (i) the contribution each modality makes to the target position varies with the operating conditions such as carrier frequency, and (ii) spatial and fingerprint ambiguities markedly undermine localization accuracy, especially in non-line-of-sight (NLOS) scenarios. To solve these problems, we introduce SCADF-MoE, a spatial-context aware dynamic fusion network built on a soft mixture-of-experts backbone. SCADF-MoE first clusters neighboring points into short trajectories to inject explicit spatial context. Then, it adaptively fuses channel state information, angle of arrival profile, distance, and gain through its learnable MoE router, so that the most reliable cues dominate at each carrier band. The fused representation is fed to a modality-task MoE that simultaneously regresses the coordinates of every vertex in the trajectory and its centroid, thereby exploiting inter-point correlations. Finally, an auxiliary maximum-mean-discrepancy loss enforces expert diversity and mitigates gradient interference, stabilizing multi-task training. On three real urban layouts and three carrier bands (2.6, 6, 28 GHz), the model delivers consistent sub-meter MSE and halves unseen-NLOS error versus the best prior work. To our knowledge, this is the first work that leverages large-scale multimodal MoE for frequency-robust ISAC localization.
Chinese: SCADF-MoE是一种基于空间上下文感知的动态融合网络,通过专家混合框架自适应整合多模态数据,在多个载波频段实现亚米级定位精度,并显著降低了非视距场景下的定位误差。
English: SCADF-MoE is a spatial-context aware dynamic fusion network that adaptively integrates multimodal data through a mixture-of-experts framework, achieving sub-meter localization accuracy across multiple carrier frequencies and significantly reducing errors in non-line-of-sight scenarios.
Authors:Jingwei Zhao, Yuhua Wen, Qifei Li, Minchi Hu, Yingying Zhou, Jingyao Xue, Junyang Wu, Yingming Gao, Zhengqi Wen, Jianhua Tao, Ya Li
Abstract:
Intent recognition aims to identify users' underlying intentions, traditionally focusing on text in natural language processing. With growing demands for natural human-computer interaction, the field has evolved through deep learning and multimodal approaches, incorporating data from audio, vision, and physiological signals. Recently, the introduction of Transformer-based models has led to notable breakthroughs in this domain. This article surveys deep learning methods for intent recognition, covering the shift from unimodal to multimodal techniques, relevant datasets, methodologies, applications, and current challenges. It provides researchers with insights into the latest developments in multimodal intent recognition (MIR) and directions for future research.
中文摘要:本文综述了意图识别的深度学习方法,重点介绍了从单模态到多模态技术的转变,并探讨了相关数据集、应用及未来研究方向。
English Summary: This article surveys deep learning methods for intent recognition, highlighting the evolution from unimodal to multimodal approaches using Transformer-based models and discussing datasets, applications, and future research directions.
Authors:Mehrnoosh Mirtaheri, Ryan A. Rossi, Sungchul Kim, Kanak Mahadik, Tong Yu, Xiang Chen, Mohammad Rostami
Abstract:
Temporal Knowledge Graph (TKG) completion models traditionally assume access to the entire graph during training. This overlooks challenges stemming from the evolving nature of TKGs, such as: (i) the model's requirement to generalize and assimilate new knowledge, and (ii) the task of managing new or unseen entities that often have sparse connections. In this paper, we present an incremental training framework specifically designed for TKGs, aiming to address entities that are either not observed during training or have sparse connections. Our approach combines a model-agnostic enhancement layer with a weighted sampling strategy, that can be augmented to and improve any existing TKG completion method. The enhancement layer leverages a broader, global definition of entity similarity, which moves beyond mere local neighborhood proximity of GNN-based methods. The weighted sampling strategy employed in training accentuates edges linked to infrequently occurring entities. We evaluate our method on two benchmark datasets, and demonstrate that our framework outperforms existing methods in total link prediction, inductive link prediction, and in addressing long-tail entities. Notably, our method achieves a 10\% improvement and a 15\% boost in MRR for these datasets. The results underscore the potential of our approach in mitigating catastrophic forgetting and enhancing the robustness of TKG completion methods, especially in an incremental training context
中文: 本文提出了一种时序知识图谱的增量训练框架,通过结合全局实体相似性增强层和加权采样策略,有效提升了现有模型对新实体和稀疏关系的处理能力,在链接预测任务中取得了显著性能提升。
English: This paper introduces an incremental training framework for Temporal Knowledge Graphs that enhances existing completion methods by using a global entity similarity layer and weighted sampling to improve performance on new entities and sparse connections, achieving significant gains in link prediction metrics.
Authors:Ayce Idil Aytekin, Chuqiao Li, Diogo Luvizon, Rishabh Dabral, Martin Oswald, Marc Habermann, Christian Theobalt
Abstract:
Most monocular and physics-based human pose tracking methods, while achieving state-of-the-art results, suffer from artifacts when the scene does not have a strictly flat ground plane or when the camera is moving. Moreover, these methods are often evaluated on in-the-wild real world videos without ground-truth data or on synthetic datasets, which fail to model the real world light transport, camera motion, and pose-induced appearance and geometry changes. To tackle these two problems, we introduce MoviCam, the first non-synthetic dataset containing ground-truth camera trajectories of a dynamically moving monocular RGB camera, scene geometry, and 3D human motion with human-scene contact labels. Additionally, we propose PhysDynPose, a physics-based method that incorporates scene geometry and physical constraints for more accurate human motion tracking in case of camera motion and non-flat scenes. More precisely, we use a state-of-the-art kinematics estimator to obtain the human pose and a robust SLAM method to capture the dynamic camera trajectory, enabling the recovery of the human pose in the world frame. We then refine the kinematic pose estimate using our scene-aware physics optimizer. From our new benchmark, we found that even state-of-the-art methods struggle with this inherently challenging setting, i.e. a moving camera and non-planar environments, while our method robustly estimates both human and camera poses in world coordinates.
中文: 现有的人体姿态跟踪方法在移动相机和非平面场景中表现不佳,因此我们提出了MoviCam数据集和PhysDynPose方法,通过结合场景几何和物理约束,实现了更鲁棒的运动估计。
English: Most existing human pose tracking methods struggle with moving cameras and non-flat scenes, so we introduce the MoviCam dataset and PhysDynPose method to address these challenges by incorporating scene geometry and physical constraints for more robust motion estimation.
Authors:Chengxu Liu, Lu Qi, Jinshan Pan, Xueming Qian, Ming-Hsuan Yang
Abstract:
Since acquiring large amounts of realistic blurry-sharp image pairs is difficult and expensive, learning blind image deblurring from unpaired data is a more practical and promising solution. Unfortunately, dominant approaches rely heavily on adversarial learning to bridge the gap from blurry domains to sharp domains, ignoring the complex and unpredictable nature of real-world blur patterns. In this paper, we propose a novel diffusion model (DM)-based framework, dubbed \ours, for image deblurring by learning spatially varying texture prior from unpaired data. In particular, \ours performs DM to generate the prior knowledge that aids in recovering the textures of blurry images. To implement this, we propose a Texture Prior Encoder (TPE) that introduces a memory mechanism to represent the image textures and provides supervision for DM training. To fully exploit the generated texture priors, we present the Texture Transfer Transformer layer (TTformer), in which a novel Filter-Modulated Multi-head Self-Attention (FM-MSA) efficiently removes spatially varying blurring through adaptive filtering. Furthermore, we implement a wavelet-based adversarial loss to preserve high-frequency texture details. Extensive evaluations show that \ours provides a promising unsupervised deblurring solution and outperforms SOTA methods in widely-used benchmarks.
中文: 本文提出的基于扩散模型的框架 \ours 通过从未配对数据中学习空间变化的纹理先验,有效消除复杂的真实世界模糊,借助纹理先验编码器和纹理转换变换器等创新组件,在广泛使用的基准测试中超越了现有最优方法。
English: The proposed diffusion model-based framework, \ours, learns spatially varying texture priors from unpaired data to effectively remove complex real-world blur, outperforming state-of-the-art methods through novel components like the Texture Prior Encoder and Texture Transfer Transformer.
Authors:Quang Truong, Zhikai Chen, Mingxuan Ju, Tong Zhao, Neil Shah, Jiliang Tang
Abstract:
Relational databases underpin critical infrastructure across a wide range of domains, yet the design of generalizable pre-training strategies for learning from relational databases remains an open challenge due to task heterogeneity. Specifically, there exist infinitely many possible downstream tasks, as tasks are defined based on relational schema graphs, temporal dependencies, and SQL-defined label logics. An effective pre-training framework is desired to take these factors into account in order to obtain task-aware representations. By incorporating knowledge of the underlying distribution that drives label generation, downstream tasks can benefit from relevant side-channel information. To bridge this gap, we introduce Task Vector Estimation (TVE), a novel pre-training framework that constructs predictive supervisory signals via set-based aggregation over schema traversal graphs, explicitly modeling next-window relational dynamics. We formalize our approach through an information-theoretic lens, demonstrating that task-informed representations retain more relevant signals than those obtained without task priors. Extensive experiments on the RelBench benchmark show that TVE consistently outperforms traditional pre-training baselines. Our findings advocate for pre-training objectives that encode task heterogeneity and temporal structure as design principles for predictive modeling on relational databases.
中文:任务向量估计(TVE)框架通过基于模式遍历图的集合聚合显式建模任务异质性和时序动态,解决了关系数据库通用预训练的难题,在基准测试中持续优于传统预训练方法。
English: The Task Vector Estimation (TVE) framework addresses the challenge of generalizable pre-training for relational databases by explicitly modeling task heterogeneity and temporal dynamics through schema traversal graphs, outperforming traditional methods in benchmark evaluations.
Authors:Yu Xia, Yiran Shen, Junda Wu, Tong Yu, Sungchul Kim, Ryan A. Rossi, Lina Yao, Julian McAuley
Abstract:
Large Language Model (LLM) agents are commonly tuned with supervised finetuning on ReAct-style expert trajectories or preference optimization over pairwise rollouts. Most of these methods focus on imitating specific expert behaviors or promoting chosen reasoning thoughts and actions over rejected ones. However, without reasoning and comparing over alternatives actions, LLM agents finetuned with these methods may over-commit towards seemingly plausible but suboptimal actions due to limited action space exploration. To address this, in this paper we propose Self-taught ActioN Deliberation (SAND) framework, enabling LLM agents to explicitly deliberate over candidate actions before committing to one. To tackle the challenges of when and what to deliberate given large action space and step-level action evaluation, we incorporate self-consistency action sampling and execution-guided action critique to help synthesize step-wise action deliberation thoughts using the base model of the LLM agent. In an iterative manner, the deliberation trajectories are then used to finetune the LLM agent itself. Evaluating on two representative interactive agent tasks, SAND achieves an average 20% improvement over initial supervised finetuning and also outperforms state-of-the-art agent tuning approaches.
中文摘要:本文提出的SAND框架通过自洽行动采样和执行引导的行动评估,使大语言模型代理能在决策前对候选行动进行显式思考,相比现有方法实现了20%的性能提升。
English Summary: The proposed SAND framework enhances LLM agents by enabling explicit deliberation over candidate actions through self-consistency sampling and execution-guided critique, achieving a 20% performance improvement over baseline methods.
Authors:Jingyi Chai, Shuo Tang, Rui Ye, Yuwen Du, Xinyu Zhu, Mengcheng Zhou, Yanfeng Wang, Weinan E, Yuzhi Zhang, Linfeng Zhang, Siheng Chen
Abstract:
The rapid advancements of AI agents have ignited the long-held ambition of leveraging them to accelerate scientific discovery. Achieving this goal requires a deep understanding of the frontiers of human knowledge. As such, Humanity's Last Exam (HLE) provides an exceptionally challenging touchstone for evaluating scientific AI agents. In this work, we aim to construct the foundational architecture for general-purpose agents and validate the capabilities through leading performance on HLE. To achieve this, we introduce X-Master, a tool-augmented reasoning agent designed to emulate human researchers by interacting flexibly with external tools during its reasoning process. This agent, guided by the conceptualization of code as an interaction language, can flexibly leverage built-in Python libraries and our customized tools to augment the reasoning. We further scale its capabilities through X-Masters, a scattered-and-stacked agentic workflow that systematically enhances breadth and depth of reasoning. Our open-source solution, X-Masters, sets a new state-of-the-art record on HLE with a score of 32.1%, surpassing OpenAI's and Google's Deep Research (26.6% and 26.9%) and becoming the first to exceed the 30% threshold. This work allows us to gain a deeper understanding of complex task-solving and accumulates valuable experience that can inform future advancements, guiding subsequent model training.
中文: 本文提出的X-Master工具增强推理智能体在"人类终极考试"基准测试中创下最新记录,通过灵活运用外部工具和先进推理能力,展现了加速科学发现的潜力。
English: This paper introduces X-Master, a tool-augmented reasoning agent that sets a new state-of-the-art record on the Humanity's Last Exam benchmark, demonstrating its potential to accelerate scientific discovery through flexible interaction with external tools and advanced reasoning capabilities.
Authors:Runcong Zhao, Qinglin Zhu, Hainiu Xu, Bin Liang, Lin Gui, Yulan He
Abstract:
Understanding character relationships is essential for interpreting complex narratives and conducting socially grounded AI research. However, manual annotation is time-consuming and low in coverage, while large language models (LLMs) often produce hallucinated or logically inconsistent outputs. We present SymbolicThought, a human-in-the-loop framework that combines LLM-based extraction with symbolic reasoning. The system constructs editable character relationship graphs, refines them using seven types of logical constraints, and enables real-time validation and conflict resolution through an interactive interface. To support logical supervision and explainable social analysis, we release a dataset of 160 interpersonal relationships with corresponding logical structures. Experiments show that SymbolicThought improves annotation accuracy and consistency while significantly reducing time cost, offering a practical tool for narrative understanding, explainable AI, and LLM evaluation.
Chinese: SymbolicThought是一种人机协同框架,结合大语言模型提取与符号推理,构建可编辑的角色关系图,通过逻辑约束和交互界面显著提升叙事理解与可解释AI的标注准确性和效率。
English: SymbolicThought is a human-in-the-loop framework that integrates LLM-based extraction with symbolic reasoning to construct editable character relationship graphs, improving annotation accuracy and efficiency for narrative understanding and explainable AI.
Authors:Tiantian Feng, Brandon M Booth, Karel Mundnich, Emily Zhou, Benjamin Girault, Kristina Lerman, Shrikanth Narayanan
Abstract:
Sleep is important for everyday functioning, overall well-being, and quality of life. Recent advances in wearable sensing technology have enabled continuous, noninvasive, and cost-effective monitoring of sleep patterns in real-world natural living settings. Wrist-worn devices, in particular, are capable of tracking sleep patterns using accelerometers and heart rate sensors. To support sleep research in naturalistic environments using wearable sensors, we introduce the TILES-2018 Sleep Benchmark dataset, which we make publicly available to the research community. This dataset was collected over a 10-week period from 139 hospital employees and includes over 6,000 unique sleep recordings, alongside self-reported survey data from each participant, which includes sleep quality, stress, and anxiety among other measurements. We present in-depth analyses of sleep patterns by combining the TILES-2018 Sleep Benchmark dataset with a previously released dataset (TILES-2018), which follows a similar study protocol. Our analyses include sleep duration, sleep stages, and sleep diaries. Moreover, we report machine learning benchmarks using this dataset as a testbed for tasks including sleep stage classification, prediction of self-reported sleep quality, and classifying demographics. Overall, this dataset provides a valuable resource for advancing foundational studies in sleep behavior modeling.
中文:TILES-2018睡眠基准数据集通过可穿戴传感器支持真实环境下的睡眠研究,提供超过6000条睡眠记录和调查数据,以推动睡眠行为建模与机器学习应用的发展。
English: The TILES-2018 Sleep Benchmark dataset enables real-world sleep research through wearable sensors, offering over 6,000 sleep recordings and survey data to advance sleep behavior modeling and machine learning applications.
Authors:Bo Yang, Ruihuai Liang, Weixin Li, Han Wang, Xuelin Cao, Zhiwen Yu, Samson Lasaulce, Mérouane Debbah, Mohamed-Slim Alouini, H. Vincent Poor, Chau Yuen
Abstract:
While interest in the application of generative AI (GenAI) in network optimization has surged in recent years, its rapid progress has often overshadowed critical limitations intrinsic to generative models that remain insufficiently examined in existing literature. This survey provides a comprehensive review and critical analysis of GenAI in network optimization. We focus on the two dominant paradigms of GenAI including generative diffusion models (GDMs) and large pre-trained models (LPTMs), and organize our discussion around a categorization we introduce, dividing network optimization problems into two primary formulations: one-shot optimization and Markov decision process (MDP). We first trace key works, including foundational contributions from the AI community, and categorize current efforts in network optimization. We also review frontier applications of GDMs and LPTMs in other networking tasks, providing additional context. Furthermore, we present theoretical generalization bounds for GDMs in both one-shot and MDP settings, offering insights into the fundamental factors affecting model performance. Most importantly, we reflect on the overestimated perception of GenAI's general capabilities and caution against the all-in-one illusion it may convey. We highlight critical limitations, including difficulties in constraint satisfying, limited concept understanding, and the inherent probabilistic nature of outputs. We also propose key future directions, such as bridging the gap between generation and optimization. Although they are increasingly integrated in implementations, they differ fundamentally in both objectives and underlying mechanisms, necessitating a deeper understanding of their theoretical connections. Ultimately, this survey aims to provide a structured overview and a deeper insight into the strengths, limitations, and potential of GenAI in network optimization.
中文: 本综述批判性审视生成式AI在网络优化中的应用,揭示其被高估的能力及约束满足等核心局限,并提出弥合生成与优化差距的未来研究方向。
English: This survey critically reviews generative AI's application in network optimization, highlighting its overestimated capabilities and key limitations like constraint satisfaction issues, while proposing future directions to bridge generation and optimization gaps.
Authors:Chengxu Liu, Lu Qi, Jinshan Pan, Xueming Qian, Ming-Hsuan Yang
Abstract:
Unpaired image dehazing has attracted increasing attention due to its flexible data requirements during model training. Dominant methods based on contrastive learning not only introduce haze-unrelated content information, but also ignore haze-specific properties in the frequency domain (\ie,~haze-related degradation is mainly manifested in the amplitude spectrum). To address these issues, we propose a novel frequency domain-based diffusion model, named \ours, for fully exploiting the beneficial knowledge in unpaired clear data. In particular, inspired by the strong generative ability shown by Diffusion Models (DMs), we tackle the dehazing task from the perspective of frequency domain reconstruction and perform the DMs to yield the amplitude spectrum consistent with the distribution of clear images. To implement it, we propose an Amplitude Residual Encoder (ARE) to extract the amplitude residuals, which effectively compensates for the amplitude gap from the hazy to clear domains, as well as provide supervision for the DMs training. In addition, we propose a Phase Correction Module (PCM) to eliminate artifacts by further refining the phase spectrum during dehazing with a simple attention mechanism. Experimental results demonstrate that our \ours outperforms other state-of-the-art methods on both synthetic and real-world datasets.
中文: 本文提出了一种基于频域的新型扩散模型,通过幅度谱重建和相位校正有效解决非配对图像去雾问题,在多个数据集上优于现有方法。
English: This paper introduces a novel frequency domain-based diffusion model that leverages amplitude spectrum reconstruction and phase correction to effectively address unpaired image dehazing, outperforming existing methods on various datasets.
Authors:Ruoyu Wang, Junda Wu, Yu Xia, Tong Yu, Ryan A. Rossi, Julian McAuley, Lina Yao
Abstract:
Large language model-based agents, empowered by in-context learning (ICL), have demonstrated strong capabilities in complex reasoning and tool-use tasks. However, existing works have shown that the effectiveness of ICL is highly sensitive to the choice of demonstrations, with suboptimal examples often leading to unstable or degraded performance. While prior work has explored example selection, including in some agentic or multi-step settings, existing approaches typically rely on heuristics or task-specific designs and lack a general, theoretically grounded criterion for what constitutes an effective demonstration across reasoning steps. Therefore, it is non-trivial to develop a principled, general-purpose method for selecting demonstrations that consistently benefit agent performance. In this paper, we address this challenge with DICE, Dynamic In-Context Example Selection for LLM Agents, a theoretically grounded ICL framework for agentic tasks that selects the most relevant demonstrations at each step of reasoning. Our approach decomposes demonstration knowledge into transferable and non-transferable components through a causal lens, showing how the latter can introduce spurious dependencies that impair generalization. We further propose a stepwise selection criterion with a formal guarantee of improved agent performance. Importantly, DICE is a general, framework-agnostic solution that can be integrated as a plug-in module into existing agentic frameworks without any additional training cost. Extensive experiments across diverse domains demonstrate our method's effectiveness and generality, highlighting the importance of principled, context-aware demo selection for robust and efficient LLM agents.
中文摘要:本文提出DICE框架,通过因果视角将示范知识分解为可迁移与不可迁移成分,在推理的每一步动态选择最相关示例,从而提升大语言模型代理的性能并具备理论保证。
English Summary: This paper introduces DICE, a theoretically grounded framework that dynamically selects the most relevant demonstrations at each reasoning step to enhance LLM agent performance by decomposing knowledge into transferable and non-transferable components.
Authors:Patrick Iff, Paul Bruegger, Marcin Chrapek, Maciej Besta, Torsten Hoefler
Abstract:
Advances in embedding models for text, image, audio, and video drive progress across multiple domains, including retrieval-augmented generation, recommendation systems, vehicle/person reidentification, and face recognition. Many applications in these domains require an efficient method to retrieve items that are close to a given query in the embedding space while satisfying a filter condition based on the item's attributes, a problem known as Filtered Approximate Nearest Neighbor Search (FANNS). In this work, we present a comprehensive survey and taxonomy of FANNS methods and analyze how they are benchmarked in the literature. By doing so, we identify a key challenge in the current FANNS landscape: the lack of diverse and realistic datasets, particularly ones derived from the latest transformer-based text embedding models. To address this, we introduce a novel dataset consisting of embedding vectors for the abstracts of over 2.7 million research articles from the arXiv repository, accompanied by 11 real-world attributes such as authors and categories. We benchmark a wide range of FANNS methods on our novel dataset and find that each method has distinct strengths and limitations; no single approach performs best across all scenarios. ACORN, for example, supports various filter types and performs reliably across dataset scales but is often outperformed by more specialized methods. SeRF shows excellent performance for range filtering on ordered attributes but cannot handle categorical attributes. Filtered-DiskANN and UNG excel on the medium-scale dataset but fail on the large-scale dataset, highlighting the challenge posed by transformer-based embeddings, which are often more than an order of magnitude larger than earlier embeddings. We conclude that no universally best method exists.
嵌入模型的进步推动了检索和识别任务的发展,但针对过滤近似最近邻搜索(FANNS)缺乏多样化数据集的问题,本研究引入了基于arXiv的新型数据集,发现没有单一的FANNS方法能在所有场景中普遍最优。
Embedding model advancements enhance retrieval and recognition tasks, yet the lack of diverse datasets for Filtered Approximate Nearest Neighbor Search (FANNS) is addressed by introducing a novel arXiv-based dataset, revealing that no single FANNS method excels universally across all scenarios.
Authors:Yuman Gao, Ruibin Zhang, Tiancheng Lai, Yanjun Cao, Chao Xu, Fei Gao
Abstract:
Terrestrial-aerial bimodal vehicles, which integrate the high mobility of aerial robots with the long endurance of ground robots, offer significant potential for autonomous exploration. Given the inherent energy and time constraints in practical exploration tasks, we present a hierarchical framework for the bimodal vehicle to utilize its flexible locomotion modalities for exploration. Beginning with extracting environmental information to identify informative regions, we generate a set of potential bimodal viewpoints. To adaptively manage energy and time constraints, we introduce an extended Monte Carlo Tree Search approach that strategically optimizes both modality selection and viewpoint sequencing. Combined with an improved bimodal vehicle motion planner, we present a complete bimodal energy- and time-aware exploration system. Extensive simulations and deployment on a customized real-world platform demonstrate the effectiveness of our system.
中文摘要:本研究提出一种分层框架,通过环境信息提取和扩展蒙特卡洛树搜索方法,实现陆空双模机器人在能量与时间约束下的自主探索,有效优化运动模式选择与路径规划。
English Summary: This study introduces a hierarchical framework for terrestrial-aerial bimodal vehicles that uses environmental analysis and an extended Monte Carlo Tree Search to optimize energy-efficient exploration through adaptive modality selection and motion planning.
Authors:Weisong Zhao, Jingkai Zhou, Xiangyu Zhu, Weihua Chen, Xiao-Yu Zhang, Zhen Lei, Fan Wang
Abstract:
Video Super-Resolution (VSR) has achieved significant progress through diffusion models, effectively addressing the over-smoothing issues inherent in GAN-based methods. Despite recent advances, three critical challenges persist in VSR community: 1) Inconsistent modeling of temporal dynamics in foundational models; 2) limited high-frequency detail recovery under complex real-world degradations; and 3) insufficient evaluation of detail enhancement and 4K super-resolution, as current methods primarily rely on 720P datasets with inadequate details. To address these challenges, we propose RealisVSR, a high-frequency detail-enhanced video diffusion model with three core innovations: 1) Consistency Preserved ControlNet (CPC) architecture integrated with the Wan2.1 video diffusion to model the smooth and complex motions and suppress artifacts; 2) High-Frequency Rectified Diffusion Loss (HR-Loss) combining wavelet decomposition and HOG feature constraints for texture restoration; 3) RealisVideo-4K, the first public 4K VSR benchmark containing 1,000 high-definition video-text pairs. Leveraging the advanced spatio-temporal guidance of Wan2.1, our method requires only 5-25% of the training data volume compared to existing approaches. Extensive experiments on VSR benchmarks (REDS, SPMCS, UDM10, YouTube-HQ, VideoLQ, RealisVideo-720P) demonstrate our superiority, particularly in ultra-high-resolution scenarios.
中文: 视频超分辨率通过扩散模型取得进展,RealisVSR解决了时间动态建模不一致、高频细节恢复不足的问题,并引入首个4K基准,以少量训练数据实现优越性能。
English: Video Super-Resolution has advanced with diffusion models like RealisVSR, which tackles temporal inconsistencies, enhances high-frequency details, and introduces a 4K benchmark to improve performance with minimal training data.
Authors:Qiong Wu, Yu Xie, Pingyi Fan, Dong Qin, Kezhi Wang, Nan Cheng, Khaled B. Letaief
Abstract:
In this paper, we propose a general digital twin edge computing network comprising multiple vehicles and a server. Each vehicle generates multiple computing tasks within a time slot, leading to queuing challenges when offloading tasks to the server. The study investigates task offloading strategies, queue stability, and resource allocation. Lyapunov optimization is employed to transform long-term constraints into tractable short-term decisions. To solve the resulting problem, an in-context learning approach based on large language model (LLM) is adopted, replacing the conventional multi-agent reinforcement learning (MARL) framework. Experimental results demonstrate that the LLM-based method achieves comparable or even superior performance to MARL.
中文: 本研究提出了一种数字孪生边缘计算网络,利用李雅普诺夫优化和基于大语言模型的上下文学习方法处理任务卸载与资源分配,其性能达到甚至超越了传统的多智能体强化学习。
English: This study introduces a digital twin edge computing network where vehicles generate tasks, and employs Lyapunov optimization and an LLM-based in-context learning approach to manage task offloading and resource allocation, achieving performance comparable to or better than traditional multi-agent reinforcement learning.
Authors:Xiao Xu, Qiong Wu, Pingyi Fan, Kezhi Wang, Nan Cheng, Wen Chen, Khaled B. Letaief
Abstract:
In this paper, we consider the fair access problem and the Age of Information (AoI) under 5G New Radio (NR) Vehicle-to-Infrastructure (V2I) Mode 2 in vehicular networks. Specifically, vehicles follow Mode 2 to communicate with Roadside Units (RSUs) to obtain accurate data for driving assistance.Nevertheless, vehicles often have different velocity when they are moving in adjacent lanes, leading to difference in RSU dwelltime and communication duration. This results in unfair access to network resources, potentially influencing driving safety. To ensure the freshness of received data, the AoI should be analyzed. Mode 2 introduces a novel preemption mechanism, necessitating simultaneous optimization of fair access and AoI to guarantee timely and relevant data delivery. We propose a joint optimization framework for vehicular network, defining a fairness index and employing Stochastic Hybrid Systems (SHS) to model AoI under preemption mechanism. By adaptively adjusting the selection window of Semi-Persistent Scheduling (SPS) in Mode 2, we address the optimization of fairness and AoI. We apply a large language model (LLM)-Based Multi-objective Evolutionary Algorithm Based on Decomposition (MOEA/D) to solve this problem. Simulation results demonstrate the effectiveness of our scheme in balancing fair access and minimizing AoI.
中文: 本文针对5G NR V2I模式2下车联网的公平接入和信息年龄优化问题,提出通过自适应调整半持续调度选择窗口的联合优化框架,采用基于大语言模型的多目标进化算法实现资源公平性与数据新鲜度的平衡。
English: This paper addresses fair access and Age of Information (AoI) optimization in 5G NR V2I Mode 2 vehicular networks by proposing a joint framework that adaptively adjusts SPS selection windows using an LLM-based MOEA/D algorithm to balance resource fairness and data freshness.
Authors:Xiaomei Zhang, Hanyu Zheng, Xiangyu Zhu, Jinghuan Wei, Junhong Zou, Zhen Lei, Zhaoxiang Zhang
Abstract:
Large Vision-Language Models (LVLMs) have shown impressive capabilities across a range of tasks that integrate visual and textual understanding, such as image captioning and visual question answering. These models are trained on large-scale image and video datasets paired with text, enabling them to bridge visual perception and natural language processing. However, their application to scientific domains, especially in interpreting complex field data commonly used in the natural sciences, remains underexplored. In this work, we introduce FieldLVLM, a novel framework designed to improve large vision-language models' understanding of field data. FieldLVLM consists of two main components: a field-aware language generation strategy and a data-compressed multimodal model tuning. The field-aware language generation strategy leverages a special-purpose machine learning pipeline to extract key physical features from field data, such as flow classification, Reynolds number, and vortex patterns. This information is then converted into structured textual descriptions that serve as a dataset. The data-compressed multimodal model tuning focuses on LVLMs with these generated datasets, using a data compression strategy to reduce the complexity of field inputs and retain only the most informative values. This ensures compatibility with the models language decoder and guides its learning more effectively. Experimental results on newly proposed benchmark datasets demonstrate that FieldLVLM significantly outperforms existing methods in tasks involving scientific field data. Our findings suggest that this approach opens up new possibilities for applying large vision-language models to scientific research, helping bridge the gap between large models and domain-specific discovery.
大型视觉语言模型在多模态任务中表现出色,但在科学领域数据处理上存在局限,为此提出的FieldLVLM框架通过领域感知语言生成和数据压缩调优,显著超越现有方法,为科学研究与人工智能的结合开辟了新途径。
Large Vision-Language Models show strong multimodal capabilities but struggle with scientific field data, leading to the development of FieldLVLM, which enhances understanding through field-aware language generation and data-compressed tuning, significantly outperforming existing methods and bridging AI with scientific discovery.
Authors:Ramaneswaran Selvakumar, Ashish Seth, Nishit Anand, Utkarsh Tyagi, Sonal Kumar, Sreyan Ghosh, Dinesh Manocha
Abstract:
The rapid progress of Large Language Models (LLMs) has empowered omni models to act as voice assistants capable of understanding spoken dialogues. These models can process multimodal inputs beyond text, such as speech and visual data, enabling more context-aware interactions. However, current benchmarks fall short in comprehensively evaluating how well these models generate context-aware responses, particularly when it comes to implicitly understanding fine-grained speech characteristics, such as pitch, emotion, timbre, and volume or the environmental acoustic context such as background sounds. Additionally, they inadequately assess the ability of models to align paralinguistic cues with complementary visual signals to inform their responses. To address these gaps, we introduce MultiVox, the first omni voice assistant benchmark designed to evaluate the ability of voice assistants to integrate spoken and visual cues including paralinguistic speech features for truly multimodal understanding. Specifically, MultiVox includes 1000 human-annotated and recorded speech dialogues that encompass diverse paralinguistic features and a range of visual cues such as images and videos. Our evaluation on 9 state-of-the-art models reveals that, although humans excel at these tasks, current models consistently struggle to produce contextually grounded responses.
中文摘要:MultiVox基准测试旨在解决现有评估在衡量语音助手整合副语言特征和视觉线索以生成情境感知响应能力方面的不足,研究发现尽管人类擅长此类任务,现有模型仍难以实现真正的多模态理解。
English Summary: The MultiVox benchmark is introduced to address the limitations of current evaluations in assessing voice assistants' ability to integrate paralinguistic speech features and visual cues for context-aware responses, revealing that existing models struggle with these multimodal tasks despite human proficiency.
Authors:Ramaneswaran Selvakumar, Ashish Seth, Nishit Anand, Utkarsh Tyagi, Sonal Kumar, Sreyan Ghosh, Dinesh Manocha
Abstract:
The rapid progress of Large Language Models (LLMs) has empowered omni models to act as voice assistants capable of understanding spoken dialogues. These models can process multimodal inputs beyond text, such as speech and visual data, enabling more context-aware interactions. However, current benchmarks fall short in comprehensively evaluating how well these models generate context-aware responses, particularly when it comes to implicitly understanding fine-grained speech characteristics, such as pitch, emotion, timbre, and volume or the environmental acoustic context such as background sounds. Additionally, they inadequately assess the ability of models to align paralinguistic cues with complementary visual signals to inform their responses. To address these gaps, we introduce MultiVox, the first omni voice assistant benchmark designed to evaluate the ability of voice assistants to integrate spoken and visual cues including paralinguistic speech features for truly multimodal understanding. Specifically, MultiVox includes 1000 human-annotated and recorded speech dialogues that encompass diverse paralinguistic features and a range of visual cues such as images and videos. Our evaluation on 10 state-of-the-art models reveals that, although humans excel at these tasks, current models consistently struggle to produce contextually grounded responses.
中文摘要:MultiVox基准测试旨在解决现有评估在衡量语音助手整合副语言特征和视觉线索以生成情境感知响应能力方面的不足,研究发现尽管人类擅长此类任务,现有模型仍难以实现真正的多模态理解。
English Summary: The MultiVox benchmark is introduced to address the limitations of current evaluations in assessing voice assistants' ability to integrate paralinguistic speech features and visual cues for context-aware responses, revealing that existing models struggle with these multimodal tasks despite human proficiency.
Authors:Benzhao Tang, Shiyu Yang, Zhitao Shen, Wenjie Zhang, Xuemin Lin, Zhihong Tian
Abstract:
Log data is a vital resource for capturing system events and states. With the increasing complexity and widespread adoption ofmodern software systems and IoT devices, the daily volume of log generation has surged to tens of petabytes, leading to significant collection and storage costs. To address this challenge, lossless log compression has emerged as an effective solution, enabling substantial resource savings without compromising log information. In this paper, we first conduct a characterization study on extensive public log datasets and identify four key observations. Building on these insights, we propose LogLite, a lightweight, plug-and-play, streaming lossless compression algorithm designed to handle both TEXT and JSON logs throughout their life cycle. LogLite requires no predefined rules or pre-training and is inherently adaptable to evolving log structures. Our evaluation shows that, compared to state-of-the-art baselines, LogLite achieves Pareto optimality in most scenarios, delivering an average improvement of up to 67.8% in compression ratio and up to 2.7 $\times$ in compression speed.
Chinese: LogLite是一种轻量级、即插即用的流式无损压缩算法,无需预定义规则或预训练,即可显著提升日志数据的压缩比和压缩速度。
English: LogLite is a lightweight, plug-and-play, streaming lossless compression algorithm that achieves significant improvements in compression ratio and speed for log data without requiring predefined rules or pre-training.
Authors:Jiajun Yu, Nanhe Chen, Guodong Liu, Chao Xu, Fei Gao, Yanjun Cao
Abstract:
Optimization has been widely used to generate smooth trajectories for motion planning. However, existing trajectory optimization methods show weakness when dealing with large-scale long trajectories. Recent advances in parallel computing have accelerated optimization in some fields, but how to efficiently solve trajectory optimization via parallelism remains an open question. In this paper, we propose a novel trajectory optimization framework based on the Consensus Alternating Direction Method of Multipliers (CADMM) algorithm, which decomposes the trajectory into multiple segments and solves the subproblems in parallel. The proposed framework reduces the time complexity to O(1) per iteration to the number of segments, compared to O(N) of the state-of-the-art (SOTA) approaches. Furthermore, we introduce a closed-form solution that integrates convex linear and quadratic constraints to speed up the optimization, and we also present numerical solutions for general inequality constraints. A series of simulations and experiments demonstrate that our approach outperforms the SOTA approach in terms of efficiency and smoothness. Especially for a large-scale trajectory, with one hundred segments, achieving over a tenfold speedup. To fully explore the potential of our algorithm on modern parallel computing architectures, we deploy our framework on a GPU and show high performance with thousands of segments.
中文: 本文提出了一种基于CADMM算法的新型轨迹优化框架,通过并行处理轨迹段将每次迭代的时间复杂度降至O(1),相比现有方法在大规模轨迹上实现了超过十倍的加速效果。
English: This paper introduces a novel trajectory optimization framework using the CADMM algorithm that enables parallel processing of trajectory segments, achieving O(1) time complexity per iteration and demonstrating over tenfold speedup for large-scale trajectories compared to existing methods.
Authors:Arushi Goel, Sreyan Ghosh, Jaehyeon Kim, Sonal Kumar, Zhifeng Kong, Sang-gil Lee, Chao-Han Huck Yang, Ramani Duraiswami, Dinesh Manocha, Rafael Valle, Bryan Catanzaro
Abstract:
We present Audio Flamingo 3 (AF3), a fully open state-of-the-art (SOTA) large audio-language model that advances reasoning and understanding across speech, sound, and music. AF3 introduces: (i) AF-Whisper, a unified audio encoder trained using a novel strategy for joint representation learning across all 3 modalities of speech, sound, and music; (ii) flexible, on-demand thinking, allowing the model to do chain-of-thought-type reasoning before answering; (iii) multi-turn, multi-audio chat; (iv) long audio understanding and reasoning (including speech) up to 10 minutes; and (v) voice-to-voice interaction. To enable these capabilities, we propose several large-scale training datasets curated using novel strategies, including AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat, and train AF3 with a novel five-stage curriculum-based training strategy. Trained on only open-source audio data, AF3 achieves new SOTA results on over 20+ (long) audio understanding and reasoning benchmarks, surpassing both open-weight and closed-source models trained on much larger datasets.
Chinese: Audio Flamingo 3 (AF3) 是一个完全开源的大型音频语言模型,在语音、声音和音乐的推理与理解方面实现了最先进的性能,仅使用开源数据就在超过20个基准测试中超越了现有模型。
English: Audio Flamingo 3 (AF3) is a fully open-source large audio-language model that achieves state-of-the-art performance in reasoning and understanding across speech, sound, and music, surpassing existing models on over 20 benchmarks despite using only open-source data.
Authors:Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, Xinqiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, Fan Lu, He Wang, Zhizheng Zhang, Li Yi, Wenjun Zeng, Xin Jin
Abstract:
Recent advances in vision-language-action (VLA) models have shown promise in integrating image generation with action prediction to improve generalization and reasoning in robot manipulation. However, existing methods are limited to challenging image-based forecasting, which suffers from redundant information and lacks comprehensive and critical world knowledge, including dynamic, spatial and semantic information. To address these limitations, we propose DreamVLA, a novel VLA framework that integrates comprehensive world knowledge forecasting to enable inverse dynamics modeling, thereby establishing a perception-prediction-action loop for manipulation tasks. Specifically, DreamVLA introduces a dynamic-region-guided world knowledge prediction, integrated with the spatial and semantic cues, which provide compact yet comprehensive representations for action planning. This design aligns with how humans interact with the world by first forming abstract multimodal reasoning chains before acting. To mitigate interference among the dynamic, spatial and semantic information during training, we adopt a block-wise structured attention mechanism that masks their mutual attention, preventing information leakage and keeping each representation clean and disentangled. Moreover, to model the conditional distribution over future actions, we employ a diffusion-based transformer that disentangles action representations from shared latent features. Extensive experiments on both real-world and simulation environments demonstrate that DreamVLA achieves 76.7% success rate on real robot tasks and 4.44 average length on the CALVIN ABC-D benchmarks.
Chinese: DreamVLA是一种新型视觉-语言-动作框架,通过整合全面的世界知识预测和块状注意力机制,建立感知-预测-动作循环以提升机器人操作性能,在真实任务中取得了76.7%的成功率。
English: DreamVLA is a novel vision-language-action framework that integrates comprehensive world knowledge forecasting with a block-wise attention mechanism to enhance robot manipulation by establishing a perception-prediction-action loop, achieving a 76.7% success rate in real-world tasks.
Authors:Yingxuan Yang, Ying Wen, Jun Wang, Weinan Zhang
Abstract:
The rise of Large Language Models (LLMs) has transformed AI agents from passive computational tools into autonomous economic actors. This shift marks the emergence of the agent-centric economy, in which agents take on active economic roles-exchanging value, making strategic decisions, and coordinating actions with minimal human oversight. To realize this vision, we propose Agent Exchange (AEX), a specialized auction platform designed to support the dynamics of the AI agent marketplace. AEX offers an optimized infrastructure for agent coordination and economic participation. Inspired by Real-Time Bidding (RTB) systems in online advertising, AEX serves as the central auction engine, facilitating interactions among four ecosystem components: the User-Side Platform (USP), which translates human goals into agent-executable tasks; the Agent-Side Platform (ASP), responsible for capability representation, performance tracking, and optimization; Agent Hubs, which coordinate agent teams and participate in AEX-hosted auctions; and the Data Management Platform (DMP), ensuring secure knowledge sharing and fair value attribution. We outline the design principles and system architecture of AEX, laying the groundwork for agent-based economic infrastructure in future AI ecosystems.
Chinese: 以智能体为中心的经济模式使AI智能体成为自主经济参与者,为此我们提出Agent Exchange(AEX)这一专门拍卖平台,通过借鉴实时竞价系统构建了包含用户平台、智能体平台、智能体枢纽和数据管理平台的完整架构,为未来AI经济生态奠定基础。
English: The emergence of the agent-centric economy has transformed AI agents into autonomous economic actors, leading to the proposal of Agent Exchange (AEX), a specialized auction platform designed to optimize agent coordination and economic participation through a system inspired by Real-Time Bidding.
Authors:Patrik Okanovic, Sameer Deshmukh, Grzegorz Kwasniewski, Kentaro Katayama, Takumi Honda, Maciej Besta, Torsten Hoefler
Abstract:
The energy consumption of large-scale ML models is dominated by data movement - shuffling billions of parameters across memory hierarchies and data centers. Effective sparsification to prune redundant parameters is still challenging: existing methods incur significant accuracy degradation, performance overhead, or both. We introduce (Bl)ock (a)nd (S)parse (T)ransformers (BLaST), a general, robust, and reliable sparsification method applicable to linear layers in all settings. Our method iteratively sparsifies weight matrices into a block sparsity pattern suitable for efficient sparse matrix-matrix (SpMM) multiplication. BLaST achieves up to 95% sparsity in MLP weights with negligible accuracy loss. Our fused, highly optimized Sparse MLP kernel delivers up to 16.7x speedup over dense MLPs across 9 architectures and 8 datasets, resulting in up to 1.6x inference speedup, 1.11x pretraining speedup and up to 3.12x inference memory usage reduction. BLaST enables the next generation of large-scale AI systems by reducing energy use, memory footprint, and latency.
Chinese: BLaST提出了一种鲁棒的稀疏化方法,可在MLP权重中实现高达95%的稀疏度且精度损失可忽略,为大规模AI系统带来显著的加速和内存优化。
English: BLaST introduces a robust sparsification method that achieves up to 95% sparsity in MLP weights with minimal accuracy loss, delivering significant speedups and memory reductions for large-scale AI systems.
Authors:Tianning Chai, Chancharik Mitra, Brandon Huang, Gautam Rajendrakumar Gare, Zhiqiu Lin, Assaf Arbelle, Leonid Karlinsky, Rogerio Feris, Trevor Darrell, Deva Ramanan, Roei Herzig
Abstract:
Aligning Large Language Models (LLMs) and Large Multimodal Models (LMMs) to human preferences is a central challenge in improving the quality of the models' generative outputs for real-world applications. A common approach is to use reward modeling to encode preferences, enabling alignment via post-training using reinforcement learning. However, traditional reward modeling is not easily adaptable to new preferences because it requires a separate reward model, commonly trained on large preference datasets. To address this, we introduce Activation Reward Models (Activation RMs) -- a novel few-shot reward modeling method that leverages activation steering to construct well-aligned reward signals using minimal supervision and no additional model finetuning. Activation RMs outperform existing few-shot reward modeling approaches such as LLM-as-a-judge with in-context learning, voting-based scoring, and token probability scoring on standard reward modeling benchmarks. Furthermore, we demonstrate the effectiveness of Activation RMs in mitigating reward hacking behaviors, highlighting their utility for safety-critical applications. Toward this end, we propose PreferenceHack, a novel few-shot setting benchmark, the first to test reward models on reward hacking in a paired preference format. Finally, we show that Activation RM achieves state-of-the-art performance on this benchmark, surpassing even GPT-4o.
Chinese: 激活奖励模型(Activation RMs)提出了一种新颖的小样本奖励建模方法,通过激活引导以最少监督构建对齐的奖励信号,在基准测试中超越现有方法并有效抑制奖励破解行为,实现了最先进的性能。
English: Activation Reward Models (Activation RMs) introduce a novel few-shot method that uses activation steering to create aligned reward signals with minimal supervision, outperforming existing approaches and mitigating reward hacking while achieving state-of-the-art results on benchmarks.
Authors:Ryandhimas E. Zezario, Sabato M. Siniscalchi, Fei Chen, Hsin-Min Wang, Yu Tsao
Abstract:
Given the critical role of non-intrusive speech intelligibility assessment in hearing aids (HA), this paper enhances its performance by introducing Feature Importance across Domains (FiDo). We estimate feature importance on spectral and time-domain acoustic features as well as latent representations of Whisper. Importance weights are calculated per frame, and based on these weights, features are projected into new spaces, allowing the model to focus on important areas early. Next, feature concatenation is performed to combine the features before the assessment module processes them. Experimental results show that when FiDo is incorporated into the improved multi-branched speech intelligibility model MBI-Net+, RMSE can be reduced by 7.62% (from 26.10 to 24.11). MBI-Net+ with FiDo also achieves a relative RMSE reduction of 3.98% compared to the best system in the 2023 Clarity Prediction Challenge. These results validate FiDo's effectiveness in enhancing neural speech assessment in HA.
Chinese Summary: 本文提出跨领域特征重要性方法(FiDo),通过动态加权和投影声学特征来增强助听器中的非侵入式语音清晰度评估,实验证明其可将均方根误差降低7.62%。
English Summary: This paper introduces Feature Importance across Domains (FiDo) to enhance non-intrusive speech intelligibility assessment in hearing aids by dynamically weighting and projecting features, achieving a 7.62% RMSE reduction in experiments.
Authors:Di Wen, Junwei Zheng, Ruiping Liu, Yi Xu, Kunyu Peng, Rainer Stiefelhagen
Abstract:
Industrial assembly tasks increasingly demand rapid adaptation to complex procedures and varied components, yet are often conducted in environments with limited computing, connectivity, and strict privacy requirements. These constraints make conventional cloud-based or fully autonomous solutions impractical for factory deployment. This paper introduces a mobile-device-based assistant system for industrial training and operational support, enabling real-time, semi-hands-free interaction through on-device perception and voice interfaces. The system integrates lightweight object detection, speech recognition, and Retrieval-Augmented Generation (RAG) into a modular on-device pipeline that operates entirely on-device, enabling intuitive support for part handling and procedure understanding without relying on manual supervision or cloud services. To enable scalable training, we adopt an automated data construction pipeline and introduce a two-stage refinement strategy to improve visual robustness under domain shift. Experiments on our generated dataset, i.e., Gear8, demonstrate improved robustness to domain shift and common visual corruptions. A structured user study further confirms its practical viability, with positive user feedback on the clarity of the guidance and the quality of the interaction. These results indicate that our framework offers a deployable solution for real-time, privacy-preserving smart assistance in industrial environments. We will release the Gear8 dataset and source code upon acceptance.
中文: 本文提出了一种基于移动设备的工业培训与操作辅助系统,通过集成轻量级感知与语音交互实现无需云端依赖的实时隐私保护指导,实验验证了其在领域适应性及用户交互质量方面的显著优势。
English: This paper presents a mobile-based assistant system for industrial training and support, utilizing on-device perception and voice interfaces to provide real-time, privacy-preserving guidance without cloud dependency, validated by improved robustness and positive user feedback.
Authors:Heitor R. Medeiros, Atif Belal, Masih Aminbeidokhti, Eric Granger, Marco Pedersoli
Abstract:
Object detection (OD) in infrared (IR) imagery is critical for low-light and nighttime applications. However, the scarcity of large-scale IR datasets forces models to rely on weights pre-trained on RGB images. While fine-tuning on IR improves accuracy, it often compromises robustness under distribution shifts due to the inherent modality gap between RGB and IR. To address this, we introduce LLVIP-C and FLIR-C, two cross-modality out-of-distribution (OOD) benchmarks built by applying corruption to standard IR datasets. Additionally, to fully leverage the complementary knowledge from RGB and infrared trained models, we propose WiSE-OD, a weight-space ensembling method with two variants: WiSE-OD$_{ZS}$, which combines RGB zero-shot and IR fine-tuned weights, and WiSE-OD$_{LP}$, which blends zero-shot and linear probing. Evaluated across three RGB-pretrained detectors and two robust baselines, WiSE-OD improves both cross-modality and corruption robustness without any additional training or inference cost.
中文摘要:本文提出了WiSE-OD权重空间集成方法,通过融合RGB和红外模型知识来提升红外图像中的目标检测性能,并建立了两个跨模态基准数据集,以解决RGB与红外数据间模态差异导致的鲁棒性问题。
English Summary: This paper introduces WiSE-OD, a weight-space ensembling method that enhances object detection in infrared imagery by combining RGB and IR model knowledge, along with two cross-modality benchmarks to address robustness issues caused by the modality gap between RGB and IR data.
Authors:Yan Li, Guangyi Chen, Yunlong Deng, Zijian Li, Zeyu Tang, Anpeng Wu, Kun Zhang
Abstract:
Most existing methods for adapting models to out-of-distribution (OOD) domains rely on invariant representation learning to eliminate the influence of biased features. However, should bias always be eliminated -- and if not, when should it be retained, and how can it be leveraged? To address these questions, we first present a theoretical analysis that explores the conditions under which biased features can be identified and effectively utilized. Building on this theoretical foundation, we introduce a novel framework that strategically leverages bias to complement invariant representations during inference. The framework comprises two key components that leverage bias in both direct and indirect ways: (1) using invariance as guidance to extract predictive ingredients from bias, and (2) exploiting identified bias to estimate the environmental condition and then use it to explore appropriate bias-aware predictors to alleviate environment gaps. We validate our approach through experiments on both synthetic datasets and standard domain generalization benchmarks. Results consistently demonstrate that our method outperforms existing approaches, underscoring its robustness and adaptability.
中文: 本文提出了一种新颖框架,通过在推理过程中策略性地利用有偏特征来补充不变表示,并在合成数据集和领域泛化基准上的实验验证中展现出优于现有方法的性能。
English: This paper proposes a novel framework that strategically leverages biased features to complement invariant representations during inference, demonstrating superior performance over existing methods through theoretical analysis and experimental validation on synthetic datasets and domain generalization benchmarks.
Authors:Hui-Guan Yuan, Ryandhimas E. Zezario, Shafique Ahmed, Hsin-Min Wang, Kai-Lung Hua, Yu Tsao
Abstract:
Hearing loss simulation models are essential for hearing aid deployment. However, existing models have high computational complexity and latency, which limits real-time applications and lack direct integration with speech processing systems. To address these issues, we propose Neuro-MSBG, a lightweight end-to-end model with a personalized audiogram encoder for effective time-frequency modeling. Experiments show that Neuro-MSBG supports parallel inference and retains the intelligibility and perceptual quality of the original MSBG, with a Spearman's rank correlation coefficient (SRCC) of 0.9247 for Short-Time Objective Intelligibility (STOI) and 0.8671 for Perceptual Evaluation of Speech Quality (PESQ). Neuro-MSBG reduces simulation runtime by a factor of 46 (from 0.970 seconds to 0.021 seconds for a 1 second input), further demonstrating its efficiency and practicality.
中文: Neuro-MSBG是一种轻量级端到端模型,通过个性化听力图编码器将模拟运行时间减少46倍,同时保持高清晰度和语音质量,有效解决了现有听力损失模拟模型计算复杂度过高的问题。
English: Neuro-MSBG is a lightweight end-to-end model with a personalized audiogram encoder that significantly reduces simulation runtime by 46 times while maintaining high intelligibility and speech quality, addressing the computational limitations of existing hearing loss simulation models.
Authors:Yi Zhang, An Zhang, XiuYu Zhang, Leheng Sheng, Yuxin Chen, Zhenkai Liang, Xiang Wang
Abstract:
Large language models (LLMs), despite possessing latent safety understanding from their vast pretraining data, remain vulnerable to generating harmful content and exhibit issues such as over-refusal and utility degradation after safety alignment. Current safety alignment methods often result in superficial refusal shortcuts or rely on intensive supervision for reasoning-based approaches, failing to fully leverage the model's intrinsic safety self-awareness. We propose \textbf{AlphaAlign}, a simple yet effective pure reinforcement learning (RL) framework with verifiable safety reward designed to incentivize this latent safety awareness through proactive safety reasoning.} AlphaAlign employs a dual-reward system: a verifiable safety reward encourages correctly formatted and explicitly justified refusals for harmful queries while penalizing over-refusals, and a normalized helpfulness reward guides high-quality responses to benign inputs. This allows the model to develop proactive safety reasoning capabilities without depending on supervised safety-specific reasoning data. AlphaAlign demonstrates three key advantages: (1) Simplicity and efficiency, requiring only binary prompt safety labels and minimal RL steps for substantial improvements. (2) Breaking the safety-utility trade-off, by enhancing refusal of harmful content and reducing over-refusals, while simultaneously maintaining or even improving general task performance and robustness to unseen jailbreaks. (3) Deep alignment, fostering proactive safety reasoning that generates explicit safety rationales rather than relying on shallow refusal patterns.
中文:AlphaAlign是一种强化学习框架,通过双重奖励机制激发大语言模型的内在安全认知,促使其主动进行安全推理,从而在减少过度拒绝的同时提升有害内容拦截能力,无需依赖大量监督数据。
English: AlphaAlign is a reinforcement learning framework that enhances LLMs' safety by using dual rewards to encourage explicit safety reasoning and reduce over-refusals, improving both security and utility without extensive supervision.
Authors:Shuliang Liu, Qi Zheng, Jesse Jiaxi Xu, Yibo Yan, Junyan Zhang, He Geng, Aiwei Liu, Peijie Jiang, Jia Liu, Yik-Cheung Tam, Xuming Hu
Abstract:
Vision-language models demand watermarking solutions that protect intellectual property without compromising multimodal coherence. Existing text watermarking methods disrupt visual-textual alignment through biased token selection and static strategies, leaving semantic-critical concepts vulnerable. We propose VLA-Mark, a vision-aligned framework that embeds detectable watermarks while preserving semantic fidelity through cross-modal coordination. Our approach integrates multiscale visual-textual alignment metrics, combining localized patch affinity, global semantic coherence, and contextual attention patterns, to guide watermark injection without model retraining. An entropy-sensitive mechanism dynamically balances watermark strength and semantic preservation, prioritizing visual grounding during low-uncertainty generation phases. Experiments show 7.4% lower PPL and 26.6% higher BLEU than conventional methods, with near-perfect detection (98.8% AUC). The framework demonstrates 96.1\% attack resilience against attacks such as paraphrasing and synonym substitution, while maintaining text-visual consistency, establishing new standards for quality-preserving multimodal watermarking
Chinese: VLA-Mark是一种视觉对齐水印框架,通过跨模态协调嵌入可检测水印并保持语义保真度,相比现有方法在检测精度、抗攻击能力和文本-视觉一致性方面均表现出更优异的性能。
English: VLA-Mark is a vision-aligned watermarking framework that embeds detectable watermarks while preserving semantic fidelity through cross-modal coordination, achieving superior performance in detection accuracy, attack resilience, and text-visual consistency compared to existing methods.
Authors:Yonghua Hei, Yibo Yan, Shuliang Liu, Huiyu Zhou, Linfeng Zhang, Xuming Hu
Abstract:
End-to-end Large Speech Language Models~(\textbf{LSLMs}) demonstrate strong potential in response latency and speech comprehension capabilities, showcasing general intelligence across speech understanding tasks. However, the ability to follow speech instructions has not been fully realized due to the lack of datasets and heavily biased training tasks. Leveraging the rich ASR datasets, previous approaches have used Large Language Models~(\textbf{LLMs}) to continue the linguistic information of speech to construct speech instruction datasets. Yet, due to the gap between LLM-generated results and real human responses, the continuation methods further amplify these shortcomings. Given the high costs of collecting and annotating speech instruction datasets by humans, using speech synthesis to construct large-scale speech instruction datasets has become a balanced and robust alternative. Although modern Text-To-Speech~(\textbf{TTS}) models have achieved near-human-level synthesis quality, it is challenging to appropriately convert out-of-distribution text instruction to speech due to the limitations of the training data distribution in TTS models. To address this issue, we propose a query rewriting framework with multi-LLM knowledge fusion, employing multiple agents to annotate and validate the synthesized speech, making it possible to construct high-quality speech instruction datasets without relying on human annotation. Experiments show that this method can transform text instructions into distributions more suitable for TTS models for speech synthesis through zero-shot rewriting, increasing data usability from 72\% to 93\%. It also demonstrates unique advantages in rewriting tasks that require complex knowledge and context-related abilities.
中文: 端到端大型语音语言模型在响应延迟和语音理解方面潜力显著,但因缺乏数据集和训练偏差,其遵循语音指令的能力尚未完全实现;为此提出基于多LLM知识融合的查询重写框架,通过零样本重写将文本指令转换为更适合TTS模型合成的分布,使数据可用性从72%提升至93%,无需人工标注即可构建高质量语音指令数据集。
English: End-to-end large speech language models show promise in low latency and speech understanding but face challenges in following instructions due to dataset scarcity and training biases, leading to a proposed query rewriting framework with multi-LLM knowledge fusion that enhances TTS model compatibility and boosts data usability from 72% to 93% without human annotation.
Authors:Jierun Chen, Tiezheng Yu, Haoli Bai, Lewei Yao, Jiannan Wu, Kaican Li, Fei Mi, Chaofan Tao, Lei Zhu, Manyi Zhang, Xiaohui Li, Lu Hou, Lifeng Shang, Qun Liu
Abstract:
Large vision-language models (VLMs) increasingly adopt post-training techniques such as long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL) to elicit sophisticated reasoning. While these methods exhibit synergy in language-only models, their joint effectiveness in VLMs remains uncertain. We present a systematic investigation into the distinct roles and interplay of long-CoT SFT and RL across multiple multimodal reasoning benchmarks. We find that SFT improves performance on difficult questions by in-depth, structured reasoning, but introduces verbosity and degrades performance on simpler ones. In contrast, RL promotes generalization and brevity, yielding consistent improvements across all difficulty levels, though the improvements on the hardest questions are less prominent compared to SFT. Surprisingly, combining them through two-staged, interleaved, or progressive training strategies, as well as data mixing and model merging, all fails to produce additive benefits, instead leading to trade-offs in accuracy, reasoning style, and response length. This ``synergy dilemma'' highlights the need for more seamless and adaptive approaches to unlock the full potential of combined post-training techniques for reasoning VLMs.
大型视觉语言模型通过长思维链微调提升复杂推理能力,而强化学习则促进泛化与简洁性,但两者结合未能产生叠加效益,这种协同困境凸显了需要更自适应融合策略的必要性。
Large vision-language models benefit from long chain-of-thought fine-tuning for complex reasoning and reinforcement learning for generalization, yet combining these methods fails to produce additive gains, revealing a synergy dilemma that requires more adaptive integration strategies.
Authors:Zonghai Yao, Youxia Zhao, Avijit Mitra, David A. Levy, Emily Druhl, Jack Tsai, Hong Yu
Abstract:
Eviction is a significant yet understudied social determinants of health (SDoH), linked to housing instability, unemployment, and mental health. While eviction appears in unstructured electronic health records (EHRs), it is rarely coded in structured fields, limiting downstream applications. We introduce SynthEHR-Eviction, a scalable pipeline combining LLMs, human-in-the-loop annotation, and automated prompt optimization (APO) to extract eviction statuses from clinical notes. Using this pipeline, we created the largest public eviction-related SDoH dataset to date, comprising 14 fine-grained categories. Fine-tuned LLMs (e.g., Qwen2.5, LLaMA3) trained on SynthEHR-Eviction achieved Macro-F1 scores of 88.8% (eviction) and 90.3% (other SDoH) on human validated data, outperforming GPT-4o-APO (87.8%, 87.3%), GPT-4o-mini-APO (69.1%, 78.1%), and BioBERT (60.7%, 68.3%), while enabling cost-effective deployment across various model sizes. The pipeline reduces annotation effort by over 80%, accelerates dataset creation, enables scalable eviction detection, and generalizes to other information extraction tasks.
Chinese Summary: 本研究提出的SynthEHR-Eviction管道结合大语言模型和人工标注,能从临床记录中提取驱逐数据,创建了最大的公共驱逐相关健康数据集,在性能超越现有模型的同时将标注工作量减少了80%以上。
English Summary: This study introduces SynthEHR-Eviction, a scalable pipeline combining LLMs and human annotation to extract eviction data from clinical notes, creating the largest public eviction-related health dataset and demonstrating superior performance over existing models while reducing annotation effort by over 80%.
Authors:Hieu Tran, Zonghai Yao, Won Seok Jang, Sharmin Sultana, Allen Chang, Yuan Zhang, Hong Yu
Abstract:
Generative AI has demonstrated strong potential in healthcare, from clinical decision support to patient-facing chatbots that improve outcomes. A critical challenge for deployment is effective human-AI communication, where content must be both personalized and understandable. We introduce MedReadCtrl, a readability-controlled instruction tuning framework that enables LLMs to adjust output complexity without compromising meaning. Evaluations of nine datasets and three tasks across medical and general domains show that MedReadCtrl achieves significantly lower readability instruction-following errors than GPT-4 (e.g., 1.39 vs. 1.59 on ReadMe, p<0.001) and delivers substantial gains on unseen clinical tasks (e.g., +14.7 ROUGE-L, +6.18 SARI on MTSamples). Experts consistently preferred MedReadCtrl (71.7% vs. 23.3%), especially at low literacy levels. These gains reflect MedReadCtrl's ability to restructure clinical content into accessible, readability-aligned language while preserving medical intent, offering a scalable solution to support patient education and expand equitable access to AI-enabled care.
中文: MedReadCtrl是一种新颖的框架,通过在不失原意的情况下调整可读性,使大语言模型能够生成个性化和易懂的医疗内容,在评估和专家偏好中显著优于GPT-4,有助于提升医疗服务的公平性。
English: MedReadCtrl is a novel framework that enables large language models to generate personalized and understandable medical content by adjusting readability without losing meaning, significantly outperforming GPT-4 in evaluations and expert preferences for improving equitable healthcare access.
Authors:Zhiwei Zhang, Hui Liu, Xiaomin Li, Zhenwei Dai, Jingying Zeng, Fali Wang, Minhua Lin, Ramraj Chandradevan, Zhen Li, Chen Luo, Xianfeng Tang, Qi He, Suhang Wang
Abstract:
Reward models trained on human preference data have demonstrated strong effectiveness in aligning Large Language Models (LLMs) with human intent under the framework of Reinforcement Learning from Human Feedback (RLHF). However, RLHF remains vulnerable to reward hacking, where the policy exploits imperfections in the reward function rather than genuinely learning the intended behavior. Although significant efforts have been made to mitigate reward hacking, they predominantly focus on and evaluate in-distribution scenarios, where the training and testing data for the reward model share the same distribution. In this paper, we empirically show that state-of-the-art methods struggle in more challenging out-of-distribution (OOD) settings. We further demonstrate that incorporating fine-grained multi-attribute scores helps address this challenge. However, the limited availability of high-quality data often leads to weak performance of multi-objective reward functions, which can negatively impact overall performance and become the bottleneck. To address this issue, we propose a unified reward modeling framework that jointly trains Bradley--Terry (BT) single-objective and multi-objective regression-based reward functions using a shared embedding space. We theoretically establish a connection between the BT loss and the regression objective and highlight their complementary benefits. Specifically, the regression task enhances the single-objective reward function's ability to mitigate reward hacking in challenging OOD settings, while BT-based training improves the scoring capability of the multi-objective reward function, enabling a 7B model to outperform a 70B baseline. Extensive experimental results demonstrate that our framework significantly improves both the robustness and the scoring performance of reward models.
中文摘要:基于人类偏好的奖励模型在RLHF框架下易受奖励破解影响且分布外泛化能力不足,而结合Bradley-Terry与回归目标的新统一框架通过优势互补,显著提升了奖励模型的鲁棒性和评分性能。
English Summary: Reward models in RLHF are prone to reward hacking and struggle with out-of-distribution generalization, but a new unified framework combining Bradley-Terry and regression-based objectives enhances robustness and performance by leveraging their complementary strengths.
Authors:Shuliang Liu, Hongyi Liu, Aiwei Liu, Bingchen Duan, Qi Zheng, Yibo Yan, He Geng, Peijie Jiang, Jia Liu, Xuming Hu
Abstract:
The widespread deployment of large language models (LLMs) across critical domains has amplified the societal risks posed by algorithmically generated misinformation. Unlike traditional false content, LLM-generated misinformation can be self-reinforcing, highly plausible, and capable of rapid propagation across multiple languages, which traditional detection methods fail to mitigate effectively. This paper introduces a proactive defense paradigm, shifting from passive post hoc detection to anticipatory mitigation strategies. We propose a Three Pillars framework: (1) Knowledge Credibility, fortifying the integrity of training and deployed data; (2) Inference Reliability, embedding self-corrective mechanisms during reasoning; and (3) Input Robustness, enhancing the resilience of model interfaces against adversarial attacks. Through a comprehensive survey of existing techniques and a comparative meta-analysis, we demonstrate that proactive defense strategies offer up to 63\% improvement over conventional methods in misinformation prevention, despite non-trivial computational overhead and generalization challenges. We argue that future research should focus on co-designing robust knowledge foundations, reasoning certification, and attack-resistant interfaces to ensure LLMs can effectively counter misinformation across varied domains.
中文: 本文提出了一种主动防御框架,包含知识可信性、推理可靠性和输入鲁棒性三大支柱,以应对大语言模型生成虚假信息的加剧风险,相比传统方法在预防效果上提升了63%,尽管存在计算和泛化方面的挑战。
English: This paper proposes a proactive defense framework with three pillars—Knowledge Credibility, Inference Reliability, and Input Robustness—to counter the heightened risks of LLM-generated misinformation, demonstrating a 63% improvement over traditional methods despite computational and generalization challenges.
Authors:Linfeng Ye, Shayan Mohajer Hamidi, Guang Li, Takahiro Ogawa, Miki Haseyama, Konstantinos N. Plataniotis
Abstract:
Dataset distillation aims to create a compact dataset that retains essential information while maintaining model performance. Diffusion models (DMs) have shown promise for this task but struggle in low images-per-class (IPC) settings, where generated samples lack diversity. In this paper, we address this issue from an information-theoretic perspective by identifying two key types of information that a distilled dataset must preserve: ($i$) prototype information $\mathrm{I}(X;Y)$, which captures label-relevant features; and ($ii$) contextual information $\mathrm{H}(X | Y)$, which preserves intra-class variability. Here, $(X,Y)$ represents the pair of random variables corresponding to the input data and its ground truth label, respectively. Observing that the required contextual information scales with IPC, we propose maximizing $\mathrm{I}(X;Y) + β\mathrm{H}(X | Y)$ during the DM sampling process, where $β$ is IPC-dependent. Since directly computing $\mathrm{I}(X;Y)$ and $\mathrm{H}(X | Y)$ is intractable, we develop variational estimations to tightly lower-bound these quantities via a data-driven approach. Our approach, information-guided diffusion sampling (IGDS), seamlessly integrates with diffusion models and improves dataset distillation across all IPC settings. Experiments on Tiny ImageNet and ImageNet subsets show that IGDS significantly outperforms existing methods, particularly in low-IPC regimes. The code will be released upon acceptance.
中文: 本文提出信息引导扩散采样方法,通过变分估计同时保留原型信息和上下文信息,显著提升了数据集蒸馏效果,在低图像类别样本场景中表现尤为突出。
English: This paper introduces Information-Guided Diffusion Sampling (IGDS), a method that enhances dataset distillation by preserving both prototype and contextual information through variational estimation, significantly outperforming existing techniques especially in low image-per-class settings.
Authors:Weixing Chen, Dafeng Chi, Yang Liu, Yuxi Yang, Yexin Zhang, Yuzheng Zhuang, Xingyue Quan, Jianye Hao, Guanbin Li, Liang Lin
Abstract:
The automated generation of layouts is vital for embodied intelligence and autonomous systems, supporting applications from virtual environment construction to home robot deployment. Current approaches, however, suffer from spatial hallucination and struggle with balancing semantic fidelity and physical plausibility, often producing layouts with deficits such as floating or overlapping objects and misaligned stacking relation. In this paper, we propose AutoLayout, a fully automated method that integrates a closed-loop self-validation process within a dual-system framework. Specifically, a slow system harnesses detailed reasoning with a Reasoning-Reflection-Generation (RRG) pipeline to extract object attributes and spatial constraints. Then, a fast system generates discrete coordinate sets and a topological relation set that are jointly validated. To mitigate the limitations of handcrafted rules, we further introduce an LLM-based Adaptive Relation Library (ARL) for generating and evaluating layouts. Through the implementation of Slow-Fast Collaborative Reasoning, the AutoLayout efficiently generates layouts after thorough deliberation, effectively mitigating spatial hallucination. Its self-validation mechanism establishes a closed-loop process that iteratively corrects potential errors, achieving a balance between physical stability and semantic consistency. The effectiveness of AutoLayout was validated across 8 distinct scenarios, where it demonstrated a significant 10.1% improvement over SOTA methods in terms of physical plausibility, semantic consistency, and functional completeness.
中文摘要:AutoLayout是一种采用双系统框架的自动布局方法,通过慢-快协同推理和自验证机制有效减少空间幻觉,在物理合理性、语义一致性和功能完整性方面比现有最优方法提升了10.1%。
English Summary: AutoLayout is an automated method that uses a dual-system framework with slow-fast collaborative reasoning and a self-validation process to effectively mitigate spatial hallucination, achieving a 10.1% improvement over state-of-the-art methods in layout generation.
Authors:Boyang Xue, Qi Zhu, Rui Wang, Sheng Wang, Hongru Wang, Fei Mi, Yasheng Wang, Lifeng Shang, Qun Liu, Kam-Fai Wong
Abstract:
Although demonstrating remarkable performance on reasoning tasks, Large Language Models (LLMs) still tend to fabricate unreliable responses when confronted with problems that are unsolvable or beyond their capability, severely undermining the reliability. Prior studies of LLM reliability have primarily focused on knowledge tasks to identify unanswerable questions, while mathematical reasoning tasks have remained unexplored due to the dearth of unsolvable math problems. To systematically investigate LLM reliability in mathematical reasoning tasks, we formulate the reliability evaluation for both solvable and unsolvable problems. We then develop a ReliableMath dataset which incorporates open-source solvable problems and high-quality unsolvable problems synthesized by our proposed construction workflow with human evaluations. Experiments are conducted on various LLMs with several key findings uncovered. LLMs fail to directly identify unsolvable problems and always generate fabricated responses. When instructing LLMs to indicate unsolvability using a reliable prompt, the reliability of larger-sized LLMs remains on solvable problems, but notably improves on unsolvable problems yet still falls short of solvable problems. However, small LLMs rarely show any progress despite employing reliable prompts. Therefore, we further propose an alignment strategy to enhance small LLMs' reliability, which can significantly improve LLM reliability performances on both in-domain and out-of-domain tasks.
中文: 大型语言模型常对无解数学题给出不可靠答案,但通过新数据集和对齐策略可显著提升其可靠性,尤其对小型模型效果明显。
English: Large Language Models often generate unreliable answers to unsolvable math problems, but a new dataset and alignment strategy can significantly improve their reliability, especially for smaller models.
Authors:Ningyuan Li, Zhilin Zhang, Tianyan Long, Yuyao Liu, Rongquan Bai, Yurong Chen, Xiaotie Deng, Pengjie Wang, Chuan Yu, Jian Xu, Bo Zheng
Abstract:
On e-commerce platforms, sellers typically bid for impressions from ad traffic to promote their products. However, for most sellers, the majority of their sales come from organic traffic. Consequently, the relationship between their ad spending and total sales remains uncertain, resulting in operational inefficiency. To address this issue, e-commerce platforms have recently introduced a novel platform-wide marketing service known as QuanZhanTui, which has reportedly enhanced marketing efficiency for sellers and driven substantial revenue growth for platforms. QuanZhanTui allows sellers to bid for impressions from the platform's entire traffic to boost their total sales without compromising the platform's user experience. In this paper, we investigate the mechanism design problem that arises from QuanZhanTui. The problem is formulated as a multi-objective optimization to balance sellers' welfare and platform's user experience. We first introduce the stock-constrained value maximizer model, which reflects sellers' dual requirements on marketing efficiency and platform-wide ROI. Then, we propose the Liquid Payment Auction (LPA), an auction designed to optimize the balanced objectives while accounting for sellers' requirements in the auto-bidding environment. It employs a simple payment rule based on sellers' liquid welfare, providing a clearer link between their investment and total sales. Under mild assumptions, we theoretically prove desirable properties of LPA, such as optimality and incentive compatibility. Extensive experiments demonstrate LPA's superior performance over conventional auctions in QuanZhanTui.
中文摘要:电商平台推出全站推广服务,让卖家通过竞价全流量曝光提升总销售额,同时采用基于流动福利的拍卖机制平衡卖家利益与平台用户体验。
English Summary: E-commerce platforms have introduced QuanZhanTui, a marketing service enabling sellers to bid for impressions across all traffic to boost total sales while balancing seller welfare and user experience through the proposed Liquid Payment Auction.
Authors:Bo Wang, Pengyang Wang, Chong Chen, Qi Sun, Jieke Shi, Chengran Yang, Ming Deng, Youfang Lin, Zhou Yang, David Lo
Abstract:
Mutation-based fuzzing is effective for uncovering compiler bugs, but designing high-quality mutators for modern languages with complex constructs (e.g., templates, macros) remains challenging. Existing methods rely heavily on manual design or human-in-the-loop correction, limiting scalability and cross-language generalizability.
We present Mut4All, a fully automated, language-agnostic framework that synthesizes mutators using Large Language Models (LLMs) and compiler-specific knowledge from bug reports. It consists of three agents: (1) a mutator invention agent that identifies mutation targets and generates mutator metadata using compiler-related insights; (2) a mutator implementation synthesis agent, fine-tuned to produce initial implementations; and (3) a mutator refinement agent that verifies and corrects the mutators via unit-test feedback.
Mut4All processes 1000 bug reports (500 Rust, 500 C++), yielding 319 Rust and 403 C++ mutators at ~$0.08 each via GPT-4o. Our customized fuzzer, using these mutators, finds 62 bugs in Rust compilers (38 new, 7 fixed) and 34 bugs in C++ compilers (16 new, 1 fixed). Mut4All outperforms existing methods in both unique crash detection and coverage, ranking first on Rust and second on C++.
中文: Mut4All是一种自动化、语言无关的框架,利用大语言模型和编译器错误报告中的知识来合成变异器,以极低成本在Rust和C++编译器中实现了高效的错误检测。
English: Mut4All is an automated, language-agnostic framework that uses LLMs and compiler insights from bug reports to synthesize mutators, achieving high bug detection rates in Rust and C++ compilers at minimal cost.
Authors:Jianming Chang, Jieke Shi, Yunbo Lyu, Xin Zhou, Lulu Wang, Zhou Yang, Bixin Li, David Lo
Abstract:
Static program slicing, which extracts the executable portions of a program that affect the values at a specific location, supports many software analysis tasks such as debugging and security auditing. However, traditional slicing tools rely on computationally expensive reachability analysis over dependency graphs, which struggle to scale to large programs and often fail to handle code with incomplete syntax. Recently emerged learning-based methods, while more robust to such cases, still fall short of achieving comparable performance to traditional methods on well-formed code.
In this work, we propose SliceMate, a novel static program slicing solution powered by Large Language Model (LLM) agents. It bypasses the need for explicit dependency graph construction and achieving superior slicing accuracy. Concretely, SliceMate integrates three specialized agents: (1) a synthesis agent that produces candidate slices by incrementally expanding the scan scope across functions and files guided by LLM-inferred dependencies; (2) a verification agent that performs conciseness and completeness checks of the candidate slices, detecting missing or irrelevant statements; and (3) a refinement agent that repairs the slices with minimal edits in accordance with the verification results. These agents are orchestrated by a control module that ensures timely convergence and outputs high-quality slices without manual intervention. For rigorous evaluation, we construct a new and high-quality benchmark, SliceBench, comprising 2,200 manually annotated Java and Python programs, with program lengths ranging from 5 to 8,577 lines, significantly larger than those in existing slicing benchmarks. Experimental results show that SliceMate greatly outperforms both traditional and learning-based slicing tools.
中文摘要:SliceMate是一种基于大语言模型代理的新型静态程序切片方法,通过合成、验证和优化三个专业代理绕过依赖图构建,在显著超越传统和基于学习的切片工具的同时实现了更高的准确性。
English Summary: SliceMate is a novel static program slicing method using LLM agents to bypass dependency graph construction, achieving superior accuracy through three specialized agents for synthesis, verification, and refinement of program slices.
Authors:Ivan Medennikov, Taejin Park, Weiqing Wang, He Huang, Kunal Dhawan, Jinhan Wang, Jagadeesh Balam, Boris Ginsburg
Abstract:
This paper presents a streaming extension for the Sortformer speaker diarization framework, whose key property is the arrival-time ordering of output speakers. The proposed approach employs an Arrival-Order Speaker Cache (AOSC) to store frame-level acoustic embeddings of previously observed speakers. Unlike conventional speaker-tracing buffers, AOSC orders embeddings by speaker index corresponding to their arrival time order, and is dynamically updated by selecting frames with the highest scores based on the model's past predictions. Notably, the number of stored embeddings per speaker is determined dynamically by the update mechanism, ensuring efficient cache utilization and precise speaker tracking. Experiments on benchmark datasets confirm the effectiveness and flexibility of our approach, even in low-latency setups. These results establish Streaming Sortformer as a robust solution for real-time multi-speaker tracking and a foundation for streaming multi-talker speech processing.
中文摘要:本文提出了Sortformer的流式扩展方法,采用到达顺序说话人缓存机制动态跟踪按到达时间排序的说话人,在低延迟设置下仍保持高效性能。
English Summary: This paper introduces a streaming extension for Sortformer that uses an Arrival-Order Speaker Cache to dynamically track speakers in real-time based on arrival order, demonstrating effective performance in low-latency scenarios.
Authors:Bo Fang, Jianan Fan, Dongnan Liu, Hang Chang, Gerald J. Shami, Filip Braet, Weidong Cai
Abstract:
The significant morphological and distributional variability among subcellular components poses a long-standing challenge for learning-based organelle segmentation models, significantly increasing the risk of biased feature learning. Existing methods often rely on single mapping relationships, overlooking feature diversity and thereby inducing biased training. Although the Segment Anything Model (SAM) provides rich feature representations, its application to subcellular scenarios is hindered by two key challenges: (1) The variability in subcellular morphology and distribution creates gaps in the label space, leading the model to learn spurious or biased features. (2) SAM focuses on global contextual understanding and often ignores fine-grained spatial details, making it challenging to capture subtle structural alterations and cope with skewed data distributions. To address these challenges, we introduce ScSAM, a method that enhances feature robustness by fusing pre-trained SAM with Masked Autoencoder (MAE)-guided cellular prior knowledge to alleviate training bias from data imbalance. Specifically, we design a feature alignment and fusion module to align pre-trained embeddings to the same feature space and efficiently combine different representations. Moreover, we present a cosine similarity matrix-based class prompt encoder to activate class-specific features to recognize subcellular categories. Extensive experiments on diverse subcellular image datasets demonstrate that ScSAM outperforms state-of-the-art methods.
中文摘要:提出的ScSAM方法通过融合SAM特征与MAE引导的细胞先验知识,解决了细胞器分割中因形态多样性和数据不平衡导致的训练偏差问题,借助特征对齐和类别提示编码器实现了优越的分割性能。
English Summary: The proposed ScSAM method integrates SAM's features with MAE-guided cellular priors to overcome biases in organelle segmentation caused by morphological diversity and data imbalance, achieving superior performance through feature alignment and class-specific activation.
Authors:Antoni Kowalczuk, Dominik Hintersdorf, Lukas Struppek, Kristian Kersting, Adam Dziedzic, Franziska Boenisch
Abstract:
Text-to-image diffusion models (DMs) have achieved remarkable success in image generation. However, concerns about data privacy and intellectual property remain due to their potential to inadvertently memorize and replicate training data. Recent mitigation efforts have focused on identifying and pruning weights responsible for triggering replication, based on the assumption that memorization can be localized. Our research assesses the robustness of these pruning-based approaches. We demonstrate that even after pruning, minor adjustments to text embeddings of input prompts are sufficient to re-trigger data replication, highlighting the fragility of these defenses. Furthermore, we challenge the fundamental assumption of memorization locality, by showing that replication can be triggered from diverse locations within the text embedding space, and follows different paths in the model. Our findings indicate that existing mitigation strategies are insufficient and underscore the need for methods that truly remove memorized content, rather than attempting to suppress its retrieval. As a first step in this direction, we introduce a novel adversarial fine-tuning method that iteratively searches for replication triggers and updates the model to increase robustness. Through our research, we provide fresh insights into the nature of memorization in text-to-image DMs and a foundation for building more trustworthy and compliant generative AI.
中文: 本研究揭示了当前基于剪枝的防御方法在文生图扩散模型中效果有限,数据复制可通过细微提示调整被重新触发,并提出一种对抗性微调方法,通过从根源处理记忆问题来增强模型鲁棒性。
English: This study reveals that current pruning-based defenses in text-to-image diffusion models are ineffective, as data replication can be easily reactivated through minor prompt adjustments, and introduces an adversarial fine-tuning method to enhance model robustness by addressing memorization at its root.
Authors:Zihao Hu, Jia Yan, Ying-Jun Angela Zhang, Jun Zhang, Khaled B. Letaief
Abstract:
The rapid proliferation and growth of artificial intelligence (AI) has led to the development of federated learning (FL). FL allows wireless devices (WDs) to cooperatively learn by sharing only local model parameters, without needing to share the entire dataset. However, the emergence of large AI models has made existing FL approaches inefficient, due to the significant communication overhead required. In this paper, we propose a novel over-the-air federated distillation (FD) framework by synergizing the strength of FL and knowledge distillation to avoid the heavy local model transmission. Instead of sharing the model parameters, only the WDs' model outputs, referred to as knowledge, are shared and aggregated over-the-air by exploiting the superposition property of the multiple-access channel. We shall study the transceiver design in over-the-air FD, aiming to maximize the learning convergence rate while meeting the power constraints of the transceivers. The main challenge lies in the intractability of the learning performance analysis, as well as the non-convex nature and the optimization spanning the whole FD training period. To tackle this problem, we first derive an analytical expression of the convergence rate in over-the-air FD. Then, the closed-form optimal solutions of the WDs' transmit power and the estimator for over-the-air aggregation are obtained given the receiver combining strategy. Accordingly, we put forth an efficient approach to find the optimal receiver beamforming vector via semidefinite relaxation. We further prove that there is no optimality gap between the original and relaxed problem for the receiver beamforming design. Numerical results will show that the proposed over-the-air FD approach achieves a significant reduction in communication overhead, with only a minor compromise in testing accuracy compared to conventional FL benchmarks.
中文摘要:本文提出了一种新型空中联邦蒸馏框架,通过共享模型输出而非参数来降低通信开销,并设计了最优收发器方案以在功率限制下最大化学习收敛速度。
English Summary: The paper introduces an over-the-air federated distillation framework that reduces communication overhead by sharing only model outputs instead of parameters, and proposes optimal transceiver designs to maximize learning convergence under power constraints.
Authors:Fei Liu, Rui Zhang, Xi Lin, Zhichao Lu, Qingfu Zhang
Abstract:
The integration of large language models (LLMs) into automated algorithm design has shown promising potential. A prevalent approach embeds LLMs within search routines to iteratively generate and refine candidate algorithms. However, most existing methods rely on off-the-shelf LLMs trained for general coding tasks,leaving a key question open: Do we need LLMs specifically tailored for algorithm design? If so, how can such LLMs be effectively obtained and how well can they generalize across different algorithm design tasks? In this paper, we take a first step toward answering these questions by exploring fine-tuning of LLMs for algorithm design. We introduce a Diversity-Aware Rank based (DAR) sampling strategy to balance training data diversity and quality, then we leverage direct preference optimization to efficiently align LLM outputs with task objectives. Our experiments, conducted on Llama-3.2-1B-Instruct and Llama- 3.1-8B-Instruct, span three distinct algorithm design tasks. Results suggest that finetuned LLMs can significantly outperform their off-the-shelf counterparts with the smaller Llama-3.2-1B-Instruct and match the larger Llama-3.1-8B-Instruct on the admissible set problem. Moreover, we observe promising generalization: LLMs finetuned on specific algorithm design tasks also improve performance on related tasks with varying settings. These findings highlight the value of task-specific adaptation for LLMs in algorithm design and open new avenues for future research.
中文: 研究表明,专门针对算法设计进行微调的大语言模型性能显著提升,其中较小的Llama-3.2-1B-Instruct不仅超越通用模型,在特定任务上甚至能与更大模型匹敌,同时在相关算法设计任务中展现出良好的泛化能力。
English: This study demonstrates that fine-tuning large language models specifically for algorithm design significantly enhances their performance, with the smaller Llama-3.2-1B-Instruct outperforming standard models and matching larger counterparts in certain tasks, while also showing promising generalization across related algorithm design challenges.
Authors:Hong Niu, Jiancheng An, Tuo Wu, Jiangong Chen, Yufei Zhao, Yong Liang Guan, Marco Di Renzo, Merouane Debbah, George K. Karagiannidis, H. Vincent Poor, Chau Yuen
Abstract:
Stacked intelligent metasurfaces (SIMs), which integrate multiple programmable metasurface layers, have recently emerged as a promising technology for advanced wave-domain signal processing. SIMs benefit from flexible spatial degree-of-freedom (DoF) while reducing the requirement for costly radio-frequency (RF) chains. However, current state-of-the-art SIM designs face challenges such as complex phase shift optimization and energy attenuation from multiple layers. To address these aspects, we propose incorporating meta-fibers into SIMs, with the aim of reducing the number of layers and enhancing the energy efficiency. First, we introduce a meta-fiber-connected 2-layer SIM that exhibits the same flexible signal processing capabilities as conventional multi-layer structures, and explains the operating principle. Subsequently, we formulate and solve the optimization problem of minimizing the mean square error (MSE) between the SIM channel and the desired channel matrices. Specifically, by designing the phase shifts of the meta-atoms associated with the transmitting-SIM and receiving-SIM, a non-interference system with parallel subchannels is established. In order to reduce the computational complexity, a closed-form expression for each phase shift at each iteration of an alternating optimization (AO) algorithm is proposed. We show that the proposed algorithm is applicable to conventional multi-layer SIMs. The channel capacity bound and computational complexity are analyzed to provide design insights. Finally, numerical results are illustrated, demonstrating that the proposed two-layer SIM with meta-fiber achieves over a 25% improvement in channel capacity while reducing the total number of meta-atoms by 59% as compared with a conventional seven-layer SIM.
中文: 提出的集成超光纤双层智能超表面在降低复杂度和元件数量的同时,相比传统多层设计显著提升了能效和信道容量。
English: The proposed meta-fiber-integrated two-layer SIM enhances energy efficiency and channel capacity while reducing complexity and component count compared to conventional multi-layer designs.
Authors:Artur Muratov, Hana Fatima Shaikh, Vanshikaa Jani, Tarek Mahmoud, Zhuohan Xie, Daniil Orel, Aaryamonvikram Singh, Yuxia Wang, Aadi Joshi, Hasan Iqbal, Ming Shan Hee, Dhruv Sahnan, Nikolaos Nikolaidis, Purificação Silvano, Dimitar Dimitrov, Roman Yangarber, Ricardo Campos, AlÃpio Jorge, Nuno Guimarães, Elisa Sartori, Nicolas Stefanovitch, Giovanni Da San Martino, Jakub Piskorski, Preslav Nakov
Abstract:
We present FRaN-X, a Framing and Narratives Explorer that automatically detects entity mentions and classifies their narrative roles directly from raw text. FRaN-X comprises a two-stage system that combines sequence labeling with fine-grained role classification to reveal how entities are portrayed as protagonists, antagonists, or innocents, using a unique taxonomy of 22 fine-grained roles nested under these three main categories. The system supports five languages (Bulgarian, English, Hindi, Russian, and Portuguese) and two domains (the Russia-Ukraine Conflict and Climate Change). It provides an interactive web interface for media analysts to explore and compare framing across different sources, tackling the challenge of automatically detecting and labeling how entities are framed. Our system allows end users to focus on a single article as well as analyze up to four articles simultaneously. We provide aggregate level analysis including an intuitive graph visualization that highlights the narrative a group of articles are pushing. Our system includes a search feature for users to look up entities of interest, along with a timeline view that allows analysts to track an entity's role transitions across different contexts within the article. The FRaN-X system and the trained models are licensed under an MIT License. FRaN-X is publicly accessible at https://fran-x.streamlit.app/ and a video demonstration is available at https://youtu.be/VZVi-1B6yYk.
中文: FRaN-X 是一个多语言系统,能自动识别文本中的实体并将其分类为细粒度叙事角色,为媒体分析师提供交互式网页界面,以便探索和比较不同文章的框架。
English: FRaN-X is a multilingual system that automatically identifies entity mentions in text and classifies them into fine-grained narrative roles, offering an interactive web interface for media analysts to explore and compare framing across articles.
Authors:Matthias Zeller, Daniel Casado Herraez, Bengisu Ayan, Jens Behley, Michael Heidingsfeld, Cyrill Stachniss
Abstract:
Semantic scene understanding, including the perception and classification of moving agents, is essential to enabling safe and robust driving behaviours of autonomous vehicles. Cameras and LiDARs are commonly used for semantic scene understanding. However, both sensor modalities face limitations in adverse weather and usually do not provide motion information. Radar sensors overcome these limitations and directly offer information about moving agents by measuring the Doppler velocity, but the measurements are comparably sparse and noisy. In this paper, we address the problem of panoptic segmentation in sparse radar point clouds to enhance scene understanding. Our approach, called SemRaFiner, accounts for changing density in sparse radar point clouds and optimizes the feature extraction to improve accuracy. Furthermore, we propose an optimized training procedure to refine instance assignments by incorporating a dedicated data augmentation. Our experiments suggest that our approach outperforms state-of-the-art methods for radar-based panoptic segmentation.
中文: 本文提出SemRaFiner方法,通过处理稀疏雷达点云中的密度变化并优化特征提取,实现了雷达全景分割以增强自动驾驶的语义场景理解,其性能优于现有雷达技术。
English: This paper introduces SemRaFiner, a method for panoptic segmentation in sparse radar point clouds that addresses density variations and enhances feature extraction to improve semantic scene understanding for autonomous vehicles, outperforming existing radar-based approaches.
Authors:Song Lai, Haohan Zhao, Rong Feng, Changyi Ma, Wenzhuo Liu, Hongbo Zhao, Xi Lin, Dong Yi, Min Xie, Qingfu Zhang, Hongbin Liu, Gaofeng Meng, Fei Zhu
Abstract:
Continual post-training (CPT) is a popular and effective technique for adapting foundation models like multimodal large language models to specific and ever-evolving downstream tasks. While existing research has primarily concentrated on methods like data replay, model expansion, or parameter regularization, the fundamental role of the learning paradigm within CPT remains largely unexplored. This paper presents a comparative analysis of two core post-training paradigms: supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), investigating their respective impacts on knowledge retention during CPT. Our experiments are conducted on a benchmark comprising seven diverse multimodal tasks, utilizing Qwen2.5-VL-7B-Instruct as the base model for continual post-training. The investigation yields two significant findings: (1) When continuously learning on downstream tasks, SFT leads to catastrophic forgetting of previously learned tasks. In contrast, RFT inherently preserves prior knowledge and achieve performance comparable to multi-task training. (2) RFT successfully protects and even enhances the model's general knowledge on standard benchmarks (e.g., MMMU and MMLU-Pro). Conversely, SFT degrades general model capabilities severely. Further analysis reveals that this stability is not primarily due to explicit mechanisms like KL penalty or chain-of-thought reasoning. Instead, we identify an implicit regularization mechanism inherent to RFT as a key contributing factor. Our theoretical analysis suggests that RFT's gradient updates are naturally scaled by the reward variance, acting as a data-dependent regularizer that inherently protects previously acquired knowledge. Finally, we propose a rollout-based instance filtering algorithm to enhance the stability and efficiency of RFT. Our comprehensive study demonstrates the superiority of RFT as a robust paradigm for continual post-training.
中文: 研究表明,在持续后训练中,强化微调(RFT)通过内在的正则化机制能自然保留先前知识并维持通用能力,优于会导致灾难性遗忘的监督微调(SFT)。
English: This study demonstrates that reinforcement fine-tuning (RFT) outperforms supervised fine-tuning (SFT) in continual post-training by naturally preserving prior knowledge and maintaining general capabilities through an implicit regularization mechanism, while SFT suffers from catastrophic forgetting.
Authors:Song Lai, Haohan Zhao, Rong Feng, Changyi Ma, Wenzhuo Liu, Hongbo Zhao, Xi Lin, Dong Yi, Min Xie, Qingfu Zhang, Hongbin Liu, Gaofeng Meng, Fei Zhu
Abstract:
Continual post-training (CPT) is a popular and effective technique for adapting foundation models like multimodal large language models to specific and ever-evolving downstream tasks. While existing research has primarily concentrated on methods like data replay, model expansion, or parameter regularization, the fundamental role of the learning paradigm within CPT remains largely unexplored. This paper presents a comparative analysis of two core post-training paradigms: supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), investigating their respective impacts on knowledge retention during CPT. Our experiments are conducted on a benchmark comprising seven diverse multimodal tasks, utilizing Qwen2.5-VL-7B-Instruct as the base model for continual post-training. The investigation yields two significant findings: (1) When continuously learning on downstream tasks, SFT leads to catastrophic forgetting of previously learned tasks. In contrast, RFT inherently preserves prior knowledge and achieve performance comparable to multi-task training. (2) RFT successfully protects and even enhances the model's general knowledge on standard benchmarks (e.g., MMMU and MMLU-Pro). Conversely, SFT degrades general model capabilities severely. Further analysis reveals that this stability is not primarily due to explicit mechanisms like KL penalty or chain-of-thought reasoning. Instead, we identify an implicit regularization mechanism inherent to RFT as a key contributing factor. Our theoretical analysis suggests that RFT's gradient updates are naturally scaled by the reward variance, acting as a data-dependent regularizer that inherently protects previously acquired knowledge. Finally, we propose a rollout-based instance filtering algorithm to enhance the stability and efficiency of RFT. Our comprehensive study demonstrates the superiority of RFT as a robust paradigm for continual post-training.
中文: 研究表明,在持续后训练中,强化微调(RFT)通过内在的正则化机制能自然保留先前知识并维持通用能力,优于会导致灾难性遗忘的监督微调(SFT)。
English: This study demonstrates that reinforcement fine-tuning (RFT) outperforms supervised fine-tuning (SFT) in continual post-training by naturally preserving prior knowledge and maintaining general capabilities through an implicit regularization mechanism, while SFT suffers from catastrophic forgetting.
Authors:Xixi Wan, Aihua Zheng, Bo Jiang, Beibei Wang, Chenglong Li, Jin Tang
Abstract:
Multi-modal object Re-IDentification (ReID) has gained considerable attention with the goal of retrieving specific targets across cameras using heterogeneous visual data sources. Existing methods primarily aim to improve identification performance, but often overlook the uncertainty arising from inherent defects, such as intra-modal noise and inter-modal conflicts. This uncertainty is particularly significant in the case of fine-grained local occlusion and frame loss, which becomes a challenge in multi-modal learning. To address the above challenge, we propose a robust approach named Uncertainty-Guided Graph model for multi-modal object ReID (UGG-ReID). UGG-ReID is designed to mitigate noise interference and facilitate effective multi-modal fusion by estimating both local and sample-level aleatoric uncertainty and explicitly modeling their dependencies. Specifically, we first propose the Gaussian patch-graph representation model that leverages uncertainty to quantify fine-grained local cues and capture their structural relationships. This process boosts the expressiveness of modal-specific information, ensuring that the generated embeddings are both more informative and robust. Subsequently, we design an uncertainty-guided mixture of experts strategy that dynamically routes samples to experts exhibiting low uncertainty. This strategy effectively suppresses noise-induced instability, leading to enhanced robustness. Meanwhile, we design an uncertainty-guided routing to strengthen the multi-modal interaction, improving the performance. UGG-ReID is comprehensively evaluated on five representative multi-modal object ReID datasets, encompassing diverse spectral modalities. Experimental results show that the proposed method achieves excellent performance on all datasets and is significantly better than current methods in terms of noise immunity. Our code will be made public upon acceptance.
中文: 提出的UGG-ReID模型通过不确定性引导的图结构和专家路由策略,有效解决了多模态目标重识别中的噪声和模态间冲突问题,在多种数据集上实现了卓越的鲁棒性和性能表现。
English: The proposed UGG-ReID model addresses uncertainty from noise and inter-modal conflicts in multi-modal object ReID by using uncertainty-guided graphs and expert routing, achieving superior robustness and performance across diverse datasets.
Authors:Matthias Zeller, Vardeep S. Sandhu, Benedikt Mersch, Jens Behley, Michael Heidingsfeld, Cyrill Stachniss
Abstract:
The awareness about moving objects in the surroundings of a self-driving vehicle is essential for safe and reliable autonomous navigation. The interpretation of LiDAR and camera data achieves exceptional results but typically requires to accumulate and process temporal sequences of data in order to extract motion information. In contrast, radar sensors, which are already installed in most recent vehicles, can overcome this limitation as they directly provide the Doppler velocity of the detections and, hence incorporate instantaneous motion information within a single measurement. % In this paper, we tackle the problem of moving object segmentation in noisy radar point clouds. We also consider differentiating parked from moving cars, to enhance scene understanding. Instead of exploiting temporal dependencies to identify moving objects, we develop a novel transformer-based approach to perform single-scan moving object segmentation in sparse radar scans accurately. The key to our Radar Velocity Transformer is to incorporate the valuable velocity information throughout each module of the network, thereby enabling the precise segmentation of moving and non-moving objects. Additionally, we propose a transformer-based upsampling, which enhances the performance by adaptively combining information and overcoming the limitation of interpolation of sparse point clouds. Finally, we create a new radar moving object segmentation benchmark based on the RadarScenes dataset and compare our approach to other state-of-the-art methods. Our network runs faster than the frame rate of the sensor and shows superior segmentation results using only single-scan radar data.
中文摘要:本文提出了一种基于Transformer的新方法,通过利用雷达多普勒速度信息实现单次扫描的移动物体分割,无需累积时序数据即可区分运动与静止物体,在实时性能上表现优异。
English Summary: This paper introduces a novel transformer-based method for single-scan moving object segmentation in radar point clouds, leveraging Doppler velocity to distinguish moving from stationary objects without temporal data accumulation, achieving superior real-time performance.
Authors:Matthias Zeller, Daniel Casado Herraez, Jens Behley, Michael Heidingsfeld, Cyrill Stachniss
Abstract:
Robots and autonomous vehicles should be aware of what happens in their surroundings. The segmentation and tracking of moving objects are essential for reliable path planning, including collision avoidance. We investigate this estimation task for vehicles using radar sensing. We address moving instance tracking in sparse radar point clouds to enhance scene interpretation. We propose a learning-based radar tracker incorporating temporal offset predictions to enable direct center-based association and enhance segmentation performance by including additional motion cues. We implement attention-based tracking for sparse radar scans to include appearance features and enhance performance. The final association combines geometric and appearance features to overcome the limitations of center-based tracking to associate instances reliably. Our approach shows an improved performance on the moving instance tracking benchmark of the RadarScenes dataset compared to the current state of the art.
Chinese: 本研究提出了一种基于学习的雷达跟踪器,结合时间偏移预测和注意力机制,以增强稀疏雷达点云中的移动实例追踪能力,并在RadarScenes数据集上实现了优于现有技术的性能。
English: This study introduces a learning-based radar tracker that integrates temporal offset predictions and attention mechanisms to improve moving instance tracking in sparse radar point clouds, achieving state-of-the-art performance on the RadarScenes benchmark.
Authors:Yutian Liu, Zhengyi Yang, Jiancan Wu, Xiang Wang
Abstract:
Recent advances have applied large language models (LLMs) to sequential recommendation, leveraging their pre-training knowledge and reasoning capabilities to provide more personalized user experiences. However, existing LLM-based methods fail to sufficiently leverage the rich temporal information inherent in users' historical interaction sequences, stemming from fundamental architectural constraints: LLMs process information through self-attention mechanisms that lack inherent sequence ordering and rely on position embeddings designed primarily for natural language rather than user interaction sequences. This limitation significantly impairs their ability to capture the evolution of user preferences over time and predict future interests accurately.
To address this critical gap, we propose \underline{C}ounterfactual \underline{E}nhanced \underline{T}emporal Framework for LLM-Based \underline{Rec}ommendation (CETRec). CETRec is grounded in causal inference principles, which allow it to isolate and measure the specific impact of temporal information on recommendation outcomes. Combined with our counterfactual tuning task derived from causal analysis, CETRec effectively enhances LLMs' awareness of both absolute order (how recently items were interacted with) and relative order (the sequential relationships between items). Extensive experiments on real-world datasets demonstrate the effectiveness of our CETRec. Our code is available at https://anonymous.4open.science/r/CETRec-B9CE/.
中文摘要:现有基于大语言模型的推荐系统难以有效利用用户交互中的时间信息,限制了追踪偏好演变的能力,而提出的CETRec框架通过因果推理增强时间感知,解决了这一关键问题。
English Summary: Current LLM-based recommendation systems struggle to effectively utilize temporal data from user interactions, limiting their ability to track evolving preferences, but the proposed CETRec framework addresses this by applying causal inference to enhance temporal awareness in recommendations.
Authors:Yiming Li, Shuo Shao, Yu He, Junfeng Guo, Tianwei Zhang, Zhan Qin, Pin-Yu Chen, Michael Backes, Philip Torr, Dacheng Tao, Kui Ren
Abstract:
The (generative) artificial intelligence (AI) era has profoundly reshaped the meaning and value of data. No longer confined to static content, data now permeates every stage of the AI lifecycle from the training samples that shape model parameters to the prompts and outputs that drive real-world model deployment. This shift renders traditional notions of data protection insufficient, while the boundaries of what needs safeguarding remain poorly defined. Failing to safeguard data in AI systems can inflict societal and individual, underscoring the urgent need to clearly delineate the scope of and rigorously enforce data protection. In this perspective, we propose a four-level taxonomy, including non-usability, privacy preservation, traceability, and deletability, that captures the diverse protection needs arising in modern (generative) AI models and systems. Our framework offers a structured understanding of the trade-offs between data utility and control, spanning the entire AI pipeline, including training datasets, model weights, system prompts, and AI-generated content. We analyze representative technical approaches at each level and reveal regulatory blind spots that leave critical assets exposed. By offering a structured lens to align future AI technologies and governance with trustworthy data practices, we underscore the urgency of rethinking data protection for modern AI techniques and provide timely guidance for developers, researchers, and regulators alike.
中文: 人工智能时代使数据成为动态资产,亟需新的保护框架;为此我们提出包含不可用性、隐私保护、可追溯性和可删除性的四级分类法,以应对AI全生命周期的风险,并为技术与治理提供可信赖的实践指引。
English: The AI era has transformed data into a dynamic asset requiring new protection frameworks, prompting the proposal of a four-level taxonomy—non-usability, privacy preservation, traceability, and deletability—to address risks across the AI lifecycle and guide technology and governance toward trustworthy practices.
Authors:Khanh Son Pham, Christian Witte, Jens Behley, Johannes Betz, Cyrill Stachniss
Abstract:
Most autonomous cars rely on the availability of high-definition (HD) maps. Current research aims to address this constraint by directly predicting HD map elements from onboard sensors and reasoning about the relationships between the predicted map and traffic elements. Despite recent advancements, the coherent online construction of HD maps remains a challenging endeavor, as it necessitates modeling the high complexity of road topologies in a unified and consistent manner. To address this challenge, we propose a coherent approach to predict lane segments and their corresponding topology, as well as road boundaries, all by leveraging prior map information represented by commonly available standard-definition (SD) maps. We propose a network architecture, which leverages hybrid lane segment encodings comprising prior information and denoising techniques to enhance training stability and performance. Furthermore, we facilitate past frames for temporal consistency. Our experimental evaluation demonstrates that our approach outperforms previous methods by a large margin, highlighting the benefits of our modeling scheme.
中文: 本研究提出了一种利用先验标准地图数据、结合混合编码与时间一致性的方法,能够连贯预测高清地图中的车道段和道路边界元素,其性能远超现有技术。
English: This study introduces a coherent method for predicting HD map elements, including lane segments and road boundaries, by utilizing prior SD map data and incorporating hybrid encoding and temporal consistency techniques, which significantly outperforms existing approaches.
Authors:Gopika Sudhakaran, Hikaru Shindo, Patrick Schramowski, Simone Schaub-Meyer, Kristian Kersting, Stefan Roth
Abstract:
Visual relation detection (VRD) is the task of identifying the relationships between objects in a scene. VRD models trained solely on relation detection data struggle to generalize beyond the relations on which they are trained. While prompt tuning has been used to adapt vision-language models (VLMs) for VRD, it uses handcrafted prompts and struggles with novel or complex relations. We argue that instruction tuning offers a more effective solution by fine-tuning VLMs on diverse instructional data. We thus introduce ART, an Adaptive Relation Tuning framework that adapts VLMs for VRD through instruction tuning and strategic instance selection. By converting VRD datasets into an instruction tuning format and employing an adaptive sampling algorithm, ART directs the VLM to focus on informative relations while maintaining generalizability. Specifically, we focus on the relation classification, where subject-object boxes are given and the model predicts the predicate between them. We tune on a held-in set and evaluate across multiple held-out datasets of varying complexity. Our approach strongly improves over its baselines and can infer unseen relation concepts, a capability absent in mainstream VRD methods. We demonstrate ART's practical value by using the predicted relations for segmenting complex scenes.
中文: ART框架通过指令微调视觉语言模型,并采用自适应实例选择策略,显著提升了视觉关系检测的泛化能力,能够识别未知关系并分割复杂场景。
English: The ART framework enhances visual relation detection by instruction-tuning vision-language models on strategically selected data, enabling superior generalization to unseen relations and complex scene segmentation.
Authors:Fan Lyu, Linglan Zhao, Chengyan Liu, Yinying Mei, Zhang Zhang, Jian Zhang, Fuyuan Hu, Liang Wang
Abstract:
Few-Shot Class-Incremental Learning (FSCIL) focuses on models learning new concepts from limited data while retaining knowledge of previous classes. Recently, many studies have started to leverage unlabeled samples to assist models in learning from few-shot samples, giving rise to the field of Semi-supervised Few-shot Class-Incremental Learning (Semi-FSCIL). However, these studies often assume that the source of unlabeled data is only confined to novel classes of the current session, which presents a narrow perspective and cannot align well with practical scenarios. To better reflect real-world scenarios, we redefine Semi-FSCIL as Generalized Semi-FSCIL (GSemi-FSCIL) by incorporating both base and all the ever-seen novel classes in the unlabeled set. This change in the composition of unlabeled samples poses a new challenge for existing methods, as they struggle to distinguish between unlabeled samples from base and novel classes. To address this issue, we propose an Ambiguity-guided Learnable Distribution Calibration (ALDC) strategy. ALDC dynamically uses abundant base samples to correct biased feature distributions for few-shot novel classes. Experiments on three benchmark datasets show that our method outperforms existing works, setting new state-of-the-art results.
Chinese: 本文重新定义了半监督少样本类增量学习(Semi-FSCIL)为广义半监督少样本类增量学习(GSemi-FSCIL),通过在未标记集中同时包含基类和新类以更好地匹配实际场景,并提出基于模糊度引导的可学习分布校准(ALDC)策略来修正特征分布偏差,在基准数据集上取得了最先进的性能。
English: This paper redefines Semi-supervised Few-shot Class-Incremental Learning (Semi-FSCIL) as Generalized Semi-FSCIL (GSemi-FSCIL) to better align with real-world scenarios by including both base and novel classes in the unlabeled set, and introduces the Ambiguity-guided Learnable Distribution Calibration (ALDC) strategy to correct feature distribution biases, achieving state-of-the-art results on benchmark datasets.
Authors:Yaxin Xiao, Qingqing Ye, Li Hu, Huadi Zheng, Haibo Hu, Zi Liang, Haoyang Li, Yijie Jiao
Abstract:
Machine unlearning enables the removal of specific data from ML models to uphold the right to be forgotten. While approximate unlearning algorithms offer efficient alternatives to full retraining, this work reveals that they fail to adequately protect the privacy of unlearned data. In particular, these algorithms introduce implicit residuals which facilitate privacy attacks targeting at unlearned data. We observe that these residuals persist regardless of model architectures, parameters, and unlearning algorithms, exposing a new attack surface beyond conventional output-based leakage. Based on this insight, we propose the Reminiscence Attack (ReA), which amplifies the correlation between residuals and membership privacy through targeted fine-tuning processes. ReA achieves up to 1.90x and 1.12x higher accuracy than prior attacks when inferring class-wise and sample-wise membership, respectively. To mitigate such residual-induced privacy risk, we develop a dual-phase approximate unlearning framework that first eliminates deep-layer unlearned data traces and then enforces convergence stability to prevent models from "pseudo-convergence", where their outputs are similar to retrained models but still preserve unlearned residuals. Our framework works for both classification and generation tasks. Experimental evaluations confirm that our approach maintains high unlearning efficacy, while reducing the adaptive privacy attack accuracy to nearly random guess, at the computational cost of 2-12% of full retraining from scratch.
中文: 本研究揭示近似机器遗忘算法因残留隐性痕迹而无法充分保护已删除数据的隐私,提出了利用此漏洞的回忆攻击,并开发了一种双阶段遗忘框架,能在保持高效的同时将隐私攻击准确率降至接近随机猜测水平。
English: This study exposes that approximate machine unlearning algorithms inadequately protect unlearned data privacy due to persistent implicit residuals, proposes the Reminiscence Attack exploiting these vulnerabilities, and introduces a dual-phase unlearning framework that effectively mitigates privacy risks while maintaining efficiency.
Authors:Seonghoon Yoo, Houssem Sifaou, Sangwoo Park, Joonhyuk Kang, Osvaldo Simeone
Abstract:
Calibration data are necessary to formally quantify the uncertainty of the decisions produced by an existing artificial intelligence (AI) model. To overcome the common issue of scarce calibration data, a promising approach is to employ synthetic labels produced by a (generally different) predictive model. However, fine-tuning the label-generating predictor on the inference task of interest, as well as estimating the residual bias of the synthetic labels, demand additional data, potentially exacerbating the calibration data scarcity problem. This paper introduces a novel approach that efficiently utilizes limited calibration data to simultaneously fine-tune a predictor and estimate the bias of the synthetic labels. The proposed method yields prediction sets with rigorous coverage guarantees for AI-generated decisions. Experimental results on an indoor localization problem validate the effectiveness and performance gains of our solution.
中文摘要:本文提出一种新方法,能高效利用有限校准数据同时微调预测器并估计合成标签偏差,通过室内定位实验验证了所生成人工智能决策预测集具有严格覆盖保证的有效性。
English Summary: This paper introduces a novel method that efficiently uses limited calibration data to simultaneously fine-tune a predictor and estimate synthetic label bias, producing AI decision prediction sets with rigorous coverage guarantees validated through indoor localization experiments.
Authors:Muhammad Zaeem Shahzad, Muhammad Abdullah Hanif, Bassem Ouni, Muhammad Shafique
Abstract:
Advanced Driver Assistance Systems (ADAS) significantly enhance road safety by detecting potential collisions and alerting drivers. However, their reliance on expensive sensor technologies such as LiDAR and radar limits accessibility, particularly in low- and middle-income countries. Machine learning-based ADAS (ML-ADAS), leveraging deep neural networks (DNNs) with only standard camera input, offers a cost-effective alternative. Critical to ML-ADAS is the collision avoidance feature, which requires the ability to detect objects and estimate their distances accurately. This is achieved with specialized DNNs like YOLO, which provides real-time object detection, and a lightweight, detection-wise distance estimation approach that relies on key features extracted from the detections like bounding box dimensions and size. However, the robustness of these systems is undermined by security vulnerabilities in object detectors. In this paper, we introduce ShrinkBox, a novel backdoor attack targeting object detection in collision avoidance ML-ADAS. Unlike existing attacks that manipulate object class labels or presence, ShrinkBox subtly shrinks ground truth bounding boxes. This attack remains undetected in dataset inspections and standard benchmarks while severely disrupting downstream distance estimation. We demonstrate that ShrinkBox can be realized in the YOLOv9m object detector at an Attack Success Rate (ASR) of 96%, with only a 4% poisoning ratio in the training instances of the KITTI dataset. Furthermore, given the low error targets introduced in our relaxed poisoning strategy, we find that ShrinkBox increases the Mean Absolute Error (MAE) in downstream distance estimation by more than 3x on poisoned samples, potentially resulting in delays or prevention of collision warnings altogether.
中文: 基于机器学习的先进驾驶辅助系统提供了经济的安全方案,但易受ShrinkBox这种新型后门攻击,该攻击通过微妙缩小检测框来严重破坏距离估算和防撞功能,且难以被察觉。
English: Advanced Driver Assistance Systems using machine learning offer a cost-effective safety solution but are vulnerable to ShrinkBox, a novel backdoor attack that subtly shrinks object detection bounding boxes, severely impairing distance estimation and collision avoidance without detection.
Authors:Liwen Liu, Weidong Yang, Lipeng Ma, Ben Fei
Abstract:
Recent advances in multi-modal pre-training methods have shown promising effectiveness in learning 3D representations by aligning multi-modal features between 3D shapes and their corresponding 2D counterparts. However, existing multi-modal pre-training frameworks primarily rely on a single pre-training task to gather multi-modal data in 3D applications. This limitation prevents the models from obtaining the abundant information provided by other relevant tasks, which can hinder their performance in downstream tasks, particularly in complex and diverse domains. In order to tackle this issue, we propose MMPT, a Multi-modal Multi-task Pre-training framework designed to enhance point cloud understanding. Specifically, three pre-training tasks are devised: (i) Token-level reconstruction (TLR) aims to recover masked point tokens, endowing the model with representative learning abilities. (ii) Point-level reconstruction (PLR) is integrated to predict the masked point positions directly, and the reconstructed point cloud can be considered as a transformed point cloud used in the subsequent task. (iii) Multi-modal contrastive learning (MCL) combines feature correspondences within and across modalities, thus assembling a rich learning signal from both 3D point cloud and 2D image modalities in a self-supervised manner. Moreover, this framework operates without requiring any 3D annotations, making it scalable for use with large datasets. The trained encoder can be effectively transferred to various downstream tasks. To demonstrate its effectiveness, we evaluated its performance compared to state-of-the-art methods in various discriminant and generative applications under widely-used benchmarks.
中文: 提出的MMPT框架通过多模态多任务预训练,结合三项自监督学习任务提升3D点云理解能力,无需3D标注即可实现优越性能。
English: The proposed MMPT framework enhances 3D point cloud understanding through multi-modal multi-task pre-training with three self-supervised tasks, achieving superior performance without requiring 3D annotations.
Authors:Ang Li, Charles Wang, Kaiyu Yue, Zikui Cai, Ollie Liu, Deqing Fu, Peng Guo, Wang Bill Zhu, Vatsal Sharan, Robin Jia, Willie Neiswanger, Furong Huang, Tom Goldstein, Micah Goldblum
Abstract:
Humans often use visual aids, for example diagrams or sketches, when solving complex problems. Training multimodal models to do the same, known as Visual Chain of Thought (Visual CoT), is challenging due to: (1) poor off-the-shelf visual CoT performance, which hinders reinforcement learning, and (2) the lack of high-quality visual CoT training data. We introduce $\textbf{Zebra-CoT}$, a diverse large-scale dataset with 182,384 samples, containing logically coherent interleaved text-image reasoning traces. We focus on four categories of tasks where sketching or visual reasoning is especially natural, spanning scientific questions such as geometry, physics, and algorithms; 2D visual reasoning tasks like visual search and jigsaw puzzles; 3D reasoning tasks including 3D multi-hop inference, embodied and robot planning; visual logic problems and strategic games like chess. Fine-tuning the Anole-7B model on the Zebra-CoT training corpus results in an improvement of +12% in our test-set accuracy and yields up to +13% performance gain on standard VLM benchmark evaluations. Fine-tuning Bagel-7B yields a model that generates high-quality interleaved visual reasoning chains, underscoring Zebra-CoT's effectiveness for developing multimodal reasoning abilities. We open-source our dataset and models to support development and evaluation of visual CoT.
Chinese Summary: 该研究提出了Zebra-CoT大规模数据集,通过提供多样化的视觉-文本推理链来增强多模态推理能力,在Anole-7B和Bagel-7B等模型上进行微调后,显著提升了复杂任务上的性能表现。
English Summary: The study introduces Zebra-CoT, a large-scale dataset designed to enhance multimodal reasoning by providing diverse visual-textual reasoning chains, which significantly improves model performance on complex tasks when fine-tuned on models like Anole-7B and Bagel-7B.
Authors:Ang Li, Charles Wang, Deqing Fu, Kaiyu Yue, Zikui Cai, Wang Bill Zhu, Ollie Liu, Peng Guo, Willie Neiswanger, Furong Huang, Tom Goldstein, Micah Goldblum
Abstract:
Humans often use visual aids, for example diagrams or sketches, when solving complex problems. Training multimodal models to do the same, known as Visual Chain of Thought (Visual CoT), is challenging due to: (1) poor off-the-shelf visual CoT performance, which hinders reinforcement learning, and (2) the lack of high-quality visual CoT training data. We introduce $\textbf{Zebra-CoT}$, a diverse large-scale dataset with 182,384 samples, containing logically coherent interleaved text-image reasoning traces. We focus on four categories of tasks where sketching or visual reasoning is especially natural, spanning scientific questions such as geometry, physics, and algorithms; 2D visual reasoning tasks like visual search and jigsaw puzzles; 3D reasoning tasks including 3D multi-hop inference, embodied and robot planning; visual logic problems and strategic games like chess. Fine-tuning the Anole-7B model on the Zebra-CoT training corpus results in an improvement of +12% in our test-set accuracy and yields up to +13% performance gain on standard VLM benchmark evaluations. Fine-tuning Bagel-7B yields a model that generates high-quality interleaved visual reasoning chains, underscoring Zebra-CoT's effectiveness for developing multimodal reasoning abilities. We open-source our dataset and models to support development and evaluation of visual CoT.
Chinese Summary: 该研究提出了Zebra-CoT大规模数据集,通过提供多样化的视觉-文本推理链来增强多模态推理能力,在Anole-7B和Bagel-7B等模型上进行微调后,显著提升了复杂任务上的性能表现。
English Summary: The study introduces Zebra-CoT, a large-scale dataset designed to enhance multimodal reasoning by providing diverse visual-textual reasoning chains, which significantly improves model performance on complex tasks when fine-tuned on models like Anole-7B and Bagel-7B.
Authors:Shuo Chen, Jianzhe Liu, Zhen Han, Yan Xia, Daniel Cremers, Philip Torr, Volker Tresp, Jindong Gu
Abstract:
Multimodal Large Language Models (MLLMs), built on powerful language backbones, have enabled Multimodal In-Context Learning (MICL)-adapting to new tasks from a few multimodal demonstrations consisting of images, questions, and answers. Despite showing noticeable improvement on standard vision-language datasets, current MLLMs struggle to leverage visual information in the demonstrations. Specifically, they tend to neglect visual cues and over-rely on textual patterns, leading to mere text imitation rather than genuine multimodal adaptation. This behavior makes MICL still unimodal and largely restricts its practical utility. More importantly, this limitation is often concealed by the improved performance on tasks that do not require understanding the visual context. As a result, how to effectively enhance MICL ability and reliably evaluate the MICL performance remains underexplored. To address these issues, we first introduce Dynamic Attention Reallocation (DARA), an efficient fine-tuning strategy that encourages models to attend to the visual context by rebalancing attention across visual and textual tokens. In addition, we present TrueMICL, an MICL-dedicated dataset with both support and test sets that explicitly requires the integration of multimodal information-particularly visual content-for correct task completion. Extensive experiments demonstrate the effectiveness of our holistic solution, showcasing substantial improvements in the true multimodal in-context learning capabilities. Code and datasets are available at https://chenxshuo.github.io/true-micl-colm .
中文:当前多模态大语言模型在上下文学习中难以有效利用视觉信息,导致仅依赖文本模仿,因此本文提出DARA微调方法和TrueMICL数据集来提升真正的多模态学习能力。
English: Current Multimodal Large Language Models struggle to effectively utilize visual information in in-context learning, leading to text-only adaptation, so this paper introduces DARA fine-tuning and TrueMICL dataset to enhance genuine multimodal capabilities.
Authors:Danyang Zhang, Junhao Song, Ziqian Bi, Yingfang Yuan, Tianyang Wang, Joe Yeong, Junfeng Hao
Abstract:
This paper presents a comprehensive review of the Mixture-of-Experts (MoE) architecture in large language models, highlighting its ability to significantly enhance model performance while maintaining minimal computational overhead. Through a systematic analysis spanning theoretical foundations, core architectural designs, and large language model (LLM) applications, we examine expert gating and routing mechanisms, hierarchical and sparse MoE configurations, meta-learning approaches, multimodal and multitask learning scenarios, real-world deployment cases, and recent advances and challenges in deep learning. Our analysis identifies key advantages of MoE, including superior model capacity compared to equivalent Bayesian approaches, improved task-specific performance, and the ability to scale model capacity efficiently. We also underscore the importance of ensuring expert diversity, accurate calibration, and reliable inference aggregation, as these are essential for maximizing the effectiveness of MoE architectures. Finally, this review outlines current research limitations, open challenges, and promising future directions, providing a foundation for continued innovation in MoE architecture and its applications.
本文综述了大型语言模型中混合专家(MoE)架构,强调其在低计算成本下提升性能的优势,同时探讨了关键机制、应用场景、主要优势及未来挑战。
This paper reviews the Mixture-of-Experts (MoE) architecture in large language models, emphasizing its enhanced performance with low computational cost, while addressing key mechanisms, applications, advantages, and future challenges.
Authors:Ping Xu, Pengfei Wang, Zhiyuan Ning, Meng Xiao, Min Wu, Yuanchun Zhou
Abstract:
Clustering analysis is fundamental in single-cell RNA sequencing (scRNA-seq) data analysis for elucidating cellular heterogeneity and diversity. Recent graph-based scRNA-seq clustering methods, particularly graph neural networks (GNNs), have significantly improved in tackling the challenges of high-dimension, high-sparsity, and frequent dropout events that lead to ambiguous cell population boundaries. However, their reliance on hard graph constructions derived from thresholded similarity matrices presents challenges:(i) The simplification of intercellular relationships into binary edges (0 or 1) by applying thresholds, which restricts the capture of continuous similarity features among cells and leads to significant information loss.(ii) The presence of significant inter-cluster connections within hard graphs, which can confuse GNN methods that rely heavily on graph structures, potentially causing erroneous message propagation and biased clustering outcomes. To tackle these challenges, we introduce scSGC, a Soft Graph Clustering for single-cell RNA sequencing data, which aims to more accurately characterize continuous similarities among cells through non-binary edge weights, thereby mitigating the limitations of rigid data structures. The scSGC framework comprises three core components: (i) a zero-inflated negative binomial (ZINB)-based feature autoencoder; (ii) a dual-channel cut-informed soft graph embedding module; and (iii) an optimal transport-based clustering optimization module. Extensive experiments across ten datasets demonstrate that scSGC outperforms 13 state-of-the-art clustering models in clustering accuracy, cell type annotation, and computational efficiency. These results highlight its substantial potential to advance scRNA-seq data analysis and deepen our understanding of cellular heterogeneity.
中文:scSGC框架通过引入软图聚类方法,利用连续边权重克服硬图构建的局限,在多个数据集中显著提升了单细胞RNA测序数据的聚类精度与效率。
English: The scSGC framework introduces soft graph clustering for scRNA-seq data to overcome the limitations of hard graph constructions by using continuous edge weights, significantly improving clustering accuracy and efficiency across multiple datasets.
Authors:Wei Chen, Wanyang Gu, Linjun Peng, Ruichu Cai, Zhifeng Hao, Kun Zhang
Abstract:
Federated causal discovery aims to uncover the causal relationships between entities while protecting data privacy, which has significant importance and numerous applications in real-world scenarios. Existing federated causal structure learning methods primarily focus on horizontal federated settings. However, in practical situations, different clients may not necessarily contain data on the same variables. In a single client, the incomplete set of variables can easily lead to spurious causal relationships, thereby affecting the information transmitted to other clients. To address this issue, we comprehensively consider causal structure learning methods under both horizontal and vertical federated settings. We provide the identification theories and methods for learning causal structure in the horizontal and vertical federal setting via higher-order cumulants. Specifically, we first aggregate higher-order cumulant information from all participating clients to construct global cumulant estimates. These global estimates are then used for recursive source identification, ultimately yielding a global causal strength matrix. Our approach not only enables the reconstruction of causal graphs but also facilitates the estimation of causal strength coefficients. Our algorithm demonstrates superior performance in experiments conducted on both synthetic data and real-world data.
中文: 本文提出了一种基于高阶累积量的联邦因果发现方法,能够在横向和纵向联邦场景下学习因果结构,实现准确的图重构和强度估计,同时保护数据隐私,并在实验中表现出优越性能。
English: This paper introduces a federated causal discovery method that utilizes higher-order cumulants to learn causal structures in both horizontal and vertical settings, enabling accurate graph reconstruction and strength estimation while maintaining data privacy and showing strong experimental performance.
Authors:Anshuk Uppal, Yuhta Takida, Chieh-Hsin Lai, Yuki Mitsufuji
Abstract:
Disentangled and interpretable latent representations in generative models typically come at the cost of generation quality. The $β$-VAE framework introduces a hyperparameter $β$ to balance disentanglement and reconstruction quality, where setting $β> 1$ introduces an information bottleneck that favors disentanglement over sharp, accurate reconstructions. To address this trade-off, we propose a novel generative modeling framework that leverages a range of $β$ values to learn multiple corresponding latent representations. First, we obtain a slew of representations by training a single variational autoencoder (VAE), with a new loss function that controls the information retained in each latent representation such that the higher $β$ value prioritize disentanglement over reconstruction fidelity. We then, introduce a non-linear diffusion model that smoothly transitions latent representations corresponding to different $β$ values. This model denoises towards less disentangled and more informative representations, ultimately leading to (almost) lossless representations, enabling sharp reconstructions. Furthermore, our model supports sample generation without input images, functioning as a standalone generative model. We evaluate our framework in terms of both disentanglement and generation quality. Additionally, we observe smooth transitions in the latent spaces with respect to changes in $β$, facilitating consistent manipulation of generated outputs.
中文: 该生成框架通过新型损失函数训练单一变分自编码器来学习平衡解耦与重建质量的多重潜在表示,并利用扩散模型在这些表示间平滑过渡,从而实现高保真生成和可控操作。
English: The proposed generative framework trains a single VAE with a novel loss function to learn multiple latent representations balancing disentanglement and reconstruction quality, then employs a diffusion model to transition between these representations for high-fidelity generation and manipulation.
Authors:Yonghyun Park, Chieh-Hsin Lai, Satoshi Hayakawa, Yuhta Takida, Naoki Murata, Wei-Hsiang Liao, Woosung Choi, Kin Wai Cheuk, Junghyun Koo, Yuki Mitsufuji
Abstract:
While diffusion models excel at image generation, their growing adoption raises critical concerns around copyright issues and model transparency. Existing attribution methods identify training examples influencing an entire image, but fall short in isolating contributions to specific elements, such as styles or objects, that matter most to stakeholders. To bridge this gap, we introduce \emph{concept-level attribution} via a novel method called \emph{Concept-TRAK}. Concept-TRAK extends influence functions with two key innovations: (1) a reformulated diffusion training loss based on diffusion posterior sampling, enabling robust, sample-specific attribution; and (2) a concept-aware reward function that emphasizes semantic relevance. We evaluate Concept-TRAK on the AbC benchmark, showing substantial improvements over prior methods. Through diverse case studies--ranging from identifying IP-protected and unsafe content to analyzing prompt engineering and compositional learning--we demonstrate how concept-level attribution yields actionable insights for responsible generative AI development and governance.
中文摘要:本文提出Concept-TRAK方法,通过概念级归因技术增强扩散模型透明度,能精准识别训练数据对图像特定元素的影响,在多项应用中显著优于现有方法。
English Summary: The paper introduces Concept-TRAK, a novel method for concept-level attribution in diffusion models that improves transparency by identifying training influences on specific image elements, outperforming prior approaches in various applications.
Authors:Jiayi Song, Zihan Ye, Qingyuan Zhou, Weidong Yang, Ben Fei, Jingyi Xu, Ying He, Wanli Ouyang
Abstract:
Accurately rendering scenes with reflective surfaces remains a significant challenge in novel view synthesis, as existing methods like Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) often misinterpret reflections as physical geometry, resulting in degraded reconstructions. Previous methods rely on incomplete and non-generalizable geometric constraints, leading to misalignment between the positions of Gaussian splats and the actual scene geometry. When dealing with real-world scenes containing complex geometry, the accumulation of Gaussians further exacerbates surface artifacts and results in blurred reconstructions. To address these limitations, in this work, we propose Ref-Unlock, a novel geometry-aware reflection modeling framework based on 3D Gaussian Splatting, which explicitly disentangles transmitted and reflected components to better capture complex reflections and enhance geometric consistency in real-world scenes. Our approach employs a dual-branch representation with high-order spherical harmonics to capture high-frequency reflective details, alongside a reflection removal module providing pseudo reflection-free supervision to guide clean decomposition. Additionally, we incorporate pseudo-depth maps and a geometry-aware bilateral smoothness constraint to enhance 3D geometric consistency and stability in decomposition. Extensive experiments demonstrate that Ref-Unlock significantly outperforms classical GS-based reflection methods and achieves competitive results with NeRF-based models, while enabling flexible vision foundation models (VFMs) driven reflection editing. Our method thus offers an efficient and generalizable solution for realistic rendering of reflective scenes. Our code is available at https://ref-unlock.github.io/.
中文摘要:Ref-Unlock提出了一种基于3D高斯泼溅的几何感知反射建模框架,通过显式分离透射和反射分量,有效提升复杂反射场景的重建质量并实现反射编辑功能。
English Summary: Ref-Unlock introduces a geometry-aware reflection modeling framework using 3D Gaussian Splatting that explicitly separates transmitted and reflected components, significantly improving reconstruction quality and enabling reflection editing in complex reflective scenes.
Authors:Yan Yang, Dongxu Li, Yutong Dai, Yuhao Yang, Ziyang Luo, Zirui Zhao, Zhiyuan Hu, Junzhe Huang, Amrita Saha, Zeyuan Chen, Ran Xu, Liyuan Pan, Caiming Xiong, Junnan Li
Abstract:
Graphical user interface (GUI) agents autonomously operate across platforms (e.g., Linux) to complete tasks by interacting with visual elements. Specifically, a user instruction is decomposed into a sequence of action proposals, each corresponding to an interaction with the GUI. After each action, the agent observes the updated GUI environment to plan the next step. However, two main challenges arise: i) resolving ambiguity in task planning (i.e., the action proposal sequence), where selecting an appropriate plan is non-trivial, as many valid ones may exist; ii) accurately grounding actions in complex and high-resolution interfaces, i.e., precisely interacting with visual targets.
This paper investigates the two aforementioned challenges with our GUI Test-time Scaling Agent, namely GTA1. First, to select the most appropriate action proposal, we introduce a test-time scaling method. At each step, we sample multiple candidate action proposals and leverage a judge model to evaluate and select the most suitable one. It trades off computation for better decision quality by concurrent sampling, shortening task execution steps, and improving overall performance. Second, we propose a model that achieves improved accuracy when grounding the selected action proposal to its corresponding visual elements. Our key insight is that reinforcement learning (RL) facilitates visual grounding through inherent objective alignments, rewarding successful clicks on interface elements.
Experimentally, our method establishes state-of-the-art performance across diverse benchmarks. For example, GTA1-7B achieves 50.1%, 92.4%, and 67.7% accuracies on Screenspot-Pro, Screenspot-V2, and OSWorld-G, respectively. When paired with a planner applying our test-time scaling strategy, it exhibits state-of-the-art agentic performance (e.g., 45.2% task success rate on OSWorld). We open-source our code and models here.
中文: 图形用户界面(GUI)代理通过将用户指令分解为动作序列并与视觉元素交互来自主完成任务,但面临任务规划模糊性和动作精准定位的挑战,本文通过测试时扩展策略选择最优动作,并利用强化学习实现精确的视觉目标交互来解决这些问题。
English: GUI agents autonomously perform tasks by decomposing user instructions into action sequences and interacting with visual elements, but face challenges in task planning ambiguity and accurate action grounding, which are addressed through test-time scaling for optimal action selection and reinforcement learning for precise visual targeting.
Authors:Rui Meng, Ziyan Jiang, Ye Liu, Mingyi Su, Xinyi Yang, Yuepeng Fu, Can Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, Yingbo Zhou, Wenhu Chen, Semih Yavuz
Abstract:
Multimodal embedding models have been crucial in enabling various downstream tasks such as semantic similarity, information retrieval, and clustering over different modalities. However, existing multimodal embeddings like VLM2Vec, E5-V, GME are predominantly focused on natural images, with limited support for other visual forms such as videos and visual documents. This restricts their applicability in real-world scenarios, including AI agents, multi-modal search and recommendation, and retrieval-augmented generation (RAG). To close this gap, we propose VLM2Vec-V2, a unified framework for learning embeddings across diverse visual forms. First, we introduce MMEB-V2, a comprehensive benchmark that extends MMEB with five new task types: visual document retrieval, video retrieval, temporal grounding, video classification and video question answering - spanning text, image, video, and visual document inputs. Next, we train VLM2Vec-V2, a general-purpose embedding model that supports text, image, video, and visual document inputs. Extensive experiments show that VLM2Vec-V2 achieves strong performance not only on the newly introduced video and document retrieval tasks, but also improves over prior baselines on the original image benchmarks. Through extensive evaluation, our study offers insights into the generalizability of various multimodal embedding models and highlights effective strategies for unified embedding learning, laying the groundwork for more scalable and adaptable representation learning in both research and real-world settings.
中文: 该研究提出了VLM2Vec-V2统一框架,将多模态嵌入学习扩展到视频和视觉文档等多样化视觉形式,在新旧基准测试中均展现优异性能,并为可扩展的表征学习提供了重要见解。
English: The study introduces VLM2Vec-V2, a unified framework that extends multimodal embedding learning to diverse visual forms like videos and documents, demonstrating strong performance across new and existing benchmarks while providing insights for scalable representation learning.
Authors:Meng Xiao, Junfeng Zhou, Yuanchun Zhou
Abstract:
Feature generation (FG) aims to enhance the prediction potential of original data by constructing high-order feature combinations and removing redundant features. It is a key preprocessing step for tabular scientific data to improve downstream machine-learning model performance. Traditional methods face the following two challenges when dealing with the feature generation of scientific data: First, the effective construction of high-order feature combinations in scientific data necessitates profound and extensive domain-specific expertise. Secondly, as the order of feature combinations increases, the search space expands exponentially, imposing prohibitive human labor consumption. Advancements in the Data-Centric Artificial Intelligence (DCAI) paradigm have opened novel avenues for automating feature generation processes. Inspired by that, this paper revisits the conventional feature generation workflow and proposes the Multi-agent Feature Generation (MAFG) framework. Specifically, in the iterative exploration stage, multi-agents will construct mathematical transformation equations collaboratively, synthesize and identify feature combinations ex-hibiting high information content, and leverage a reinforcement learning mechanism to evolve their strategies. Upon completing the exploration phase, MAFG integrates the large language models (LLMs) to interpreta-tively evaluate the generated features of each significant model performance breakthrough. Experimental results and case studies consistently demonstrate that the MAFG framework effectively automates the feature generation process and significantly enhances various downstream scientific data mining tasks.
中文摘要:MAFG框架通过多智能体协作与强化学习构建高阶特征组合,并利用大语言模型解释性评估关键突破,有效实现了科学数据特征生成的自动化,显著提升了下游任务性能。
English Summary: The MAFG framework automates feature generation for scientific data by using multi-agent collaboration and reinforcement learning to construct high-order feature combinations, with LLMs interpretatively evaluating breakthroughs, significantly enhancing downstream tasks.
Authors:Felix Friedrich, Thiemo Ganesha Welsch, Manuel Brack, Patrick Schramowski, Kristian Kersting
Abstract:
Current diversification strategies for text-to-image (T2I) models often ignore contextual appropriateness, leading to over-diversification where demographic attributes are modified even when explicitly specified in prompts. This paper introduces DIVBENCH, a benchmark and evaluation framework for measuring both under- and over-diversification in T2I generation. Through systematic evaluation of state-of-the-art T2I models, we find that while most models exhibit limited diversity, many diversification approaches overcorrect by inappropriately altering contextually-specified attributes. We demonstrate that context-aware methods, particularly LLM-guided FairDiffusion and prompt rewriting, can already effectively address under-diversity while avoiding over-diversification, achieving a better balance between representation and semantic fidelity.
中文: 当前文本到图像模型常无法平衡多样性与上下文准确性,要么多样性不足,要么过度修正而改变指定属性,但上下文感知方法如LLM引导的FairDiffusion能有效解决这两个问题,提升表征能力和语义保真度。
English: Current text-to-image models often fail to balance diversity with contextual accuracy, either under-diversifying or overcorrecting by altering specified attributes, but context-aware methods like LLM-guided FairDiffusion can effectively address both issues to improve representation and fidelity.
Authors:Yen-Tung Yeh, Junghyun Koo, Marco A. MartÃnez-RamÃrez, Wei-Hsiang Liao, Yi-Hsuan Yang, Yuki Mitsufuji
Abstract:
General-purpose audio representations have proven effective across diverse music information retrieval applications, yet their utility in intelligent music production remains limited by insufficient understanding of audio effects (Fx). Although previous approaches have emphasized audio effects analysis at the mixture level, this focus falls short for tasks demanding instrument-wise audio effects understanding, such as automatic mixing. In this work, we present Fx-Encoder++, a novel model designed to extract instrument-wise audio effects representations from music mixtures. Our approach leverages a contrastive learning framework and introduces an "extractor" mechanism that, when provided with instrument queries (audio or text), transforms mixture-level audio effects embeddings into instrument-wise audio effects embeddings. We evaluated our model across retrieval and audio effects parameter matching tasks, testing its performance across a diverse range of instruments. The results demonstrate that Fx-Encoder++ outperforms previous approaches at mixture level and show a novel ability to extract effects representation instrument-wise, addressing a critical capability gap in intelligent music production systems.
中文摘要:Fx-Encoder++通过对比学习框架和提取器机制,实现了从音乐混音中提取乐器级音频效果表征的创新方法,在多项任务中超越原有混合级分析模型,填补了智能音乐制作系统的关键能力空白。
English Summary: Fx-Encoder++ introduces a novel model using contrastive learning and an extractor mechanism to derive instrument-specific audio effects representations from music mixtures, outperforming previous mixture-level approaches and enabling intelligent music production capabilities.
Authors:Zehua Fu, Chenguang Liu, Yuyu Chen, Jiaqi Zhou, Qingjie Liu, Yunhong Wang
Abstract:
Despite its significant success, object detection in traffic and transportation scenarios requires time-consuming and laborious efforts in acquiring high-quality labeled data. Therefore, Unsupervised Domain Adaptation (UDA) for object detection has recently gained increasing research attention. UDA for object detection has been dominated by domain alignment methods, which achieve top performance. Recently, self-labeling methods have gained popularity due to their simplicity and efficiency. In this paper, we investigate the limitations that prevent self-labeling detectors from achieving commensurate performance with domain alignment methods. Specifically, we identify the high proportion of simple samples during training, i.e., the simple-label bias, as the central cause. We propose a novel approach called De-Simplifying Pseudo Labels (DeSimPL) to mitigate the issue. DeSimPL utilizes an instance-level memory bank to implement an innovative pseudo label updating strategy. Then, adversarial samples are introduced during training to enhance the proportion. Furthermore, we propose an adaptive weighted loss to avoid the model suffering from an abundance of false positive pseudo labels in the late training period. Experimental results demonstrate that DeSimPL effectively reduces the proportion of simple samples during training, leading to a significant performance improvement for self-labeling detectors. Extensive experiments conducted on four benchmarks validate our analysis and conclusions.
中文: 本文发现简单标签偏差是自标注目标检测方法的主要局限,并提出DeSimPL方法,通过实例级记忆库、对抗样本和自适应加权损失来缓解该问题,在多个基准测试中显著提升了性能。
English: This paper identifies the simple-label bias as the key limitation in self-labeling object detection methods and proposes DeSimPL, a novel approach that mitigates this issue through an instance-level memory bank, adversarial samples, and adaptive weighted loss, significantly improving performance on benchmarks.
Authors:Han Jiang, Pengda Wang, Xiaoyuan Yi, Xing Xie, Ziang Xiao
Abstract:
Social sciences have accumulated a rich body of theories and methodologies for investigating the human mind and behaviors, while offering valuable insights into the design and understanding of Artificial Intelligence (AI) systems. Focusing on psychology as a prominent case, this study explores the interdisciplinary synergy between AI and the field by analyzing 1,006 LLM-related papers published in premier AI venues between 2023 and 2025, along with the 2,544 psychology publications they cite. Through our analysis, we identify key patterns of interdisciplinary integration, locate the psychology domains most frequently referenced, and highlight areas that remain underexplored. We further examine how psychology theories/frameworks are operationalized and interpreted, identify common types of misapplication, and offer guidance for more effective incorporation. Our work provides a comprehensive map of interdisciplinary engagement between AI and psychology, thereby facilitating deeper collaboration and advancing AI systems.
中文: 本研究通过分析1006篇人工智能论文及其引用的2544篇心理学文献,揭示了AI与心理学的跨学科融合模式、理论应用方法及待深化领域,为促进更有效合作与AI系统发展提供了全面路线图。
English: This study analyzes interdisciplinary integration between AI and psychology by examining 1,006 AI papers and their 2,544 psychology references, identifying key collaboration patterns, operationalization methods, and areas for improved synergy to advance AI systems.
Authors:Jiaxin Liu, Qichao Ying, Zhenxing Qian, Sheng Li, Runqi Zhang, Jian Liu, Xinpeng Zhang
Abstract:
The widespread use of face retouching on social media platforms raises concerns about the authenticity of face images. While existing methods focus on detecting face retouching, how to accurately recover the original faces from the retouched ones has yet to be answered. This paper introduces Face Retouching Restoration (FRR), a novel computer vision task aimed at restoring original faces from their retouched counterparts. FRR differs from traditional image restoration tasks by addressing the complex retouching operations with various types and degrees, which focuses more on the restoration of the low-frequency information of the faces. To tackle this challenge, we propose MoFRR, Mixture of Diffusion Models for FRR. Inspired by DeepSeek's expert isolation strategy, the MoFRR uses sparse activation of specialized experts handling distinct retouching types and the engagement of a shared expert dealing with universal retouching traces. Each specialized expert follows a dual-branch structure with a DDIM-based low-frequency branch guided by an Iterative Distortion Evaluation Module (IDEM) and a Cross-Attention-based High-Frequency branch (HFCAM) for detail refinement. Extensive experiments on a newly constructed face retouching dataset, RetouchingFFHQ++, demonstrate the effectiveness of MoFRR for FRR.
中文摘要:本文提出面部修图恢复(FRR)作为新的计算机视觉任务,旨在从修图后的人脸恢复原始图像,并设计了MoFRR模型——采用混合专家扩散模型,通过双分支结构和新建数据集验证了其有效性。
English Summary: This paper introduces Face Retouching Restoration (FRR) as a novel computer vision task to recover original faces from retouched images, proposing MoFRR—a mixture of diffusion models with specialized and shared experts that achieves effective restoration through dual-branch processing.
Authors:Shuai Yang, Hao Li, Yilun Chen, Bin Wang, Yang Tian, Tai Wang, Hanqing Wang, Feng Zhao, Yiyi Liao, Jiangmiao Pang
Abstract:
To operate effectively in the real world, robots must integrate multimodal reasoning with precise action generation. However, existing vision-language-action (VLA) models often sacrifice one for the other, narrow their abilities to task-specific manipulation data, and suffer catastrophic forgetting of pre-trained vision-language capabilities. To bridge this gap, we introduce InstructVLA, an end-to-end VLA model that preserves the flexible reasoning of large vision-language models (VLMs) while delivering leading manipulation performance. InstructVLA introduces a novel training paradigm, Vision-Language-Action Instruction Tuning (VLA-IT), which employs multimodal training with mixture-of-experts adaptation to jointly optimize textual reasoning and action generation on both standard VLM corpora and a curated 650K-sample VLA-IT dataset. On in-domain SimplerEnv tasks, InstructVLA achieves 30.5% improvement over SpatialVLA. To evaluate generalization, we introduce SimplerEnv-Instruct, an 80-task benchmark requiring closed-loop control and high-level instruction understanding, where it outperforms a fine-tuned OpenVLA by 92% and an action expert aided by GPT-4o by 29%. Additionally, InstructVLA surpasses baseline VLMs on multimodal tasks and exhibits inference-time scaling by leveraging textual reasoning to boost manipulation performance in both simulated and real-world settings. These results demonstrate InstructVLA's potential for bridging intuitive and steerable human-robot interaction with efficient policy learning.
中文: InstructVLA是一种端到端的视觉-语言-动作模型,通过创新的训练范式在保持灵活推理能力的同时实现了领先的操作性能,在专业任务和泛化基准测试中均展现出显著优势。
English: InstructVLA is an end-to-end vision-language-action model that preserves flexible reasoning while achieving leading manipulation performance through a novel training paradigm, demonstrating significant improvements in both specialized tasks and generalization benchmarks.
Authors:Han Jiang, Dongyao Zhu, Zhihua Wei, Xiaoyuan Yi, Ziang Xiao, Xing Xie
Abstract:
In-Context Learning has shown great potential for aligning Large Language Models (LLMs) with human values, helping reduce harmful outputs and accommodate diverse preferences without costly post-training, known as In-Context Alignment (ICA). However, LLMs' comprehension of input prompts remains agnostic, limiting ICA's ability to address value tensions--human values are inherently pluralistic, often imposing conflicting demands, e.g., stimulation vs. tradition. Current ICA methods therefore face the Instruction Bottleneck challenge, where LLMs struggle to reconcile multiple intended values within a single prompt, leading to incomplete or biased alignment. To address this, we propose PICACO, a novel pluralistic ICA method. Without fine-tuning, PICACO optimizes a meta-instruction that navigates multiple values to better elicit LLMs' understanding of them and improve their alignment. This is achieved by maximizing the total correlation between specified values and LLM responses, theoretically reinforcing value correlation while reducing distractive noise, resulting in effective value instructions. Extensive experiments on five value sets show that PICACO works well with both black-box and open-source LLMs, outperforms several recent strong baselines, and achieves a better balance across up to 8 distinct values.
中文摘要:情境对齐方法在处理价值冲突时存在局限,而提出的PICACO方法通过优化元指令,无需微调即可使大语言模型更好地理解并协调多种人类价值观,在多个价值集上实现了更优的对齐效果。
English Summary: In-Context Alignment (ICA) faces limitations in addressing value conflicts, but the proposed PICACO method overcomes this by optimizing meta-instructions to better align LLMs with multiple human values without fine-tuning, achieving superior performance across diverse value sets.
Authors:Yujin Han, Hao Chen, Andi Han, Zhiheng Wang, Xinyu Liu, Yingya Zhang, Shiwei Zhang, Difan Zou
Abstract:
Although unified MLLMs aim to unify generation and understanding, they are considered to exhibit an internal gap, with understanding outperforming generation. Through large-scale evaluation across multiple MLLMs and tasks, we confirm the widespread non-unification of MLLMs, and demonstrate that it indeed stems from weak generation rather than misunderstanding. This finding motivates us to propose a simple yet effective internal gap-based self-improvement framework, which mitigates internal gaps by leveraging stronger understanding to guide weaker generation without relying on any external signals. We validate this strategy through comprehensive experiments: scoring generations with understanding to construct image data for post-training (e.g., SFT and DPO) significantly improves generation while promoting unification. Furthermore, we empirically discover a co-improvement effect of such self-improvement, a phenomenon well known in pre-training but underexplored in post-training. Specifically, as generation improves, understanding becomes more effective at detecting false positives that were previously misclassified as prompt-aligned. To explain this effect, we extend learning dynamic theory to the MLLM setting, showing that the shared empirical neural tangent kernel between generation and understanding encourages aligned learning dynamics, thereby driving co-improvement. This interplay between generation and understanding further motivates a curriculum learning approach for stronger self-improvement: progressively enhanced understanding and generation revisit samples underutilized by pre-trained MLLMs, dynamically expanding post-training data and leading to improved performance and unification.
中文摘要:本研究揭示了多模态大语言模型中理解能力优于生成能力的内在差距,并提出一种自改进框架,利用更强的理解引导较弱的生成,实验验证该方法能有效提升性能并促进模型统一。
English Summary: This study identifies a performance gap in multimodal large language models where understanding surpasses generation, and introduces a self-improvement framework that uses stronger understanding to guide weaker generation, validated through experiments showing enhanced performance and unification.
Authors:Sasindu Wijeratne, Rajgopal Kannan, Viktor Prasanna
Abstract:
Matricized Tensor Times Khatri-Rao Product (MTTKRP) is the computational bottleneck in sparse tensor decomposition. As real-world sparse tensors grow to billions of nonzeros, they increasingly demand higher memory capacity and compute throughput from hardware accelerators. In this work, we present AMPED, a multi-GPU parallel algorithm designed to accelerate MTTKRP on billion-scale sparse tensors. AMPED scales beyond the limits of a single GPU, meeting both the memory and performance requirements of large-scale workloads. We introduce a partitioning strategy combined with a dynamic load balancing scheme to distribute computation and minimize GPU idle time. On real-world billion-scale tensors, AMPED achieves a 5.1x geometric mean speedup in total execution time over state-of-the-art GPU baselines using 4 GPUs on a single CPU node.
中文: AMPED是一种多GPU并行算法,通过分区策略和动态负载均衡加速十亿级稀疏张量的MTTKRP计算,相比现有GPU方法实现了5.1倍的性能提升。
English: AMPED is a multi-GPU parallel algorithm that accelerates MTTKRP for billion-scale sparse tensors by employing partitioning and dynamic load balancing, achieving a 5.1x speedup over leading GPU methods.
Authors:Shiyi Mu, Yongkang Liu, Shi Feng, Xiaocui Yang, Daling Wang, Yifei Zhang
Abstract:
Although large language models (LLMs) excel in text comprehension and generation, their performance on the Emotion-Cause Pair Extraction (ECPE) task, which requires reasoning ability, is often underperform smaller language model. The main reason is the lack of auxiliary knowledge, which limits LLMs' ability to effectively perceive emotions and reason causes. To address this issue, we propose a novel \textbf{M}ulti-source h\textbf{E}terogeneous \textbf{K}nowledge \textbf{i}njection me\textbf{T}hod, MEKiT, which integrates heterogeneous internal emotional knowledge and external causal knowledge. Specifically, for these two distinct aspects and structures of knowledge, we apply the approaches of incorporating instruction templates and mixing data for instruction-tuning, which respectively facilitate LLMs in more comprehensively identifying emotion and accurately reasoning causes. Experimental results demonstrate that MEKiT provides a more effective and adaptable solution for the ECPE task, exhibiting an absolute performance advantage over compared baselines and dramatically improving the performance of LLMs on the ECPE task.
中文摘要:大语言模型在情感-原因对提取任务中表现欠佳,但通过融合内部情感知识与外部因果知识的MEKiT方法,借助指令模板和数据混合技术显著提升了其识别与推理能力。
English Summary: Large language models underperform in Emotion-Cause Pair Extraction due to knowledge gaps, but the proposed MEKiT method enhances their reasoning by integrating emotional and causal knowledge through instruction tuning.
Authors:Liuyi Wang, Xinyuan Xia, Hui Zhao, Hanqing Wang, Tai Wang, Yilun Chen, Chengju Liu, Qijun Chen, Jiangmiao Pang
Abstract:
Recent Vision-and-Language Navigation (VLN) advancements are promising, but their idealized assumptions about robot movement and control fail to reflect physically embodied deployment challenges. To bridge this gap, we introduce VLN-PE, a physically realistic VLN platform supporting humanoid, quadruped, and wheeled robots. For the first time, we systematically evaluate several ego-centric VLN methods in physical robotic settings across different technical pipelines, including classification models for single-step discrete action prediction, a diffusion model for dense waypoint prediction, and a train-free, map-based large language model (LLM) integrated with path planning. Our results reveal significant performance degradation due to limited robot observation space, environmental lighting variations, and physical challenges like collisions and falls. This also exposes locomotion constraints for legged robots in complex environments. VLN-PE is highly extensible, allowing seamless integration of new scenes beyond MP3D, thereby enabling more comprehensive VLN evaluation. Despite the weak generalization of current models in physical deployment, VLN-PE provides a new pathway for improving cross-embodiment's overall adaptability. We hope our findings and tools inspire the community to rethink VLN limitations and advance robust, practical VLN models. The code is available at https://crystalsixone.github.io/vln_pe.github.io/.
中文: 针对当前视觉语言导航在物理部署中的理想化假设,VLN-PE平台首次系统评估了多类机器人导航方法,揭示了观测受限、光照变化和物理碰撞导致的性能下降问题,为提升跨载体适应性开辟了新途径。
English: Recent VLN advancements overlook physical deployment challenges, so VLN-PE introduces a realistic platform for humanoid, quadruped, and wheeled robots, revealing performance issues due to limited observations, lighting variations, and physical constraints while offering a pathway for improved adaptability.
Authors:Liuyi Wang, Xinyuan Xia, Hui Zhao, Hanqing Wang, Tai Wang, Yilun Chen, Chengju Liu, Qijun Chen, Jiangmiao Pang
Abstract:
Recent Vision-and-Language Navigation (VLN) advancements are promising, but their idealized assumptions about robot movement and control fail to reflect physically embodied deployment challenges. To bridge this gap, we introduce VLN-PE, a physically realistic VLN platform supporting humanoid, quadruped, and wheeled robots. For the first time, we systematically evaluate several ego-centric VLN methods in physical robotic settings across different technical pipelines, including classification models for single-step discrete action prediction, a diffusion model for dense waypoint prediction, and a train-free, map-based large language model (LLM) integrated with path planning. Our results reveal significant performance degradation due to limited robot observation space, environmental lighting variations, and physical challenges like collisions and falls. This also exposes locomotion constraints for legged robots in complex environments. VLN-PE is highly extensible, allowing seamless integration of new scenes beyond MP3D, thereby enabling more comprehensive VLN evaluation. Despite the weak generalization of current models in physical deployment, VLN-PE provides a new pathway for improving cross-embodiment's overall adaptability. We hope our findings and tools inspire the community to rethink VLN limitations and advance robust, practical VLN models. The code is available at https://crystalsixone.github.io/vln_pe.github.io/.
中文: 针对当前视觉语言导航在物理部署中的理想化假设,VLN-PE平台首次系统评估了多类机器人导航方法,揭示了观测受限、光照变化和物理碰撞导致的性能下降问题,为提升跨载体适应性开辟了新途径。
English: Recent VLN advancements overlook physical deployment challenges, so VLN-PE introduces a realistic platform for humanoid, quadruped, and wheeled robots, revealing performance issues due to limited observations, lighting variations, and physical constraints while offering a pathway for improved adaptability.
Authors:Yifan Xu, Chao Zhang, Hanqi Jiang, Xiaoyan Wang, Ruifei Ma, Yiwei Li, Zihao Wu, Zeju Li, Xiangde Liu
Abstract:
Advancements in foundation models have made it possible to conduct applications in various downstream tasks. Especially, the new era has witnessed a remarkable capability to extend Large Language Models (LLMs) for tackling tasks of 3D scene understanding. Current methods rely heavily on 3D point clouds, but the 3D point cloud reconstruction of an indoor scene often results in information loss. Some textureless planes or repetitive patterns are prone to omission and manifest as voids within the reconstructed 3D point clouds. Besides, objects with complex structures tend to introduce distortion of details caused by misalignments between the captured images and the dense reconstructed point clouds. 2D multi-view images present visual consistency with 3D point clouds and provide more detailed representations of scene components, which can naturally compensate for these deficiencies. Based on these insights, we propose Argus, a novel 3D multimodal framework that leverages multi-view images for enhanced 3D scene understanding with LLMs. In general, Argus can be treated as a 3D Large Multimodal Foundation Model (3D-LMM) since it takes various modalities as input(text instructions, 2D multi-view images, and 3D point clouds) and expands the capability of LLMs to tackle 3D tasks. Argus involves fusing and integrating multi-view images and camera poses into view-as-scene features, which interact with the 3D features to create comprehensive and detailed 3D-aware scene embeddings. Our approach compensates for the information loss while reconstructing 3D point clouds and helps LLMs better understand the 3D world. Extensive experiments demonstrate that our method outperforms existing 3D-LMMs in various downstream tasks.
中文:提出的Argus框架通过融合多视角图像与3D点云来增强三维场景理解,弥补重建过程中的信息缺失,使大语言模型能更有效地处理三维任务。
English: The proposed Argus framework enhances 3D scene understanding by integrating multi-view images with 3D point clouds, compensating for reconstruction losses and enabling Large Language Models to handle 3D tasks more effectively.
Authors:Kazuki Shimada, Archontis Politis, Iran R. Roman, Parthasaarathy Sudarsanam, David Diaz-Guerra, Ruchi Pandey, Kengo Uchida, Yuichiro Koyama, Naoya Takahashi, Takashi Shibuya, Shusuke Takahashi, Tuomas Virtanen, Yuki Mitsufuji
Abstract:
This paper presents the objective, dataset, baseline, and metrics of Task 3 of the DCASE2025 Challenge on sound event localization and detection (SELD). In previous editions, the challenge used four-channel audio formats of first-order Ambisonics (FOA) and microphone array. In contrast, this year's challenge investigates SELD with stereo audio data (termed stereo SELD). This change shifts the focus from more specialized 360° audio and audiovisual scene analysis to more commonplace audio and media scenarios with limited field-of-view (FOV). Due to inherent angular ambiguities in stereo audio data, the task focuses on direction-of-arrival (DOA) estimation in the azimuth plane (left-right axis) along with distance estimation. The challenge remains divided into two tracks: audio-only and audiovisual, with the audiovisual track introducing a new sub-task of onscreen/offscreen event classification necessitated by the limited FOV. This challenge introduces the DCASE2025 Task3 Stereo SELD Dataset, whose stereo audio and perspective video clips are sampled and converted from the STARSS23 recordings. The baseline system is designed to process stereo audio and corresponding video frames as inputs. In addition to the typical SELD event classification and localization, it integrates onscreen/offscreen classification for the audiovisual track. The evaluation metrics have been modified to introduce an onscreen/offscreen accuracy metric, which assesses the models' ability to identify which sound sources are onscreen. In the experimental evaluation, the baseline system performs reasonably well with the stereo audio data.
中文:DCASE2025挑战赛任务3采用立体声音频进行声音事件定位与检测,聚焦方位角方向和距离估计,并为视听赛道新增屏内/屏外事件分类任务,同时引入了新数据集和修订的评估指标。
English: The DCASE2025 Challenge Task 3 introduces stereo audio for sound event localization and detection, focusing on azimuth direction and distance estimation while adding onscreen/offscreen classification for the audiovisual track, with a new dataset and modified evaluation metrics.
Authors:Charidimos Papadakis, Giorgos Filandrianos, Angeliki Dimitriou, Maria Lymperaiou, Konstantinos Thomas, Giorgos Stamou
Abstract:
We present StockSim, an open-source simulation platform for systematic evaluation of large language models (LLMs) in realistic financial decision-making scenarios. Unlike previous toolkits that offer limited scope, StockSim delivers a comprehensive system that fully models market dynamics and supports diverse simulation modes of varying granularity. It incorporates critical real-world factors, such as latency, slippage, and order-book microstructure, that were previously neglected, enabling more faithful and insightful assessment of LLM-based trading agents. An extensible, role-based agent framework supports heterogeneous trading strategies and multi-agent coordination, making StockSim a uniquely capable testbed for NLP research on reasoning under uncertainty and sequential decision-making. We open-source all our code at https: //github.com/harrypapa2002/StockSim.
中文: StockSim是一个开源平台,全面模拟市场动态并融入延迟和滑点等现实因素,用于在真实金融场景中评估基于大语言模型的交易代理。
English: StockSim is an open-source platform that comprehensively models market dynamics with real-world factors like latency and slippage to evaluate LLM-based trading agents in realistic financial scenarios.
Authors:JingLi Lin, Chenming Zhu, Runsen Xu, Xiaohan Mao, Xihui Liu, Tai Wang, Jiangmiao Pang
Abstract:
Recent advances in multimodal large language models (MLLMs) have shown remarkable capabilities in integrating vision and language for complex reasoning. While most existing benchmarks evaluate models under offline settings with a fixed set of pre-recorded inputs, we introduce OST-Bench, a benchmark designed to evaluate Online Spatio-Temporal understanding from the perspective of an agent actively exploring a scene. The Online aspect emphasizes the need to process and reason over incrementally acquired observations, while the Spatio-Temporal component requires integrating current visual inputs with historical memory to support dynamic spatial reasoning. OST-Bench better reflects the challenges of real-world embodied perception. Built on an efficient data collection pipeline, OST-Bench consists of 1.4k scenes and 10k question-answer pairs collected from ScanNet, Matterport3D, and ARKitScenes. We evaluate several leading MLLMs on OST-Bench and observe that they fall short on tasks requiring complex spatio-temporal reasoning. Under the online setting, their accuracy declines as the exploration horizon extends and the memory grows. Through further experimental analysis, we identify common error patterns across models and find that both complex clue-based spatial reasoning demands and long-term memory retrieval requirements significantly drop model performance along two separate axes, highlighting the core challenges that must be addressed to improve online embodied reasoning. To foster further research and development in the field, our codes, dataset, and benchmark are available. Our project page is: https://rbler1234.github.io/OSTBench.github.io/
中文: 本文提出OST-Bench这一新型基准测试,通过主动场景探索评估多模态大模型的在线时空推理能力,发现现有模型在处理动态记忆和复杂空间任务方面存在明显不足,揭示了该领域亟待解决的核心挑战。
English: This paper introduces OST-Bench, a novel benchmark for evaluating multimodal language models' online spatio-temporal reasoning abilities through active scene exploration, revealing their limitations in handling dynamic memory and complex spatial tasks despite current advancements.
Authors:Chaoshuo Zhang, Chenhao Lin, Zhengyu Zhao, Le Yang, Qian Wang, Chao Shen
Abstract:
Text-to-image diffusion models (T2I DMs), represented by Stable Diffusion, which generate highly realistic images based on textual input, have been widely used. However, their misuse poses serious security risks. While existing concept unlearning methods aim to mitigate these risks, they struggle to balance unlearning effectiveness with generative retainability.To overcome this limitation, we innovatively propose the Key Step Concept Unlearning (KSCU) method, which ingeniously capitalizes on the unique stepwise sampling characteristic inherent in diffusion models during the image generation process. Unlike conventional approaches that treat all denoising steps equally, KSCU strategically focuses on pivotal steps with the most influence over the final outcome by dividing key steps for different concept unlearning tasks and fine-tuning the model only at those steps. This targeted approach reduces the number of parameter updates needed for effective unlearning, while maximizing the retention of the model's generative capabilities.Through extensive benchmark experiments, we demonstrate that KSCU effectively prevents T2I DMs from generating undesirable images while better retaining the model's generative capabilities. Our code will be released.
Chinese: 为解决文本到图像扩散模型中概念遗忘效果与生成质量难以平衡的问题,我们提出关键步骤概念遗忘方法(KSCU),通过选择性微调关键去噪步骤,在高效消除不良概念的同时保持图像质量,其性能显著优于现有方法。
English: To address the trade-off between unlearning effectiveness and generation quality in text-to-image diffusion models, we propose Key Step Concept Unlearning (KSCU), which selectively fine-tunes key denoising steps to efficiently remove unwanted concepts while preserving image quality, achieving superior performance over existing methods.
Authors:Chaoshuo Zhang, Chenhao Lin, Zhengyu Zhao, Le Yang, Qian Wang, Chao Shen
Abstract:
Text-to-image diffusion models (T2I DMs), represented by Stable Diffusion, which generate highly realistic images based on textual input, have been widely used, but their flexibility also makes them prone to misuse for producing harmful or unsafe content. Concept unlearning has been used to prevent text-to-image diffusion models from being misused to generate undesirable visual content. However, existing methods struggle to trade off unlearning effectiveness with the preservation of generation quality. To address this limitation, we propose Key Step Concept Unlearning (KSCU), which selectively fine-tunes the model at key steps to the target concept. KSCU is inspired by the fact that different diffusion denoising steps contribute unequally to the final generation. Compared to previous approaches, which treat all denoising steps uniformly, KSCU avoids over-optimization of unnecessary steps for higher effectiveness and reduces the number of parameter updates for higher efficiency. For example, on the I2P dataset, KSCU outperforms ESD by 8.3% in nudity unlearning accuracy while improving FID by 8.4%, and achieves a high overall score of 0.92, substantially surpassing all other SOTA methods.
Chinese: 为解决文本到图像扩散模型中概念遗忘效果与生成质量难以平衡的问题,我们提出关键步骤概念遗忘方法(KSCU),通过选择性微调关键去噪步骤,在高效消除不良概念的同时保持图像质量,其性能显著优于现有方法。
English: To address the trade-off between unlearning effectiveness and generation quality in text-to-image diffusion models, we propose Key Step Concept Unlearning (KSCU), which selectively fine-tunes key denoising steps to efficiently remove unwanted concepts while preserving image quality, achieving superior performance over existing methods.
Authors:Meng Wei, Chenyang Wan, Xiqian Yu, Tai Wang, Yuqiang Yang, Xiaohan Mao, Chenming Zhu, Wenzhe Cai, Hanqing Wang, Yilun Chen, Xihui Liu, Jiangmiao Pang
Abstract:
Vision-and-Language Navigation (VLN) in real-world settings requires agents to process continuous visual streams and generate actions with low latency grounded in language instructions. While Video-based Large Language Models (Video-LLMs) have driven recent progress, current VLN methods based on Video-LLM often face trade-offs among fine-grained visual understanding, long-term context modeling and computational efficiency. We introduce StreamVLN, a streaming VLN framework that employs a hybrid slow-fast context modeling strategy to support multi-modal reasoning over interleaved vision, language and action inputs. The fast-streaming dialogue context facilitates responsive action generation through a sliding-window of active dialogues, while the slow-updating memory context compresses historical visual states using a 3D-aware token pruning strategy. With this slow-fast design, StreamVLN achieves coherent multi-turn dialogue through efficient KV cache reuse, supporting long video streams with bounded context size and inference cost. Experiments on VLN-CE benchmarks demonstrate state-of-the-art performance with stable low latency, ensuring robustness and efficiency in real-world deployment. The project page is: \href{https://streamvln.github.io/}{https://streamvln.github.io/}.
中文: StreamVLN通过慢-快双流上下文建模框架,结合动态令牌剪枝与KV缓存复用技术,在保证细粒度视觉理解的同时实现高效低延迟的视觉语言导航。
English: StreamVLN introduces a hybrid slow-fast context modeling framework that enables efficient, low-latency navigation by balancing fine-grained visual understanding with computational efficiency through dynamic token pruning and KV cache reuse.
Authors:Hengran Zhang, Keping Bi, Jiafeng Guo
Abstract:
While large language models (LLMs) are increasingly deployed as dense retrievers, the impact of their domain-specific specialization on retrieval effectiveness remains underexplored. This investigation systematically examines how task-specific adaptations in LLMs influence their retrieval capabilities, an essential step toward developing unified retrievers capable of handling text, code, images, and multimodal content. We conduct extensive experiments with eight Qwen2.5 7B LLMs, including base, instruction-tuned, code/math-specialized, long reasoning, and vision-language models across zero-shot retrieval settings and the supervised setting. For the zero-shot retrieval settings, we consider text retrieval from the BEIR benchmark and code retrieval from the CoIR benchmark. Further, to evaluate supervised performance, all LLMs are fine-tuned on the MS MARCO dataset. We find that mathematical specialization and the long reasoning capability cause consistent degradation in three settings, indicating conflicts between mathematical reasoning and semantic matching. The vision-language model and code-specialized LLMs demonstrate superior zero-shot performance compared to other LLMs, even surpassing BM25 on the code retrieval task, and maintain comparable performance to base LLMs in supervised settings. These findings suggest promising directions for the unified retrieval task leveraging cross-domain and cross-modal fusion.
中文: 领域特定的大语言模型在检索性能上表现各异,数学专业化会削弱效果,而视觉语言和代码专业化模型在零样本任务中表现卓越,为统一跨模态检索提供了可行方向。
English: Domain-specific LLMs show varied retrieval performance, with math specialization impairing effectiveness while vision-language and code-specialized models excel in zero-shot tasks, offering promising paths for unified cross-modal retrieval.
Authors:Jiayi Lu, Wanting Yang, Zehui Xiong, Rahim Tafazolli, Tony Q. S. Quek, Mérouane Debbah, Dong In Kim
Abstract:
With the booming development of generative artificial intelligence (GAI), semantic communication (SemCom) has emerged as a new paradigm for reliable and efficient communication. This paper considers a multi-user downlink SemCom system, using vehicular networks as the representative scenario for multi-user content dissemination. To address diverse yet overlapping user demands, we propose a multi-user Generative SemCom-enhanced intent-aware semantic-splitting multiple access (SS-MGSC) framework. In the framework, we construct an intent-aware shared knowledge base (SKB) that incorporates prior knowledge of semantic information (SI) and user-specific preferences. Then, we designate the common SI as a one-hot semantic map that is broadcast to all users, while the private SI is delivered as personalized text for each user. On the receiver side, a diffusion model enhanced with ControlNet is adopted to generate high-quality personalized images. To capture both semantic relevance and perceptual similarity, we design a novel semantic efficiency score (SES) metric as the optimization objective. Building on this, we formulate a joint optimization problem for multi-user semantic extraction and beamforming, solved using a reinforcement learning-based algorithm due to its robustness in high-dimensional settings. Simulation results demonstrate the effectiveness of the proposed scheme.
中文摘要:本文提出了一种面向车载网络的多用户生成语义通信框架,通过意图感知知识库和新型语义效率指标优化语义内容分发,并利用强化学习实现性能提升。
English Summary: This paper introduces a multi-user Generative SemCom framework for vehicular networks that optimizes semantic content delivery through an intent-aware knowledge base and a novel semantic efficiency metric, achieving enhanced performance via reinforcement learning.
Authors:Yiyan Ji, Haoran Chen, Qiguang Chen, Chengyue Wu, Libo Qin, Wanxiang Che
Abstract:
Multimodal planning capabilities refer to the ability to predict, reason, and design steps for task execution with multimodal context, which is essential for complex reasoning and decision-making across multiple steps. However, current benchmarks face two key challenges: (1) they cannot directly assess multimodal real-world planning capabilities, and (2) they lack constraints or implicit constraints across modalities. To address these issues, we introduce Multimodal Planning with Complex Constraints (MPCC), the first benchmark to systematically evaluate MLLMs' ability to handle multimodal constraints in planning. To address the first challenge, MPCC focuses on three real-world tasks: Flight Planning, Calendar Planning, and Meeting Planning. To solve the second challenge, we introduce complex constraints (e.g. budget, temporal, and spatial) in these tasks, with graded difficulty levels (EASY, MEDIUM, HARD) to separate constraint complexity from search space expansion. Experiments on 13 advanced MLLMs reveal significant challenges: closed-source models achieve only 21.3% feasible plans, while open-source models average below 11%. Additionally, we observe that MLLMs are highly sensitive to constraint complexity and that traditional multimodal prompting strategies fail in multi-constraint scenarios. Our work formalizes multimodal constraints in planning, provides a rigorous evaluation framework, and highlights the need for advancements in constraint-aware reasoning for real-world MLLM applications.
中文摘要:MPCC基准测试旨在评估多模态大语言模型在复杂现实约束下的规划能力,揭示了当前模型在生成可行计划和应对约束复杂性方面存在显著不足。
English Summary: The MPCC benchmark is introduced to evaluate multimodal large language models' planning abilities with complex real-world constraints, revealing their current limitations in generating feasible plans and handling constraint complexity.
Authors:Jiajun Cao, Qizhe Zhang, Peidong Jia, Xuhui Zhao, Bo Lan, Xiaoan Zhang, Zhuo Li, Xiaobao Wei, Sixiang Chen, Liyun Li, Xianming Liu, Ming Lu, Yang Wang, Shanghang Zhang
Abstract:
Vision-Language-Action (VLA) models have demonstrated significant potential in complex scene understanding and action reasoning, leading to their increasing adoption in end-to-end autonomous driving systems. However, the long visual tokens of VLA models greatly increase computational costs. Current visual token pruning methods in Vision-Language Models (VLM) rely on either visual token similarity or visual-text attention, but both have shown poor performance in autonomous driving scenarios. Given that human drivers concentrate on relevant foreground areas while driving, we assert that retaining visual tokens containing this foreground information is essential for effective decision-making. Inspired by this, we propose FastDriveVLA, a novel reconstruction-based vision token pruning framework designed specifically for autonomous driving. FastDriveVLA includes a plug-and-play visual token pruner called ReconPruner, which prioritizes foreground information through MAE-style pixel reconstruction. A novel adversarial foreground-background reconstruction strategy is designed to train ReconPruner for the visual encoder of VLA models. Once trained, ReconPruner can be seamlessly applied to different VLA models with the same visual encoder without retraining. To train ReconPruner, we also introduce a large-scale dataset called nuScenes-FG, consisting of 241K image-mask pairs with annotated foreground regions. Our approach achieves state-of-the-art results on the nuScenes open-loop planning benchmark across different pruning ratios.
中文摘要:FastDriveVLA是一种基于重建的视觉令牌剪枝框架,通过对抗性重建训练优先保留前景信息,在降低计算成本的同时,为自动驾驶系统实现了最先进的性能表现。
English Summary: FastDriveVLA is a novel reconstruction-based visual token pruning framework that enhances autonomous driving systems by prioritizing foreground information through adversarial reconstruction training, achieving state-of-the-art performance while reducing computational costs.
Authors:Zheyuan Zhang, Linkai Peng, Wanying Dou, Cuiling Sun, Halil Ertugrul Aktas, Andrea M. Bejar, Elif Keles, Gorkem Durak, Ulas Bagci
Abstract:
Clinical magnetic-resonance (MR) protocols generate many T1 and T2 sequences whose appearance differs more than the acquisition sites that produce them. Existing domain-generalization benchmarks focus almost on cross-center shifts and overlook this dominant source of variability. Pancreas segmentation remains a major challenge in abdominal imaging: the gland is small, irregularly, surrounded by organs and fat, and often suffers from low T1 contrast. State-of-the-art deep networks that already achieve >90% Dice on the liver or kidneys still miss 20-30% of the pancreas. The organ is also systematically under-represented in public cross-domain benchmarks, despite its clinical importance in early cancer detection, surgery, and diabetes research. To close this gap, we present PancreasDG, a large-scale multi-center 3D MRI pancreas segmentation dataset for investigating domain generalization in medical imaging. The dataset comprises 563 MRI scans from six institutions, spanning both venous phase and out-of-phase sequences, enabling study of both cross-center and cross-sequence variations with pixel-accurate pancreas masks created by a double-blind, two-pass protocol. Through comprehensive analysis, we reveal three insights: (i) limited sampling introduces significant variance that may be mistaken for distribution shifts, (ii) cross-center performance correlates with source domain performance for identical sequences, and (iii) cross-sequence shifts require specialized solutions. We also propose a semi-supervised approach that leverages anatomical invariances, significantly outperforming state-of-the-art domain generalization techniques with 61.63% Dice score improvements and 87.00% on two test centers for cross-sequence segmentation. PancreasDG sets a new benchmark for domain generalization in medical imaging. Dataset, code, and models will be available at https://pancreasdg.netlify.app.
中文摘要:作者提出了PancreasDG这一大型多中心三维MRI数据集,旨在解决胰腺分割中的领域泛化问题,研究表明跨序列变化比跨中心差异带来更大挑战,并提出一种半监督方法显著优于现有技术。
English summary: The authors introduce PancreasDG, a large multi-center 3D MRI dataset addressing domain generalization challenges in pancreas segmentation, demonstrating that cross-sequence variations pose greater difficulties than cross-center differences and proposing a semi-supervised method that significantly outperforms existing techniques.
Authors:Daniel Son, Sanjana Rathore, Andrew Rufail, Adrian Simon, Daniel Zhang, Soham Dave, Cole Blondin, Kevin Zhu, Sean O'Brien
Abstract:
We investigate feature universality in Gemma-2 language models (Gemma-2-2B and Gemma-2-9B), asking whether models with a four-fold difference in scale still converge on comparable internal concepts. Using the Sparse Autoencoder (SAE) dictionary-learning pipeline, we utilize SAEs on each model's residual-stream activations, align the resulting monosemantic features via activation correlation, and compare the matched feature spaces with SVCCA and RSA. Middle layers yield the strongest overlap, while early and late layers show far less similarity. Preliminary experiments extend the analysis from single tokens to multi-token subspaces, showing that semantically similar subspaces interact similarly with language models. These results strengthen the case that large language models carve the world into broadly similar, interpretable features despite size differences, reinforcing universality as a foundation for cross-model interpretability.
中文摘要:研究表明,不同规模的Gemma-2语言模型在中间层形成了相似的内部特征表示,这证实了可解释特征在不同规模模型间的普遍性。
English Summary: The study demonstrates that Gemma-2 language models of different scales develop similar internal feature representations, particularly in middle layers, supporting the universality of interpretable features across model sizes.
Authors:Hongyi Pan, Gorkem Durak, Elif Keles, Deniz Seyithanoglu, Zheyuan Zhang, Alpay Medetalibeyoglu, Halil Ertugrul Aktas, Andrea Mia Bejar, Ziliang Hong, Yavuz Taktak, Gulbiz Dagoglu Kartal, Mehmet Sukru Erturk, Timurhan Cebeci, Maria Jaramillo Gonzalez, Yury Velichko, Lili Zhao, Emil Agarunov, Federica Proietto Salanitri, Concetto Spampinato, Pallavi Tiwari, Ziyue Xu, Sachin Jambawalikar, Ivo G. Schoots, Marco J. Bruno, Chenchang Huang, Candice Bolan, Tamas Gonda, Frank H. Miller, Rajesh N. Keswani, Michael B. Wallace, Ulas Bagci
Abstract:
Pancreatic cancer is projected to become the second-deadliest malignancy in Western countries by 2030, highlighting the urgent need for better early detection. Intraductal papillary mucinous neoplasms (IPMNs), key precursors to pancreatic cancer, are challenging to assess with current guidelines, often leading to unnecessary surgeries or missed malignancies. We present Cyst-X, an AI framework that predicts IPMN malignancy using multicenter MRI data, leveraging MRI's superior soft tissue contrast over CT. Trained on 723 T1- and 738 T2-weighted scans from 764 patients across seven institutions, our models (AUC=0.82) significantly outperform both Kyoto guidelines (AUC=0.75) and expert radiologists. The AI-derived imaging features align with known clinical markers and offer biologically meaningful insights. We also demonstrate strong performance in a federated learning setting, enabling collaborative training without sharing patient data. To promote privacy-preserving AI development and improve IPMN risk stratification, the Cyst-X dataset is released as the first large-scale, multi-center pancreatic cysts MRI dataset.
Chinese: 胰腺癌预计到2030年将成为西方国家第二大致命恶性肿瘤,凸显了对改进早期检测方法的迫切需求。
English: Pancreatic cancer is set to become the second-leading cause of cancer deaths in Western countries by 2030, underscoring the critical need for improved early detection methods.
Authors:McNair Shah, Saleena Angeline, Adhitya Rajendra Kumar, Naitik Chheda, Kevin Zhu, Vasu Sharma, Sean O'Brien, Will Cai
Abstract:
Recent advances in large language models (LLMs) have intensified the need to understand and reliably curb their harmful behaviours. We introduce a multidimensional framework for probing and steering harmful content in model internals. For each of 55 distinct harmfulness subconcepts (e.g., racial hate, employment scams, weapons), we learn a linear probe, yielding 55 interpretable directions in activation space. Collectively, these directions span a harmfulness subspace that we show is strikingly low-rank. We then test ablation of the entire subspace from model internals, as well as steering and ablation in the subspace's dominant direction. We find that dominant direction steering allows for near elimination of harmfulness with a low decrease in utility. Our findings advance the emerging view that concept subspaces provide a scalable lens on LLM behaviour and offer practical tools for the community to audit and harden future generations of language models.
Chinese: 本研究提出一个多维框架,通过定义低秩有害性子空间来识别和控制大语言模型中的有害内容,借助主导方向调控可在几乎不影响实用性的前提下有效消除有害行为。
English: The study introduces a multidimensional framework to identify and control harmful content in large language models by defining a low-rank harmfulness subspace, enabling effective mitigation of harmful behaviors with minimal impact on utility through dominant direction steering.
Authors:Lei Zhu, Jun Zhou, Rick Siow Mong Goh, Yong Liu
Abstract:
Foundation models have recently gained tremendous popularity in medical image analysis. State-of-the-art methods leverage either paired image-text data via vision-language pre-training or unpaired image data via self-supervised pre-training to learn foundation models with generalizable image features to boost downstream task performance. However, learning foundation models exclusively on either paired or unpaired image data limits their ability to learn richer and more comprehensive image features. In this paper, we investigate a novel task termed semi-supervised vision-language pre-training, aiming to fully harness the potential of both paired and unpaired image data for foundation model learning. To this end, we propose MaskedCLIP, a synergistic masked image modeling and contrastive language-image pre-training framework for semi-supervised vision-language pre-training. The key challenge in combining paired and unpaired image data for learning a foundation model lies in the incompatible feature spaces derived from these two types of data. To address this issue, we propose to connect the masked feature space with the CLIP feature space with a bridge transformer. In this way, the more semantic specific CLIP features can benefit from the more general masked features for semantic feature extraction. We further propose a masked knowledge distillation loss to distill semantic knowledge of original image features in CLIP feature space back to the predicted masked image features in masked feature space. With this mutually interactive design, our framework effectively leverages both paired and unpaired image data to learn more generalizable image features for downstream tasks. Extensive experiments on retinal image analysis demonstrate the effectiveness and data efficiency of our method.
中文: 医学影像中的基础模型通过MaskedCLIP这一半监督框架,协同利用配对与非配对图像数据,结合掩码建模和对比学习,有效提升了下游任务的特征泛化能力。
English: Foundation models in medical imaging are enhanced by MaskedCLIP, a semi-supervised framework that synergistically combines paired and unpaired image data through masked image modeling and contrastive learning, improving feature generalization for downstream tasks.
Authors:Boheng Li, Junjie Wang, Yiming Li, Zhiyang Hu, Leyi Qi, Jianshuo Dong, Run Wang, Han Qiu, Zhan Qin, Tianwei Zhang
Abstract:
Despite the integration of safety alignment and external filters, text-to-image (T2I) generative models are still susceptible to producing harmful content, such as sexual or violent imagery. This raises serious concerns about unintended exposure and potential misuse. Red teaming, which aims to proactively identify diverse prompts that can elicit unsafe outputs from the T2I system (including the core generative model as well as potential external safety filters and other processing components), is increasingly recognized as an essential method for assessing and improving safety before real-world deployment. Yet, existing automated red teaming approaches often treat prompt discovery as an isolated, prompt-level optimization task, which limits their scalability, diversity, and overall effectiveness. To bridge this gap, in this paper, we propose DREAM, a scalable red teaming framework to automatically uncover diverse problematic prompts from a given T2I system. Unlike most prior works that optimize prompts individually, DREAM directly models the probabilistic distribution of the target system's problematic prompts, which enables explicit optimization over both effectiveness and diversity, and allows efficient large-scale sampling after training. To achieve this without direct access to representative training samples, we draw inspiration from energy-based models and reformulate the objective into simple and tractable objectives. We further introduce GC-SPSA, an efficient optimization algorithm that provide stable gradient estimates through the long and potentially non-differentiable T2I pipeline. The effectiveness of DREAM is validated through extensive experiments, demonstrating that it surpasses 9 state-of-the-art baselines by a notable margin across a broad range of T2I models and safety filters in terms of prompt success rate and diversity.
Chinese: 尽管已有安全措施,文本到图像模型仍易生成有害内容,为此提出的DREAM框架通过建模问题提示的概率分布,高效发掘多样化风险提示,在成功率和多样性上显著超越现有方法。
English: Despite existing safety measures, text-to-image models remain vulnerable to generating harmful content, prompting the development of DREAM, a scalable red teaming framework that efficiently uncovers diverse problematic prompts by modeling their probabilistic distribution and outperforms current methods in both success rate and diversity.
Authors:Simon Kohaut, Felix Divo, Navid Hamid, Benedict Flade, Julian Eggert, Devendra Singh Dhami, Kristian Kersting
Abstract:
Ensuring reliable and rule-compliant behavior of autonomous agents in uncertain environments remains a fundamental challenge in modern robotics. Our work shows how neuro-symbolic systems, which integrate probabilistic, symbolic white-box reasoning models with deep learning methods, offer a powerful solution to this challenge. This enables the simultaneous consideration of explicit rules and neural models trained on noisy data, combining the strength of structured reasoning with flexible representations. To this end, we introduce the Constitutional Controller (CoCo), a novel framework designed to enhance the safety and reliability of agents by reasoning over deep probabilistic logic programs representing constraints such as those found in shared traffic spaces. Furthermore, we propose the concept of self-doubt, implemented as a probability density conditioned on doubt features such as travel velocity, employed sensors, or health factors. In a real-world aerial mobility study, we demonstrate CoCo's advantages for intelligent autonomous systems to learn appropriate doubts and navigate complex and uncertain environments safely and compliantly.
中文摘要:本文提出宪法控制器(CoCo)这一神经符号框架,通过融合概率推理与深度学习来增强自主代理的安全性,并引入自我怀疑机制使系统能在不确定环境中实现合规导航。
English Summary: The paper introduces the Constitutional Controller (CoCo), a neuro-symbolic framework that enhances autonomous agent safety by combining probabilistic reasoning with deep learning, incorporating self-doubt mechanisms to navigate uncertain environments compliantly.
Authors:Max van den Hoven, Kishaan Jeeveswaran, Pieter Piscaer, Thijs Wensveen, Elahe Arani, Bahram Zonooz
Abstract:
Monocular 3D lane detection is essential for autonomous driving, but challenging due to the inherent lack of explicit spatial information. Multi-modal approaches rely on expensive depth sensors, while methods incorporating fully-supervised depth networks rely on ground-truth depth data that is impractical to collect at scale. Additionally, existing methods assume that camera parameters are available, limiting their applicability in scenarios like crowdsourced high-definition (HD) lane mapping. To address these limitations, we propose Depth3DLane, a novel dual-pathway framework that integrates self-supervised monocular depth estimation to provide explicit structural information, without the need for expensive sensors or additional ground-truth depth data. Leveraging a self-supervised depth network to obtain a point cloud representation of the scene, our bird's-eye view pathway extracts explicit spatial information, while our front view pathway simultaneously extracts rich semantic information. Depth3DLane then uses 3D lane anchors to sample features from both pathways and infer accurate 3D lane geometry. Furthermore, we extend the framework to predict camera parameters on a per-frame basis and introduce a theoretically motivated fitting procedure to enhance stability on a per-segment basis. Extensive experiments demonstrate that Depth3DLane achieves competitive performance on the OpenLane benchmark dataset. Furthermore, experimental results show that using learned parameters instead of ground-truth parameters allows Depth3DLane to be applied in scenarios where camera calibration is infeasible, unlike previous methods.
中文: Depth3DLane提出了一种双通路框架,通过自监督深度估计实现无需昂贵传感器或真实深度数据的单目3D车道检测,并能预测相机参数以增强在无标定场景下的适用性。
English: Depth3DLane introduces a dual-pathway framework that integrates self-supervised depth estimation to enable monocular 3D lane detection without relying on expensive sensors or ground-truth data, while also extending to predict camera parameters for broader applicability.
Authors:Cheng-Ting Chou, George Liu, Jessica Sun, Cole Blondin, Kevin Zhu, Vasu Sharma, Sean O'Brien
Abstract:
Deterministically controlling the target generation language of large multilingual language models (LLMs) remains a fundamental challenge, particularly in zero-shot settings where neither explicit language prompts nor fine-tuning are available. In this work, we investigate whether sparse autoencoder (SAE) features, previously shown to correlate with interpretable model behaviors, can be leveraged to steer the generated language of LLMs during inference. Leveraging pretrained SAEs on the residual streams of Gemma-2B and Gemma-9B, we identify features whose activations differ most significantly between English and four target languages: Chinese, Japanese, Spanish, and French. By modifying just a single SAE feature at one transformer layer, we achieve controlled language shifts with up to 90\% success, as measured by FastText language classification, while preserving semantic fidelity according to LaBSE (Language-Agnostic BERT Sentence Embedding) similarity. Our analysis reveals that language steering is most effective in mid-to-late transformer layers and is amplified by specific attention heads disproportionately associated with language-sensitive SAE features. These results demonstrate the promise of sparse feature steering as a lightweight and interpretable mechanism for controllable multilingual generation.
中文摘要:本研究证明,通过修改多语言大模型中单个稀疏自编码器特征,可在保持语义完整性的同时有效引导生成文本转向目标语言,尤其在模型中后期转换层效果最为显著。
English Summary: This study demonstrates that modifying a single sparse autoencoder feature in multilingual language models can effectively steer generated text to target languages with high accuracy while maintaining semantic integrity, particularly in mid-to-late transformer layers.
Authors:Ishant Chintapatla, Kazuma Choji, Naaisha Agarwal, Andrew Lin, Hannah You, Charles Duong, Kevin Zhu, Sean O'Brien, Vasu Sharma
Abstract:
Recently, many benchmarks and datasets have been developed to evaluate Vision-Language Models (VLMs) using visual question answering (VQA) pairs, and models have shown significant accuracy improvements. However, these benchmarks rarely test the model's ability to accurately complete visual entailment, for instance, accepting or refuting a hypothesis based on the image. To address this, we propose COREVQA (Crowd Observations and Reasoning Entailment), a benchmark of 5608 image and synthetically generated true/false statement pairs, with images derived from the CrowdHuman dataset, to provoke visual entailment reasoning on challenging crowded images. Our results show that even the top-performing VLMs achieve accuracy below 80%, with other models performing substantially worse (39.98%-69.95%). This significant performance gap reveals key limitations in VLMs' ability to reason over certain types of image-question pairs in crowded scenes.
Chinese: 近期视觉语言模型(VLM)的基准多集中于视觉问答,却忽视了视觉蕴含能力,为此我们提出了COREVQA基准,利用拥挤图像测试模型,结果显示即使最优模型准确率也低于80%,暴露了VLM在复杂场景推理中的严重不足。
English: Recent benchmarks for Vision-Language Models (VLMs) focus on visual question answering but overlook visual entailment, prompting the introduction of COREVQA, a new benchmark using crowded images that reveals VLMs' significant limitations in reasoning, with top models scoring below 80% accuracy.
Authors:Yuhang Lu, Jiadong Tu, Yuexin Ma, Xinge Zhu
Abstract:
End-to-end autonomous driving has emerged as a promising approach to unify perception, prediction, and planning within a single framework, reducing information loss and improving adaptability. However, existing methods often rely on fixed and sparse trajectory supervision, limiting their ability to capture the hierarchical reasoning process that human drivers naturally employ. To bridge this gap, we propose ReAL-AD, a Reasoning-Augmented Learning framework that structures decision-making in autonomous driving based on the three-tier human cognitive model: Driving Strategy, Driving Decision, and Driving Operation, where Vision-Language Models (VLMs) are incorporated to enhance situational awareness and structured reasoning across these levels. Specifically, we introduce: (1) the Strategic Reasoning Injector, which formulates high-level driving strategies by interpreting complex traffic contexts from VLM-generated insights; (2) the Tactical Reasoning Integrator, which refines strategic intent into interpretable tactical choices such as lane changes, overtaking, and speed adjustments; and (3) the Hierarchical Trajectory Decoder, which progressively translates tactical decisions into precise control actions for smooth and human-like trajectory execution. Extensive evaluations show that integrating our framework improves planning accuracy and safety by over 30%, making end-to-end autonomous driving more interpretable and aligned with human-like hierarchical reasoning. The project page can be found at: \href{https://4dvlab.github.io/project_page/realad}{\texttt{4dvlab.github.io/project\_page/realad}}
中文: 提出的ReAL-AD框架通过视觉语言模型融入类人分层推理,将自动驾驶决策分为战略、战术和操作三个层级,使规划准确性和安全性提升超过30%。
English: The proposed ReAL-AD framework enhances autonomous driving by incorporating human-like hierarchical reasoning through vision-language models, improving planning accuracy and safety by over 30%.
Authors:Konstantinos I. Roumeliotis, Ranjan Sapkota, Manoj Karkee, Nikolaos D. Tselikas
Abstract:
Modern Artificial Intelligence (AI) increasingly relies on multi-agent architectures that blend visual and language understanding. Yet, a pressing challenge remains: How can we trust these agents especially in zero-shot settings with no fine-tuning? We introduce a novel modular Agentic AI visual classification framework that integrates generalist multimodal agents with a non-visual reasoning orchestrator and a Retrieval-Augmented Generation (RAG) module. Applied to apple leaf disease diagnosis, we benchmark three configurations: (I) zero-shot with confidence-based orchestration, (II) fine-tuned agents with improved performance, and (III) trust-calibrated orchestration enhanced by CLIP-based image retrieval and re-evaluation loops. Using confidence calibration metrics (ECE, OCR, CCC), the orchestrator modulates trust across agents. Our results demonstrate a 77.94\% accuracy improvement in the zero-shot setting using trust-aware orchestration and RAG, achieving 85.63\% overall. GPT-4o showed better calibration, while Qwen-2.5-VL displayed overconfidence. Furthermore, image-RAG grounded predictions with visually similar cases, enabling correction of agent overconfidence via iterative re-evaluation. The proposed system separates perception (vision agents) from meta-reasoning (orchestrator), enabling scalable and interpretable multi-agent AI. This blueprint illustrates how Agentic AI can deliver trustworthy, modular, and transparent reasoning, and is extensible to diagnostics, biology, and other trust-critical domains. In doing so, we highlight Agentic AI not just as an architecture but as a paradigm for building reliable multi-agent intelligence. agentic ai, orchestrator agent trust, trust orchestration, visual classification, retrieval augmented reasoning
中文摘要:本研究提出模块化智能体AI框架,通过融合多模态智能体与推理协调器及检索增强生成模块,在苹果叶病诊断中实现85.63%准确率,验证了可信协调机制能有效提升零样本场景下的系统可靠性。
English Summary: The study presents a modular Agentic AI framework that enhances trust in multi-agent systems by integrating multimodal agents with a reasoning orchestrator and RAG module, achieving 85.63% accuracy in apple leaf disease diagnosis through trust-calibrated orchestration.
Authors:Lihua Zhou, Mao Ye, Nianxin Li, Shuaifeng Li, Jinlin Wu, Xiatian Zhu, Lei Deng, Hongbin Liu, Jiebo Luo, Zhen Lei
Abstract:
Deep learning often struggles when training and test data distributions differ. Traditional domain generalization (DG) tackles this by including data from multiple source domains, which is impractical due to expensive data collection and annotation. Recent vision-language models like CLIP enable source-free domain generalization (SFDG) by using text prompts to simulate visual representations, reducing data demands. However, existing SFDG methods struggle with domain-specific confounders, limiting their generalization capabilities. To address this issue, we propose TDCRL (\textbf{T}ext-\textbf{D}riven \textbf{C}ausal \textbf{R}epresentation \textbf{L}earning), the first method to integrate causal inference into the SFDG setting. TDCRL operates in two steps: first, it employs data augmentation to generate style word vectors, combining them with class information to generate text embeddings to simulate visual representations; second, it trains a causal intervention network with a confounder dictionary to extract domain-invariant features. Grounded in causal learning, our approach offers a clear and effective mechanism to achieve robust, domain-invariant features, ensuring robust generalization. Extensive experiments on PACS, VLCS, OfficeHome, and DomainNet show state-of-the-art performance, proving TDCRL effectiveness in SFDG.
中文摘要:TDCRL提出了一种新颖的文本驱动因果表征学习方法,通过生成文本嵌入模拟视觉特征并训练因果干预网络,有效解决了无源域泛化中的领域特定混杂因素问题,实现了鲁棒的领域不变表征。
English Summary: TDCRL introduces a novel text-driven causal representation learning method that overcomes domain-specific confounders in source-free domain generalization by generating text embeddings to simulate visual features and training a causal intervention network for robust, domain-invariant representations.
Authors:Yongheng Zhang, Xu Liu, Ruihan Tao, Qiguang Chen, Hao Fei, Wanxiang Che, Libo Qin
Abstract:
Video understanding plays a vital role in bridging low-level visual signals with high-level cognitive reasoning, and is fundamental to applications such as autonomous driving, embodied AI, and the broader pursuit of AGI. The rapid development of large language models (LLMs), particularly those utilizing Chain-of-Thought (CoT) technology, has significantly advanced video reasoning capabilities. However, current approaches primarily depend on textual information for reasoning, overlooking the visual modality in the actual video reasoning process. In contrast, humans naturally re-examine visual content while reasoning. Motivated by this, we introduce a novel video reasoning paradigm: Video-Text Interleaved CoT (ViTCoT), which facilitates more intuitive and cognitively aligned reasoning. To the end, first, we construct the Video-Text Interleaved Benchmark (ViTIB), which is created using MLLMs for key-video selection and manually verified. Furthermore, we extensively explore the potential of the ViTCoT paradigm in the video understanding field. Extensive experiments demonstrate that ViTCoT significantly enhances performance compared to the traditional text-only CoT paradigm and effectively activates more neuron values in MLLMs.
中文摘要:ViTCoT范式通过交错式思维链处理将视觉与文本推理相结合,相比纯文本方法显著提升了视频理解性能,并有效激活了多模态大语言模型中更多神经元。
English Summary: The ViTCoT paradigm integrates visual and textual reasoning through interleaved chain-of-thought processing, significantly outperforming text-only approaches by activating more neural pathways in multimodal large language models.
Authors:Weijia Shi, Akshita Bhagia, Kevin Farhat, Niklas Muennighoff, Pete Walsh, Jacob Morrison, Dustin Schwenk, Shayne Longpre, Jake Poznanski, Allyson Ettinger, Daogao Liu, Margaret Li, Dirk Groeneveld, Mike Lewis, Wen-tau Yih, Luca Soldaini, Kyle Lo, Noah A. Smith, Luke Zettlemoyer, Pang Wei Koh, Hannaneh Hajishirzi, Ali Farhadi, Sewon Min
Abstract:
We introduce FlexOlmo, a new class of language models (LMs) that supports (1) distributed training without data sharing, where different model parameters are independently trained on closed datasets, and (2) data-flexible inference, where these parameters along with their associated data can be flexibly included or excluded from model inferences with no further training. FlexOlmo employs a mixture-of-experts (MoE) architecture where each expert is trained independently on closed datasets and later integrated through a new domain-informed routing without any joint training. FlexOlmo is trained on FlexMix, a corpus we curate comprising publicly available datasets alongside seven domain-specific sets, representing realistic approximations of closed sets. We evaluate models with up to 37 billion parameters (20 billion active) on 31 diverse downstream tasks. We show that a general expert trained on public data can be effectively combined with independently trained experts from other data owners, leading to an average 41% relative improvement while allowing users to opt out of certain data based on data licensing or permission requirements. Our approach also outperforms prior model merging methods by 10.1% on average and surpasses the standard MoE trained without data restrictions using the same training FLOPs. Altogether, this research presents a solution for both data owners and researchers in regulated industries with sensitive or protected data. FlexOlmo enables benefiting from closed data while respecting data owners' preferences by keeping their data local and supporting fine-grained control of data access during inference.
中文摘要:FlexOlmo是一种新型语言模型架构,支持在不共享数据的情况下进行分布式训练,并能通过灵活组合不同领域的专家模块实现可配置推理,在提升模型性能的同时有效保护数据隐私。
English Summary: FlexOlmo is a novel language model architecture that enables distributed training on private datasets without data sharing and supports flexible inference by allowing selective inclusion of domain-specific experts, achieving significant performance improvements while maintaining data privacy.
Authors:Maolin Wang, Jun Chu, Sicong Xie, Xiaoling Zang, Yao Zhao, Wenliang Zhong, Xiangyu Zhao
Abstract:
In the era of mobile computing, deploying efficient Natural Language Processing (NLP) models in resource-restricted edge settings presents significant challenges, particularly in environments requiring strict privacy compliance, real-time responsiveness, and diverse multi-tasking capabilities. These challenges create a fundamental need for ultra-compact models that maintain strong performance across various NLP tasks while adhering to stringent memory constraints. To this end, we introduce Edge ultra-lIte BERT framework (EI-BERT) with a novel cross-distillation method. EI-BERT efficiently compresses models through a comprehensive pipeline including hard token pruning, cross-distillation and parameter quantization. Specifically, the cross-distillation method uniquely positions the teacher model to understand the student model's perspective, ensuring efficient knowledge transfer through parameter integration and the mutual interplay between models. Through extensive experiments, we achieve a remarkably compact BERT-based model of only 1.91 MB - the smallest to date for Natural Language Understanding (NLU) tasks. This ultra-compact model has been successfully deployed across multiple scenarios within the Alipay ecosystem, demonstrating significant improvements in real-world applications. For example, it has been integrated into Alipay's live Edge Recommendation system since January 2024, currently serving the app's recommendation traffic across \textbf{8.4 million daily active devices}.
中文: EI-BERT框架通过交叉蒸馏技术开发出仅1.91MB的超轻量BERT模型,专为边缘计算环境设计,已在支付宝推荐系统中成功部署,每日服务840万台活跃设备。
English: The EI-BERT framework introduces cross-distillation to create a 1.91MB ultra-compact BERT model for efficient NLP deployment in edge environments, successfully implemented in Alipay's recommendation system serving millions of devices.
Authors:Guohong Liu, Jialei Ye, Jiacheng Liu, Yuanchun Li, Wei Liu, Pengzhi Gao, Jian Luan, Yunxin Liu
Abstract:
Mobile GUI agents are designed to autonomously execute diverse device-control tasks by interpreting and interacting with mobile screens. Despite notable advancements, their resilience in real-world scenarios where screen content may be partially manipulated by untrustworthy third parties remains largely unexplored. Owing to their black-box and autonomous nature, these agents are vulnerable to manipulations that could compromise user devices. In this work, we present the first systematic investigation into the vulnerabilities of mobile GUI agents. We introduce a scalable attack simulation framework AgentHazard, which enables flexible and targeted modifications of screen content within existing applications. Leveraging this framework, we develop a comprehensive benchmark suite comprising both a dynamic task execution environment and a static dataset of vision-language-action tuples, totaling over 3,000 attack scenarios. The dynamic environment encompasses 58 reproducible tasks in an emulator with various types of hazardous UI content, while the static dataset is constructed from 210 screenshots collected from 14 popular commercial apps. Importantly, our content modifications are designed to be feasible for unprivileged third parties. We evaluate 7 widely-used mobile GUI agents and 5 common backbone models using our benchmark. Our findings reveal that all examined agents are significantly influenced by misleading third-party content (with an average misleading rate of 28.8% in human-crafted attack scenarios) and that their vulnerabilities are closely linked to the employed perception modalities and backbone LLMs. Furthermore, we assess training-based mitigation strategies, highlighting both the challenges and opportunities for enhancing the robustness of mobile GUI agents. Our code and data will be released at https://agenthazard.github.io.
中文: 本研究提出首个系统性评估移动GUI代理漏洞的框架AgentHazard,揭示了其对第三方屏幕操控的易受攻击性,并探索了增强鲁棒性的解决方案。
English: This study introduces AgentHazard, the first systematic framework for evaluating vulnerabilities in mobile GUI agents, revealing their susceptibility to third-party screen manipulations and proposing mitigation strategies.
Authors:Ranjan Sapkota, Zhichao Meng, Martin Churuvija, Xiaoqiang Du, Zenghong Ma, Manoj Karkee
Abstract:
In orchard automation, dense foliage during the canopy season severely occludes tree structures, minimizing visibility to various canopy parts such as trunks and branches, which limits the ability of a machine vision system. However, canopy structure is more open and visible during the dormant season when trees are defoliated. In this work, we present an information fusion framework that integrates multi-seasonal structural data to support robotic and automated crop load management during the entire growing season. The framework combines high-resolution RGB-D imagery from both dormant and canopy periods using YOLOv9-Seg for instance segmentation, Kinect Fusion for 3D reconstruction, and Fast Generalized Iterative Closest Point (Fast GICP) for model alignment. Segmentation outputs from YOLOv9-Seg were used to extract depth-informed masks, which enabled accurate 3D point cloud reconstruction via Kinect Fusion; these reconstructed models from each season were subsequently aligned using Fast GICP to achieve spatially coherent multi-season fusion. The YOLOv9-Seg model, trained on manually annotated images, achieved a mean squared error (MSE) of 0.0047 and segmentation mAP@50 scores up to 0.78 for trunks in dormant season dataset. Kinect Fusion enabled accurate reconstruction of tree geometry, validated with field measurements resulting in root mean square errors (RMSE) of 5.23 mm for trunk diameter, 4.50 mm for branch diameter, and 13.72 mm for branch spacing. Fast GICP achieved precise cross-seasonal registration with a minimum fitness score of 0.00197, allowing integrated, comprehensive tree structure modeling despite heavy occlusions during the growing season. This fused structural representation enables robotic systems to access otherwise obscured architectural information, improving the precision of pruning, thinning, and other automated orchard operations.
中文: 本研究提出一种多季节信息融合框架,通过整合休眠期和生长期的RGB-D图像数据,结合实例分割与点云配准技术,有效克服枝叶遮挡问题,为全生长季的果园机器人精准作业提供支持。
English: This study introduces a multi-seasonal information fusion framework that integrates 3D structural data from both dormant and canopy seasons using RGB-D imagery, instance segmentation, and point cloud alignment to overcome foliage occlusion and enable precise robotic orchard management throughout the growing season.
Authors:Imran Mirza, Cole Huang, Ishwara Vasista, Rohan Patil, Asli Akalin, Sean O'Brien, Kevin Zhu
Abstract:
Multi-agent systems, which consist of multiple AI models interacting within a shared environment, are increasingly used for persona-based interactions. However, if not carefully designed, these systems can reinforce implicit biases in large language models (LLMs), raising concerns about fairness and equitable representation. We present MALIBU, a novel benchmark developed to assess the degree to which LLM-based multi-agent systems implicitly reinforce social biases and stereotypes. MALIBU evaluates bias in LLM-based multi-agent systems through scenario-based assessments. AI models complete tasks within predefined contexts, and their responses undergo evaluation by an LLM-based multi-agent judging system in two phases. In the first phase, judges score responses labeled with specific demographic personas (e.g., gender, race, religion) across four metrics. In the second phase, judges compare paired responses assigned to different personas, scoring them and selecting the superior response. Our study quantifies biases in LLM-generated outputs, revealing that bias mitigation may favor marginalized personas over true neutrality, emphasizing the need for nuanced detection, balanced fairness strategies, and transparent evaluation benchmarks in multi-agent systems.
中文: MALIBU是一种新颖的基准测试工具,通过情景化评估衡量基于大语言模型的多智能体系统对社会偏见的隐性强化程度,研究发现偏见缓解可能偏向边缘化群体,强调需要更细致的公平性策略和透明评估标准。
English: MALIBU is a novel benchmark designed to evaluate how LLM-based multi-agent systems reinforce social biases through scenario-based assessments, revealing that bias mitigation may favor marginalized personas and highlighting the need for nuanced fairness strategies.
Authors:Rizwan Qureshi, Ranjan Sapkota, Abbas Shah, Amgad Muneer, Anas Zafar, Ashmal Vayani, Maged Shoman, Abdelrahman B. M. Eldaly, Kai Zhang, Ferhat Sadak, Shaina Raza, Xinqi Fan, Ravid Shwartz-Ziv, Hong Yan, Vinjia Jain, Aman Chadha, Manoj Karkee, Jia Wu, Seyedali Mirjalili
Abstract:
Can machines truly think, reason and act in domains like humans? This enduring question continues to shape the pursuit of Artificial General Intelligence (AGI). Despite the growing capabilities of models such as GPT-4.5, DeepSeek, Claude 3.5 Sonnet, Phi-4, and Grok 3, which exhibit multimodal fluency and partial reasoning, these systems remain fundamentally limited by their reliance on token-level prediction and lack of grounded agency. This paper offers a cross-disciplinary synthesis of AGI development, spanning artificial intelligence, cognitive neuroscience, psychology, generative models, and agent-based systems. We analyze the architectural and cognitive foundations of general intelligence, highlighting the role of modular reasoning, persistent memory, and multi-agent coordination. In particular, we emphasize the rise of Agentic RAG frameworks that combine retrieval, planning, and dynamic tool use to enable more adaptive behavior. We discuss generalization strategies, including information compression, test-time adaptation, and training-free methods, as critical pathways toward flexible, domain-agnostic intelligence. Vision-Language Models (VLMs) are reexamined not just as perception modules but as evolving interfaces for embodied understanding and collaborative task completion. We also argue that true intelligence arises not from scale alone but from the integration of memory and reasoning: an orchestration of modular, interactive, and self-improving components where compression enables adaptive behavior. Drawing on advances in neurosymbolic systems, reinforcement learning, and cognitive scaffolding, we explore how recent architectures begin to bridge the gap between statistical learning and goal-directed cognition. Finally, we identify key scientific, technical, and ethical challenges on the path to AGI.
中文: 尽管先进AI模型展现出强大能力,但受限于符号预测和缺乏自主行动力,实现通用人工智能仍需融合记忆、推理与模块化系统。
English: While advanced AI models show impressive capabilities, they remain limited by token-level prediction and lack true agency, requiring integration of memory, reasoning, and modular systems to achieve Artificial General Intelligence.
Authors:Sixiao Zhang, Mingrui Liu, Cheng Long, Wei Yuan, Hongxu Chen, Xiangyu Zhao, Hongzhi Yin
Abstract:
Conversational recommender systems (CRSs) capture user preference through textual information in dialogues. However, they suffer from data sparsity on two fronts: the dialogue space is vast and linguistically diverse, while the item space exhibits long-tail and sparse distributions. Existing methods struggle with (1) generalizing to varied dialogue expressions due to underutilization of rich textual cues, and (2) learning informative item representations under severe sparsity. To address these problems, we propose a CRS model named DACRS. It consists of three modules, namely Dialogue Augmentation, Knowledge-Guided Entity Modeling, and Dialogue-Entity Matching. In the Dialogue Augmentation module, we apply a two-stage augmentation pipeline to augment the dialogue context to enrich the data and improve generalizability. In the Knowledge-Guided Entity Modeling, we propose a knowledge graph (KG) based entity substitution and an entity similarity constraint to enhance the expressiveness of entity embeddings. In the Dialogue-Entity Matching module, we fuse the dialogue embedding with the mentioned entity embeddings through a dialogue-guided attention aggregation to acquire user embeddings that contain both the explicit and implicit user preferences. Extensive experiments on two public datasets demonstrate the state-of-the-art performance of DACRS.
Chinese: DACRS模型通过对话增强、知识引导的实体建模和对话-实体匹配三个模块,有效缓解对话推荐系统中的数据稀疏问题,提升对话泛化能力和实体表示效果。
English: The DACRS model addresses data sparsity in conversational recommender systems by employing dialogue augmentation, knowledge-guided entity modeling, and dialogue-entity matching to enhance generalization and item representation learning.
Authors:Yang Zhou, Chrystie Wan Ning Quek, Jun Zhou, Yan Wang, Yang Bai, Yuhe Ke, Jie Yao, Laura Gutierrez, Zhen Ling Teo, Darren Shu Jeng Ting, Brian T. Soetikno, Christopher S. Nielsen, Tobias Elze, Zengxiang Li, Linh Le Dinh, Lionel Tim-Ee Cheng, Tran Nguyen Tuan Anh, Chee Leong Cheng, Tien Yin Wong, Nan Liu, Iain Beehuat Tan, Tony Kiat Hon Lim, Rick Siow Mong Goh, Yong Liu, Daniel Shu Wei Ting
Abstract:
Current artificial intelligence models for medical imaging are predominantly single modality and single disease. Attempts to create multimodal and multi-disease models have resulted in inconsistent clinical accuracy. Furthermore, training these models typically requires large, labour-intensive, well-labelled datasets. We developed MerMED-FM, a state-of-the-art multimodal, multi-specialty foundation model trained using self-supervised learning and a memory module. MerMED-FM was trained on 3.3 million medical images from over ten specialties and seven modalities, including computed tomography (CT), chest X-rays (CXR), ultrasound (US), pathology patches, color fundus photography (CFP), optical coherence tomography (OCT) and dermatology images. MerMED-FM was evaluated across multiple diseases and compared against existing foundational models. Strong performance was achieved across all modalities, with AUROCs of 0.988 (OCT); 0.982 (pathology); 0.951 (US); 0.943 (CT); 0.931 (skin); 0.894 (CFP); 0.858 (CXR). MerMED-FM has the potential to be a highly adaptable, versatile, cross-specialty foundation model that enables robust medical imaging interpretation across diverse medical disciplines.
中文: MerMED-FM是一种先进的多模态、多专科基础模型,通过自监督学习在330万张医学图像上训练而成,在七种影像模态和多种疾病诊断中均展现出卓越的准确率。
English: MerMED-FM is a state-of-the-art multimodal, multi-specialty foundation model trained on 3.3 million medical images using self-supervised learning, achieving strong diagnostic accuracy across seven imaging modalities and multiple diseases.
Authors:Yang Wang, Chenghao Xiao, Yizhi Li, Stuart E. Middleton, Noura Al Moubayed, Chenghua Lin
Abstract:
Pre-trained language models (PLMs) have driven substantial progress in natural language processing but remain vulnerable to adversarial attacks, raising concerns about their robustness in real-world applications. Previous studies have sought to mitigate the impact of adversarial attacks by introducing adversarial perturbations into the training process, either implicitly or explicitly. While both strategies enhance robustness, they often incur high computational costs. In this work, we propose a simple yet effective add-on module that enhances the adversarial robustness of PLMs by removing instance-level principal components, without relying on conventional adversarial defences or perturbing the original training data. Our approach transforms the embedding space to approximate Gaussian properties, thereby reducing its susceptibility to adversarial perturbations while preserving semantic relationships. This transformation aligns embedding distributions in a way that minimises the impact of adversarial noise on decision boundaries, enhancing robustness without requiring adversarial examples or costly training-time augmentation. Evaluations on eight benchmark datasets show that our approach improves adversarial robustness while maintaining comparable before-attack accuracy to baselines, achieving a balanced trade-off between robustness and generalisation.
Chinese Summary: 本研究提出一种简单的附加模块,通过移除实例级主成分来增强预训练语言模型的对抗鲁棒性,无需依赖传统对抗防御或扰动原始数据即可实现有效防护。
English Summary: This study introduces a simple add-on module that enhances the adversarial robustness of pre-trained language models by removing instance-level principal components, achieving improved defense without relying on conventional adversarial training or perturbing data.
Authors:Yang Wang, Chenghao Xiao, Yizhi Li, Stuart E. Middleton, Noura Al Moubayed, Chenghua Lin
Abstract:
Pre-trained language models (PLMs) have driven substantial progress in natural language processing but remain vulnerable to adversarial attacks, raising concerns about their robustness in real-world applications. Previous studies have sought to mitigate the impact of adversarial attacks by introducing adversarial perturbations into the training process, either implicitly or explicitly. While both strategies enhance robustness, they often incur high computational costs. In this work, we propose a simple yet effective add-on module that enhances the adversarial robustness of PLMs by removing instance-level principal components, without relying on conventional adversarial defences or perturbing the original training data. Our approach transforms the embedding space to approximate Gaussian properties, thereby reducing its susceptibility to adversarial perturbations while preserving semantic relationships. This transformation aligns embedding distributions in a way that minimises the impact of adversarial noise on decision boundaries, enhancing robustness without requiring adversarial examples or costly training-time augmentation. Evaluations on eight benchmark datasets show that our approach improves adversarial robustness while maintaining comparable before-attack accuracy to baselines, achieving a balanced trade-off between robustness and generalisation.
Chinese Summary: 本研究提出一种简单的附加模块,通过移除实例级主成分来增强预训练语言模型的对抗鲁棒性,无需依赖传统对抗防御或扰动原始数据即可实现有效防护。
English Summary: This study introduces a simple add-on module that enhances the adversarial robustness of pre-trained language models by removing instance-level principal components, achieving improved defense without relying on conventional adversarial training or perturbing data.
Authors:Yixin Liu, Guibin Zhang, Kun Wang, Shiyuan Li, Shirui Pan
Abstract:
Autonomous agents based on large language models (LLMs) have demonstrated impressive capabilities in a wide range of applications, including web navigation, software development, and embodied control. While most LLMs are limited in several key agentic procedures, such as reliable planning, long-term memory, tool management, and multi-agent coordination, graphs can serve as a powerful auxiliary structure to enhance structure, continuity, and coordination in complex agent workflows. Given the rapid growth and fragmentation of research on Graph-augmented LLM Agents (GLA), this paper offers a timely and comprehensive overview of recent advances and also highlights key directions for future work. Specifically, we categorize existing GLA methods by their primary functions in LLM agent systems, including planning, memory, and tool usage, and then analyze how graphs and graph learning algorithms contribute to each. For multi-agent systems, we further discuss how GLA solutions facilitate the orchestration, efficiency optimization, and trustworthiness of MAS. Finally, we highlight key future directions to advance this field, from improving structural adaptability to enabling unified, scalable, and multimodal GLA systems. We hope this paper can serve as a roadmap for future research on GLA and foster a deeper understanding of the role of graphs in LLM agent systems.
中文: 本文全面综述了图增强大语言模型智能体(GLA)的研究进展,分析了图结构在智能体规划、记忆和工具使用中的增强作用,并探讨了GLA在多智能体系统中的应用及未来发展方向。
English: This paper provides a comprehensive overview of Graph-augmented LLM Agents (GLA), analyzing how graphs enhance agent capabilities in planning, memory, and tool usage while also discussing their role in multi-agent systems and outlining future research directions.
Authors:Chao Wang, Kejiang Chen, Zijin Yang, Yaofei Wang, Weiming Zhang
Abstract:
The rapid advancement of image-generation technologies has made it possible for anyone to create photorealistic images using generative models, raising significant security concerns. To mitigate malicious use, tracing the origin of such images is essential. Reconstruction-based attribution methods offer a promising solution, but they often suffer from reduced accuracy and high computational costs when applied to state-of-the-art (SOTA) models. To address these challenges, we propose AEDR (AutoEncoder Double-Reconstruction), a novel training-free attribution method designed for generative models with continuous autoencoders. Unlike existing reconstruction-based approaches that rely on the value of a single reconstruction loss, AEDR performs two consecutive reconstructions using the model's autoencoder, and adopts the ratio of these two reconstruction losses as the attribution signal. This signal is further calibrated using the image homogeneity metric to improve accuracy, which inherently cancels out absolute biases caused by image complexity, with autoencoder-based reconstruction ensuring superior computational efficiency. Experiments on eight top latent diffusion models show that AEDR achieves 25.5% higher attribution accuracy than existing reconstruction-based methods, while requiring only 1% of the computational time.
中文摘要:提出的AEDR方法通过双重重建损失比和同质性校准提升图像溯源精度,相比现有方法实现25.5%的准确率提升,同时计算时间减少99%。
English Summary: The proposed AEDR method enhances image attribution accuracy by using a double-reconstruction loss ratio and homogeneity calibration, achieving 25.5% higher accuracy with 99% less computational time compared to existing methods.
Authors:Junda Wu, Jessica Echterhoff, Kyungtae Han, Amr Abdelraouf, Rohit Gupta, Julian McAuley
Abstract:
Understanding a driver's behavior and intentions is important for potential risk assessment and early accident prevention. Safety and driver assistance systems can be tailored to individual drivers' behavior, significantly enhancing their effectiveness. However, existing datasets are limited in describing and explaining general vehicle movements based on external visual evidence. This paper introduces a benchmark, PDB-Eval, for a detailed understanding of Personalized Driver Behavior, and aligning Large Multimodal Models (MLLMs) with driving comprehension and reasoning. Our benchmark consists of two main components, PDB-X and PDB-QA. PDB-X can evaluate MLLMs' understanding of temporal driving scenes. Our dataset is designed to find valid visual evidence from the external view to explain the driver's behavior from the internal view. To align MLLMs' reasoning abilities with driving tasks, we propose PDB-QA as a visual explanation question-answering task for MLLM instruction fine-tuning. As a generic learning task for generative models like MLLMs, PDB-QA can bridge the domain gap without harming MLLMs' generalizability. Our evaluation indicates that fine-tuning MLLMs on fine-grained descriptions and explanations can effectively bridge the gap between MLLMs and the driving domain, which improves zero-shot performance on question-answering tasks by up to 73.2%. We further evaluate the MLLMs fine-tuned on PDB-X in Brain4Cars' intention prediction and AIDE's recognition tasks. We observe up to 12.5% performance improvements on the turn intention prediction task in Brain4Cars, and consistent performance improvements up to 11.0% on all tasks in AIDE.
中文摘要:本文提出PDB-Eval基准,通过视觉推理增强大语言模型对个性化驾驶行为的理解,实验表明微调后模型在驾驶任务中的性能获得显著提升。
English Summary: This paper introduces PDB-Eval, a benchmark for enhancing Large Multimodal Models' understanding of personalized driver behavior through visual reasoning, demonstrating significant performance improvements in driving-related tasks after fine-tuning.
Authors:Anna Ivagnes, Toby van Gastelen, Syver Døving Agdestein, Benjamin Sanderse, Giovanni Stabile, Gianluigi Rozza
Abstract:
We present a novel approach to define the filter and relax steps in the evolve-filter-relax (EFR) framework for simulating turbulent flows. The EFR main advantages are its ease of implementation and computational efficiency. However, as it only contains two parameters (one for the filter step and one for the relax step) its flexibility is rather limited. In this work, we propose a data-driven approach in which the optimal filter is found based on DNS data in the frequency domain. The optimization step is computationally efficient and only involves one-dimensional least-squares problems for each wavenumber. Across both decaying turbulence and Kolmogorov flow, our learned filter decisively outperforms the standard differential filter and the Smagorinsky model, yielding significantly improved accuracy in energy spectra and in the temporal evolution of both energy and enstrophy. In addition, the relax parameter is determined by requiring energy and/or enstrophy conservation, which enforces stability of the method and reduces the appearance of numerical wiggles, especially when the filter is built in scarce data regimes. Applying the learned filter is also more computationally efficient compared to traditional differential filters, as it circumvents solving a linear system.
中文: 本研究提出了一种数据驱动方法,利用直接数值模拟数据优化演化-过滤-松弛框架中的过滤器,在提高湍流模拟精度和计算效率的同时,通过守恒原理确保数值稳定性。
English: This study introduces a data-driven method to optimize the filter in the evolve-filter-relax framework using DNS data, enhancing accuracy and computational efficiency in turbulent flow simulations while ensuring stability through conservation principles.
Authors:Hao Gu, Rui Zhong, Yu Xia, Wei Yang, Chi Lu, Peng Jiang, Kun Gai
Abstract:
Harnessing Large Language Models (LLMs) for recommendation systems has emerged as a prominent avenue, drawing substantial research interest. However, existing approaches primarily involve basic prompt techniques for knowledge acquisition, which resemble System-1 thinking. This makes these methods highly sensitive to errors in the reasoning path, where even a small mistake can lead to an incorrect inference. To this end, in this paper, we propose $R^{4}$ec, a reasoning, reflection and refinement framework that evolves the recommendation system into a weak System-2 model. Specifically, we introduce two models: an actor model that engages in reasoning, and a reflection model that judges these responses and provides valuable feedback. Then the actor model will refine its response based on the feedback, ultimately leading to improved responses. We employ an iterative reflection and refinement process, enabling LLMs to facilitate slow and deliberate System-2-like thinking. Ultimately, the final refined knowledge will be incorporated into a recommendation backbone for prediction. We conduct extensive experiments on Amazon-Book and MovieLens-1M datasets to demonstrate the superiority of $R^{4}$ec. We also deploy $R^{4}$ec on a large scale online advertising platform, showing 2.2\% increase of revenue. Furthermore, we investigate the scaling properties of the actor model and reflection model.
中文摘要:本文提出$R^{4}$ec框架,通过推理、反思和优化机制将推荐系统升级为弱系统-2模型,采用迭代反馈提升回答质量,实验证明其优越性并在在线广告平台实现2.2%的收益增长。
English Summary: The paper introduces $R^{4}$ec, a reasoning, reflection, and refinement framework that enhances recommendation systems by evolving LLMs into weak System-2 models through iterative feedback, demonstrating improved performance in experiments and a 2.2% revenue increase in online deployment.
Authors:Chuanhao Yan, Fengdi Che, Xuhan Huang, Xu Xu, Xin Li, Yizhi Li, Xingwei Qu, Jingzhe Shi, Zhuangzhuang He, Chenghua Lin, Yaodong Yang, Binhang Yuan, Hang Zhao, Yu Qiao, Bowen Zhou, Jie Fu
Abstract:
Existing informal language-based (e.g., human language) Large Language Models (LLMs) trained with Reinforcement Learning (RL) face a significant challenge: their verification processes, which provide crucial training signals, are neither reliable nor scalable. In fact, the prevalent large proprietary models could hardly generate verifiable programs. A promising yet largely uncharted alternative is formal language-based reasoning. Grounding LLMs in rigorous formal systems where generative models operate in formal language spaces (e.g., Dafny) enables the automatic and mathematically provable verification of their reasoning processes and outcomes. This capability is pivotal for achieving large-scale, reliable formal software verification. It is a common practice to employ human-annotated chain-of-thought and other human priors to induce the reasoning and coding capabilities of LLMs. Unfortunately, it becomes unacceptably all-consuming to provide such priors for supervising complex programming tasks. In this work, we systematically explore ways to reduce human priors with the formal language, Dafny, as the main environment for our pilot study. Our pipeline mainly relies on introducing an automatic and scalable data curation pipeline, and careful RL designs integrated with feedback from the formal language verifier. We introduce DafnyComp, a benchmark of compositional formal programs with auto-formalized specifications for specification reasoning. Our supervised fine-tuning (SFT) stage enables even small models (e.g., 0.5B) to generate syntactically valid and verifiable Dafny code, surpassing proprietary models. RL with regularization further improves performance, achieving stronger generalization to out-of-domain tasks and outperforming all strong baselines on the challenging DafnyComp benchmark.
中文摘要:现有基于非正式语言的大型语言模型面临验证不可靠且难以扩展的挑战,而将其建立在如Dafny等形式化语言基础上,能实现自动化的数学可验证推理,并通过专门训练流程使小模型在形式化编程任务中超越大型专有模型。
English Summary: Current large language models struggle with unreliable and unscalable verification in informal language settings, but grounding them in formal languages like Dafny enables automatic, mathematically provable verification and improves performance even for small models through specialized training pipelines.
Authors:Yulin Chen, Haoran Li, Yuexin Li, Yue Liu, Yangqiu Song, Bryan Hooi
Abstract:
Large language models (LLMs) have shown remarkable performance across a range of NLP tasks. However, their strong instruction-following capabilities and inability to distinguish instructions from data content make them vulnerable to indirect prompt injection attacks. In such attacks, instructions with malicious purposes are injected into external data sources, such as web documents. When LLMs retrieve this injected data through tools, such as a search engine and execute the injected instructions, they provide misled responses. Recent attack methods have demonstrated potential, but their abrupt instruction injection often undermines their effectiveness. Motivated by the limitations of existing attack methods, we propose TopicAttack, which prompts the LLM to generate a fabricated conversational transition prompt that gradually shifts the topic toward the injected instruction, making the injection smoother and enhancing the plausibility and success of the attack. Through comprehensive experiments, TopicAttack achieves state-of-the-art performance, with an attack success rate (ASR) over 90\% in most cases, even when various defense methods are applied. We further analyze its effectiveness by examining attention scores. We find that a higher injected-to-original attention ratio leads to a greater success probability, and our method achieves a much higher ratio than the baseline methods.
中文: 大型语言模型易受间接提示注入攻击,而提出的TopicAttack方法通过平滑转换话题增强了攻击效果,即使在防御措施下成功率仍超过90%。
English: Large language models are vulnerable to indirect prompt injection attacks, and the proposed TopicAttack method enhances attack success by smoothly transitioning topics, achieving over 90% success rate even against defenses.
Authors:Yulin Chen, Haoran Li, Yuexin Li, Yue Liu, Yangqiu Song, Bryan Hooi
Abstract:
Large language models (LLMs) have shown remarkable performance across a range of NLP tasks. However, their strong instruction-following capabilities and inability to distinguish instructions from data content make them vulnerable to indirect prompt injection attacks. In such attacks, instructions with malicious purposes are injected into external data sources, such as web documents. When LLMs retrieve this injected data through tools, such as a search engine and execute the injected instructions, they provide misled responses. Recent attack methods have demonstrated potential, but their abrupt instruction injection often undermines their effectiveness. Motivated by the limitations of existing attack methods, we propose TopicAttack, which prompts the LLM to generate a fabricated conversational transition prompt that gradually shifts the topic toward the injected instruction, making the injection smoother and enhancing the plausibility and success of the attack. Through comprehensive experiments, TopicAttack achieves state-of-the-art performance, with an attack success rate (ASR) over 90\% in most cases, even when various defense methods are applied. We further analyze its effectiveness by examining attention scores. We find that a higher injected-to-original attention ratio leads to a greater success probability, and our method achieves a much higher ratio than the baseline methods.
中文: 大型语言模型易受间接提示注入攻击,而提出的TopicAttack方法通过平滑转换话题增强了攻击效果,即使在防御措施下成功率仍超过90%。
English: Large language models are vulnerable to indirect prompt injection attacks, and the proposed TopicAttack method enhances attack success by smoothly transitioning topics, achieving over 90% success rate even against defenses.
Authors:Suorong Yang, Peijia Li, Yujie Liu, Zhiming Xu, Peng Ye, Wanli Ouyang, Furao Shen, Dongzhan Zhou
Abstract:
Modern deep models are trained on large real-world datasets, where data quality varies and redundancy is common. Data-centric approaches such as dataset pruning have shown promise in improving training efficiency and model performance. However, most existing methods rely on static heuristics or task-specific metrics, limiting their robustness and generalizability across domains. In this work, we introduce a dynamic dataset pruning framework that adaptively selects training samples based on both task-driven difficulty and cross-modality semantic consistency. By incorporating supervision from pretrained multimodal foundation models, our approach captures training dynamics while effectively filtering out uninformative samples. Our work highlights the potential of integrating cross-modality alignment for robust sample selection, advancing data-centric learning toward more efficient and robust practices across application domains.
中文摘要:本文提出了一种动态数据集剪枝框架,通过结合任务驱动的难度评估和跨模态语义一致性来自适应选择训练样本,利用预训练多模态模型提升跨领域训练的效率和鲁棒性。
English Summary: This paper introduces a dynamic dataset pruning framework that adaptively selects training samples using task-driven difficulty and cross-modality semantic consistency, leveraging pretrained multimodal models to improve training efficiency and robustness across domains.
Authors:Zhaohong Huang, Yuxin Zhang, Jingjing Xie, Fei Chao, Rongrong Ji
Abstract:
Recent advances in test-time adaptation (TTA) for Vision-Language Models (VLMs) have garnered increasing attention, particularly through the use of multiple augmented views of a single image to boost zero-shot generalization. Unfortunately, existing methods fail to strike a satisfactory balance between performance and efficiency, either due to excessive overhead of tuning text prompts or unstable benefits from handcrafted, training-free visual feature enhancement. In this paper, we present Global-Spatial Bias Learner (GS-Bias), an efficient and effective TTA paradigm that incorporates two learnable biases during TTA, unfolded as the global bias and spatial bias. Particularly, the global bias captures the global semantic features of a test image by learning consistency across augmented views, while spatial bias learns the semantic coherence between regions in the image's spatial visual representation. It is worth highlighting that these two sets of biases are directly added to the logits outputed by the pretrained VLMs, which circumvent the full backpropagation through VLM that hinders the efficiency of existing TTA methods. This endows GS-Bias with extremely high efficiency while achieving state-of-the-art performance on 15 benchmark datasets. For example, it achieves a 2.23% improvement over TPT in cross-dataset generalization and a 2.72% improvement in domain generalization, while requiring only 6.5% of TPT's memory usage on ImageNet.
Chinese: GS-Bias方法通过引入全局和空间偏置,高效地提升了视觉语言模型的测试时适应能力,在15个基准数据集上实现最优性能且资源消耗极低。
English: The GS-Bias method introduces global and spatial biases to efficiently enhance Vision-Language Models' test-time adaptation, achieving top performance across 15 benchmarks with minimal resource usage.
Authors:Ozer Can Devecioglu, Serkan Kiranyaz, Mehmet Yamac, Moncef Gabbouj
Abstract:
The wide range of deformation artifacts that arise from complex light propagation, scattering, and depth-dependent attenuation makes the underwater image restoration to remain a challenging problem. Like other single deep regressor networks, conventional GAN-based restoration methods struggle to perform well across this heterogeneous domain, since a single generator network is typically insufficient to capture the full range of visual degradations. In order to overcome this limitation, we propose xOp-GAN, a novel GAN model with several expert generator networks, each trained solely on a particular subset with a certain image quality. Thus, each generator can learn to maximize its restoration performance for a particular quality range. Once a xOp-GAN is trained, each generator can restore the input image and the best restored image can then be selected by the discriminator based on its perceptual confidence score. As a result, xOP-GAN is the first GAN model with multiple generators where the discriminator is being used during the inference of the regression task. Experimental results on benchmark Large Scale Underwater Image (LSUI) dataset demonstrates that xOp-GAN achieves PSNR levels up to 25.16 dB, surpassing all single-regressor models by a large margin even, with reduced complexity.
Chinese: 提出的xOp-GAN模型采用多个针对特定图像质量子集训练的专家生成器,允许判别器在推理过程中选择最佳修复图像,在LSUI数据集上以高达25.16 dB的PSNR实现了水下图像修复的卓越性能。
English: The proposed xOp-GAN model utilizes multiple expert generators trained on specific image quality subsets, enabling the discriminator to select the best restored image during inference and achieving superior performance on underwater image restoration with up to 25.16 dB PSNR on the LSUI dataset.
Authors:Shiyi Yang, Xiaoxue Yu, Rongpeng Li, Jianhang Zhu, Zhifeng Zhao, Honggang Zhang
Abstract:
Operating Large Language Models (LLMs) on edge devices is increasingly challenged by limited communication bandwidth and strained computational and memory costs. Thus, cloud-assisted remote fine-tuning becomes indispensable. Nevertheless, existing Low-Rank Adaptation (LoRA) approaches typically employ fixed or heuristic rank configurations, and the subsequent over-the-air transmission of all LoRA parameters could be rather inefficient. To address this limitation, we develop AirLLM, a hierarchical diffusion policy framework for communication-aware LoRA adaptation. Specifically, AirLLM models the rank configuration as a structured action vector that spans all LoRA-inserted projections. To solve the underlying high-dimensional sequential decision-making problem, a Proximal Policy Optimization (PPO) agent generates coarse-grained decisions by jointly observing wireless states and linguistic complexity, which are then refined via Denoising Diffusion Implicit Models (DDIM) to produce high-resolution, task- and channel-adaptive rank vectors. The two modules are optimized alternatively, with the DDIM trained under the Classifier-Free Guidance (CFG) paradigm to maintain alignment with PPO rewards. Experiments under varying signal-to-noise ratios demonstrate that AirLLM consistently enhances fine-tuning performance while significantly reducing transmission costs, highlighting the effectiveness of reinforcement-driven, diffusion-refined rank adaptation for scalable and efficient remote fine-tuning over the air.
中文摘要:AirLLM提出了一种分层扩散策略框架,通过结合强化学习与扩散模型动态优化低秩适配的秩配置,在显著降低传输成本的同时持续提升边缘设备上的远程微调性能。
English Summary: AirLLM introduces a hierarchical diffusion policy framework that optimizes Low-Rank Adaptation (LoRA) for edge devices by combining reinforcement learning with diffusion models to dynamically adjust rank configurations, significantly improving fine-tuning performance while reducing transmission costs.
Authors:Junda Wu, Yuxin Xiong, Xintong Li, Zhengmian Hu, Tong Yu, Rui Wang, Xiang Chen, Jingbo Shang, Julian McAuley
Abstract:
Chain-of-thought (CoT) reasoning enables large language models (LLMs) to break down complex problems into interpretable intermediate steps, significantly enhancing model transparency and performance in reasoning tasks. However, conventional CoT methods rely on heuristic sampling without structured modeling of reasoning transitions, constraining their ability to systematically explore and discover diverse and effective reasoning trajectories. In this work, we introduce CTRLS, a framework that formulates CoT reasoning as a Markov decision process (MDP) with latent state transitions, enabling principled and state-aware exploration via distributional reinforcement learning. By modelling reasoning actions as explicit probability distributions in latent space, our approach explicitly models epistemic uncertainty, facilitating robust exploration of the reasoning space. As part of our framework, we introduce an on-policy reinforcement learning strategy incorporating epsilon-greedy exploration and entropy-based regularization to iteratively refine latent state transitions without requiring additional fine-tuning of the underlying LLM. Theoretical analyses provide evidence lower bounds (ELBO), theoretically grounding our transition-aware modeling of latent reasoning dynamics. Further experiments demonstrate improvements in reasoning accuracy, diversity, and exploration efficiency across benchmark reasoning tasks.
中文:CTRLS框架将思维链推理建模为具有潜在状态转换的马尔可夫决策过程,通过分布式强化学习实现系统性探索,从而提升推理的准确性和多样性。
English: The CTRLS framework models chain-of-thought reasoning as a Markov decision process with latent state transitions, enabling principled exploration through distributional reinforcement learning to improve reasoning accuracy and diversity.
Authors:Shuai Zhao, Yulin Zhang, Luwei Xiao, Xinyi Wu, Yanhao Jia, Zhongliang Guo, Xiaobao Wu, Cong-Duy Nguyen, Guoming Zhang, Anh Tuan Luu
Abstract:
Despite the remarkable progress of large language models (LLMs) across various domains, their capacity to predict retinopathy of prematurity (ROP) risk remains largely unexplored. To address this gap, we introduce a novel Chinese benchmark dataset, termed CROP, comprising 993 admission records annotated with low, medium, and high-risk labels. To systematically examine the predictive capabilities and affective biases of LLMs in ROP risk stratification, we propose Affective-ROPTester, an automated evaluation framework incorporating three prompting strategies: Instruction-based, Chain-of-Thought (CoT), and In-Context Learning (ICL). The Instruction scheme assesses LLMs' intrinsic knowledge and associated biases, whereas the CoT and ICL schemes leverage external medical knowledge to enhance predictive accuracy. Crucially, we integrate emotional elements at the prompt level to investigate how different affective framings influence the model's ability to predict ROP and its bias patterns. Empirical results derived from the CROP dataset yield two principal observations. First, LLMs demonstrate limited efficacy in ROP risk prediction when operating solely on intrinsic knowledge, yet exhibit marked performance gains when augmented with structured external inputs. Second, affective biases are evident in the model outputs, with a consistent inclination toward overestimating medium- and high-risk cases. Third, compared to negative emotions, positive emotional framing contributes to mitigating predictive bias in model outputs. These findings highlight the critical role of affect-sensitive prompt engineering in enhancing diagnostic reliability and emphasize the utility of Affective-ROPTester as a framework for evaluating and mitigating affective bias in clinical language modeling systems.
中文摘要:本研究通过CROP数据集和Affective-ROPTester框架评估大语言模型对早产儿视网膜病变风险的预测能力,发现外部知识可提升模型性能,而积极情绪提示能有效减轻预测偏差。
English Summary: This study introduces the CROP dataset and Affective-ROPTester framework to evaluate LLMs' ROP risk prediction, finding that external knowledge boosts performance while positive emotional framing reduces predictive bias.
Authors:Ziyao Wang, Rongpeng Li, Sizhao Li, Yuming Xiang, Haiping Wang, Zhifeng Zhao, Honggang Zhang
Abstract:
Intelligent control of Unmanned Aerial Vehicles (UAVs) swarms has emerged as a critical research focus, and it typically requires the swarm to navigate effectively while avoiding obstacles and achieving continuous coverage over multiple mission targets. Although traditional Multi-Agent Reinforcement Learning (MARL) approaches offer dynamic adaptability, they are hindered by the semantic gap in numerical communication and the rigidity of homogeneous role structures, resulting in poor generalization and limited task scalability. Recent advances in Large Language Model (LLM)-based control frameworks demonstrate strong semantic reasoning capabilities by leveraging extensive prior knowledge. However, due to the lack of online learning and over-reliance on static priors, these works often struggle with effective exploration, leading to reduced individual potential and overall system performance. To address these limitations, we propose a Role-Adaptive LLM-Driven Yoked navigation algorithm RALLY. Specifically, we first develop an LLM-driven semantic decision framework that uses structured natural language for efficient semantic communication and collaborative reasoning. Afterward, we introduce a dynamic role-heterogeneity mechanism for adaptive role switching and personalized decision-making. Furthermore, we propose a Role-value Mixing Network (RMIX)-based assignment strategy that integrates LLM offline priors with MARL online policies to enable semi-offline training of role selection strategies. Experiments in the Multi-Agent Particle Environment (MPE) environment and a Software-In-The-Loop (SITL) platform demonstrate that RALLY outperforms conventional approaches in terms of task coverage, convergence speed, and generalization, highlighting its strong potential for collaborative navigation in agentic multi-UAV systems.
中文: RALLY算法将大语言模型的语义推理与多智能体强化学习相结合,通过动态角色切换和半离线训练机制,在多无人机协同导航中实现了更优的任务覆盖率、更快收敛速度和更强泛化能力。
English: The proposed RALLY algorithm integrates Large Language Model semantic reasoning with Multi-Agent Reinforcement Learning to enable adaptive role-switching and semi-offline training, demonstrating superior performance in UAV swarm navigation through enhanced task coverage, faster convergence, and better generalization.
Authors:Xuewei Tang, Mengmeng Yang, Tuopu Wen, Peijin Jia, Le Cui, Mingshang Luo, Kehua Sheng, Bo Zhang, Diange Yang, Kun Jiang
Abstract:
With the growing interest in autonomous driving, there is an increasing demand for accurate and reliable road perception technologies. In complex environments without high-definition map support, autonomous vehicles must independently interpret their surroundings to ensure safe and robust decision-making. However, these scenarios pose significant challenges due to the large number, complex geometries, and frequent occlusions of road elements. A key limitation of existing approaches lies in their insufficient exploitation of the structured priors inherently present in road elements, resulting in irregular, inaccurate predictions. To address this, we propose PriorFusion, a unified framework that effectively integrates semantic, geometric, and generative priors to enhance road element perception. We introduce an instance-aware attention mechanism guided by shape-prior features, then construct a data-driven shape template space that encodes low-dimensional representations of road elements, enabling clustering to generate anchor points as reference priors. We design a diffusion-based framework that leverages these prior anchors to generate accurate and complete predictions. Experiments on large-scale autonomous driving datasets demonstrate that our method significantly improves perception accuracy, particularly under challenging conditions. Visualization results further confirm that our approach produces more accurate, regular, and coherent predictions of road elements.
中文摘要:提出的PriorFusion框架通过实例感知注意力和基于扩散的生成方法,有效融合语义、几何和生成先验,在复杂驾驶环境下显著提升了道路要素感知的准确性与规则性。
English Summary: The proposed PriorFusion framework integrates semantic, geometric, and generative priors through instance-aware attention and diffusion-based generation to significantly improve road element perception accuracy in autonomous driving under challenging conditions.
Authors:Qiujie Xie, Yixuan Weng, Minjun Zhu, Fuchen Shen, Shulin Huang, Zhen Lin, Jiahui Zhou, Zilan Mao, Zijie Yang, Linyi Yang, Jian Wu, Yue Zhang
Abstract:
The emergence of large language models (LLMs) is propelling automated scientific discovery to the next level, with LLM-based Artificial Intelligence (AI) Scientist systems now taking the lead in scientific research. Several influential works have already appeared in the field of AI Scientist systems, with AI-generated research papers having been accepted at the ICLR 2025 workshop, suggesting that a human-level AI Scientist capable of uncovering phenomena previously unknown to humans, may soon become a reality. In this survey, we focus on the central question: How far are AI scientists from changing the world and reshaping the scientific research paradigm? To answer this question, we provide a prospect-driven review that comprehensively analyzes the current achievements of AI Scientist systems, identifying key bottlenecks and the critical components required for the emergence of a scientific agent capable of producing ground-breaking discoveries that solve grand challenges. We hope this survey will contribute to a clearer understanding of limitations of current AI Scientist systems, showing where we are, what is missing, and what the ultimate goals for scientific AI should be.
中文: 大语言模型正推动AI科学家系统迈向人类水平的科学发现,本综述通过评估其进展来明确关键瓶颈与实现突破性科研的终极目标。
English: Large language models are advancing AI Scientist systems that may soon achieve human-level scientific discovery, and this survey reviews their progress to identify key challenges and goals for transformative breakthroughs.
Authors:Yanheng Hou, Xunkai Li, Zhenjun Li, Bing Zhou, Ronghua Li, Guoren Wang
Abstract:
In recent years, Hypergraph Neural Networks (HNNs) have demonstrated immense potential in handling complex systems with high-order interactions. However, acquiring large-scale, high-quality labeled data for these models is costly, making Active Learning (AL) a critical technique. Existing Graph Active Learning (GAL) methods, when applied to hypergraphs, often rely on techniques like "clique expansion," which destroys the high-order structural information crucial to a hypergraph's success, thereby leading to suboptimal performance. To address this challenge, we introduce HIAL (Hypergraph Active Learning), a native active learning framework designed specifically for hypergraphs. We innovatively reformulate the Hypergraph Active Learning (HAL) problem as an Influence Maximization task. The core of HIAL is a dual-perspective influence function that, based on our novel "High-Order Interaction-Aware (HOI-Aware)" propagation mechanism, synergistically evaluates a node's feature-space coverage (via Magnitude of Influence, MoI) and its topological influence (via Expected Diffusion Value, EDV). We prove that this objective function is monotone and submodular, thus enabling the use of an efficient greedy algorithm with a formal (1-1/e) approximation guarantee. Extensive experiments on seven public datasets demonstrate that HIAL significantly outperforms state-of-the-art baselines in terms of performance, efficiency, generality, and robustness, establishing an efficient and powerful new paradigm for active learning on hypergraphs.
中文: HIAL提出了一种创新的超图主动学习框架,将问题重构为影响力最大化,通过双视角影响力函数协同评估节点的特征覆盖与拓扑影响,在保证理论近似比的同时实现了卓越性能。
English: HIAL introduces a novel hypergraph active learning framework that reformulates the problem as influence maximization, utilizing a dual-perspective influence function to synergistically evaluate nodes' feature and topological impacts, achieving superior performance with proven theoretical guarantees.
Authors:StepFun, :, Bin Wang, Bojun Wang, Changyi Wan, Guanzhe Huang, Hanpeng Hu, Haonan Jia, Hao Nie, Mingliang Li, Nuo Chen, Siyu Chen, Song Yuan, Wuxun Xie, Xiaoniu Song, Xing Chen, Xingping Yang, Xuelin Zhang, Yanbo Yu, Yaoyu Wang, Yibo Zhu, Yimin Jiang, Yu Zhou, Yuanwei Lu, Houyi Li, Jingcheng Hu, Ka Man Lo, Ailin Huang, Binxing Jiao, Bo Li, Boyu Chen, Changxin Miao, Chang Lou, Chen Hu, Chen Xu, Chenfeng Yu, Chengyuan Yao, Daokuan Lv, Dapeng Shi, Deshan Sun, Ding Huang, Dingyuan Hu, Dongqing Pang, Enle Liu, Fajie Zhang, Fanqi Wan, Gulin Yan, Han Zhang, Han Zhou, Hanghao Wu, Hangyu Guo, Hanqi Chen, Hanshan Zhang, Hao Wu, Haocheng Zhang, Haolong Yan, Haoran Lv, Haoran Wei, Hebin Zhou, Heng Wang, Heng Wang, Hongxin Li, Hongyu Zhou, Hongyuan Wang, Huiyong Guo, Jia Wang, Jiahao Gong, Jialing Xie, Jian Zhou, Jianjian Sun, Jiaoren Wu, Jiaran Zhang, Jiayu Liu, Jie Cheng, Jie Luo, Jie Yan, Jie Yang, Jieyi Hou, Jinguang Zhang, Jinlan Cao, Jisheng Yin, Junfeng Liu, Junhao Huang, Junzhe Lin, Kaijun Tan, Kaixiang Li, Kang An, Kangheng Lin, Kenkun Liu, Lei Yang, Liang Zhao, Liangyu Chen, Lieyu Shi, Liguo Tan, Lin Lin, Lin Zhang, Lina Chen, Liwen Huang, Liying Shi, Longlong Gu, Mei Chen, Mengqiang Ren, Ming Li, Mingzhe Chen, Na Wang, Nan Wu, Qi Han, Qian Zhao, Qiang Zhang, Qianni Liu, Qiaohui Chen, Qiling Wu, Qinglin He, Qinyuan Tan, Qiufeng Wang, Qiuping Wu, Qiuyan Liang, Quan Sun, Rui Li, Ruihang Miao, Ruosi Wan, Ruyan Guo, Shangwu Zhong, Shaoliang Pang, Shengjie Fan, Shijie Shang, Shilei Jiang, Shiliang Yang, Shiming Hao, Shuli Gao, Siming Huang, Siqi Liu, Tiancheng Cao, Tianhao Cheng, Tianhao Peng, Wang You, Wei Ji, Wen Sun, Wenjin Deng, Wenqing He, Wenzhen Zheng, Xi Chen, Xiangwen Kong, Xianzhen Luo, Xiaobo Yang, Xiaojia Liu, Xiaoxiao Ren, Xin Han, Xin Li, Xin Wu, Xu Zhao, Yanan Wei, Yang Li, Yangguang Li, Yangshijie Xu, Yanming Xu, Yaqiang Shi, Yeqing Shen, Yi Yang, Yifei Yang, Yifeng Gong, Yihan Chen, Yijing Yang, Yinmin Zhang, Yizhuang Zhou, Yuanhao Ding, Yuantao Fan, Yuanzhen Yang, Yuchu Luo, Yue Peng, Yufan Lu, Yuhang Deng, Yuhe Yin, Yujie Liu, Yukun Chen, Yuling Zhao, Yun Mou, Yunlong Li, Yunzhou Ju, Yusheng Li, Yuxiang Yang, Yuxiang Zhang, Yuyang Chen, Zejia Weng, Zhe Xie, Zheng Ge, Zheng Gong, Zhenyi Lu, Zhewei Huang, Zhichao Chang, Zhiguo Huang, Zhirui Wang, Zidong Yang, Zili Wang, Ziqi Wang, Zixin Zhang, Binxing Jiao, Daxin Jiang, Heung-Yeung Shum, Xiangyu Zhang
Abstract:
Large language models (LLMs) face low hardware efficiency during decoding, especially for long-context reasoning tasks. This paper introduces Step-3, a 321B-parameter VLM with hardware-aware model-system co-design optimized for minimizing decoding costs. Step-3 innovates in two key dimensions: (1) A novel Multi-Matrix Factorization Attention (MFA) mechanism that significantly reduces both KV cache size and computation while maintaining high attention expressiveness, and (2) Attention-FFN Disaggregation (AFD), a distributed inference system that decouples attention and Feed-Forward Network (FFN) layers into specialized subsystems. This co-design achieves unprecedented cost efficiency: Step-3 significantly reduces theoretical decoding costs compared with models like DeepSeek-V3 and Qwen3 MoE 235B, with the gains widening at longer context. Step-3 achieves low cost while activating 38B parameters per token (more than DeepSeek-V3 and Qwen3 MoE 235B), demonstrating that hardware-aligned attention arithmetic intensity, MoE sparsity, and AFD are critical to cost-effectiveness. We perform a head-to-head comparison with DeepSeek-V3 in its favorable scenarios. Our implementation on Hopper GPUs achieves a decoding throughput of up to 4,039 tokens per second per GPU under 50ms TPOT SLA (4K context, FP8, no MTP). It is higher than DeepSeek-V3's 2,324 in the same setup and sets a new Pareto frontier for LLM decoding.
中文: Step-3是一种3210亿参数的视觉语言模型,通过多矩阵分解注意力机制和注意力-前馈网络解耦的硬件协同设计,显著降低解码成本,并在相同设置下实现比同类模型更高的吞吐量。
English: Step-3 is a 321B-parameter vision-language model that introduces hardware-aware co-design with Multi-Matrix Factorization Attention and Attention-FFN Disaggregation to significantly reduce decoding costs while achieving higher throughput than comparable models.
Authors:Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, Junyang Lin
Abstract:
This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. We demonstrate that GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and has the potential for simplifying the design of RL infrastructure. These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models.
中文: 本文提出的分组序列策略优化(GSPO)算法通过序列级操作提升大语言模型的训练效率和稳定性,为Qwen3模型带来了显著性能提升。
English: This paper presents Group Sequence Policy Optimization (GSPO), a reinforcement learning algorithm that enhances training efficiency and stability for large language models by employing sequence-level operations, leading to significant performance gains in the Qwen3 models.
Authors:Qibing Bai, Sho Inoue, Shuai Wang, Zhongjie Jiang, Yannan Wang, Haizhou Li
Abstract:
Accent normalization converts foreign-accented speech into native-like speech while preserving speaker identity. We propose a novel pipeline using self-supervised discrete tokens and non-parallel training data. The system extracts tokens from source speech, converts them through a dedicated model, and synthesizes the output using flow matching. Our method demonstrates superior performance over a frame-to-frame baseline in naturalness, accentedness reduction, and timbre preservation across multiple English accents. Through token-level phonetic analysis, we validate the effectiveness of our token-based approach. We also develop two duration preservation methods, suitable for applications such as dubbing.
中文摘要:本研究提出了一种新颖的口音标准化流程,利用自监督离散标记和非平行数据,将带外国口音的语音转换为类似母语的语音并保持说话人身份,在自然度、口音减弱和音色保真方面优于基线方法。
English Summary: This study introduces an accent normalization pipeline that leverages self-supervised discrete tokens and non-parallel data to convert foreign-accented speech into native-like output while maintaining speaker identity, outperforming baselines in naturalness, accent reduction, and timbre preservation.
Authors:Senyao Li, Haozhao Wang, Wenchao Xu, Rui Zhang, Song Guo, Jingling Yuan, Xian Zhong, Tianwei Zhang, Ruixuan Li
Abstract:
As large language models (LLMs) evolve, deploying them solely in the cloud or compressing them for edge devices has become inadequate due to concerns about latency, privacy, cost, and personalization. This survey explores a collaborative paradigm in which cloud-based LLMs and edge-deployed small language models (SLMs) cooperate across both inference and training. We present a unified taxonomy of edge-cloud collaboration strategies. For inference, we categorize approaches into task assignment, task division, and mixture-based collaboration at both task and token granularity, encompassing adaptive scheduling, resource-aware offloading, speculative decoding, and modular routing. For training, we review distributed adaptation techniques, including parameter alignment, pruning, bidirectional distillation, and small-model-guided optimization. We further summarize datasets, benchmarks, and deployment cases, and highlight privacy-preserving methods and vertical applications. This survey provides the first systematic foundation for LLM-SLM collaboration, bridging system and algorithm co-design to enable efficient, scalable, and trustworthy edge-cloud intelligence.
中文总结:本综述提出了一种云端大语言模型与边缘小语言模型在推理和训练中协同合作的框架,通过系统分类协作策略,为解决边缘云智能中的延迟、隐私和效率问题提供了首个方法论基础。
English Summary: This survey introduces a collaborative framework where cloud-based large language models (LLMs) and edge-deployed small language models (SLMs) work together across inference and training, presenting a unified taxonomy of strategies to address latency, privacy, and efficiency concerns in edge-cloud intelligence.
Authors:Xuechen Liu, Wanying Ge, Xin Wang, Junichi Yamagishi
Abstract:
This study introduces LENS-DF, a novel and comprehensive recipe for training and evaluating audio deepfake detection and temporal localization under complicated and realistic audio conditions. The generation part of the recipe outputs audios from the input dataset with several critical characteristics, such as longer duration, noisy conditions, and containing multiple speakers, in a controllable fashion. The corresponding detection and localization protocol uses models. We conduct experiments based on self-supervised learning front-end and simple back-end. The results indicate that models trained using data generated with LENS-DF consistently outperform those trained via conventional recipes, demonstrating the effectiveness and usefulness of LENS-DF for robust audio deepfake detection and localization. We also conduct ablation studies on the variations introduced, investigating their impact on and relevance to realistic challenges in the field.
中文: 本研究提出LENS-DF框架,通过可控生成包含长时长、噪声环境和多说话人等复杂特征的音频数据,在真实场景下的音频深度伪造检测与时间定位任务中持续优于传统方法。
English: This study presents LENS-DF, a comprehensive framework for training and evaluating audio deepfake detection and temporal localization under realistic conditions, which consistently outperforms conventional methods through controllable generation of complex audio characteristics.
Authors:Junjie Li, Wenxuan Wu, Shuai Wang, Zexu Pan, Kong Aik Lee, Helen Meng, Haizhou Li
Abstract:
Audio-visual Target Speaker Extraction (AV-TSE) aims to isolate a target speaker's voice from multi-speaker environments by leveraging visual cues as guidance. However, the performance of AV-TSE systems heavily relies on the quality of these visual cues. In extreme scenarios where visual cues are missing or severely degraded, the system may fail to accurately extract the target speaker. In contrast, humans can maintain attention on a target speaker even in the absence of explicit auxiliary information. Motivated by such human cognitive ability, we propose a novel framework called MeMo, which incorporates two adaptive memory banks to store attention-related information. MeMo is specifically designed for real-time scenarios: once initial attention is established, the system maintains attentional momentum over time, even when visual cues become unavailable. We conduct comprehensive experiments to verify the effectiveness of MeMo. Experimental results demonstrate that our proposed framework achieves SI-SNR improvements of at least 2 dB over the corresponding baseline.
Chinese: MeMo框架通过引入自适应记忆库,能够在视觉线索缺失或受损时持续追踪目标说话人,显著提升了视听语音提取系统的性能表现。
English: The MeMo framework introduces adaptive memory banks to maintain attention on target speakers in audio-visual extraction systems, achieving significant performance improvements even when visual cues are degraded or absent.
Authors:Jan Trienes, Anastasiia Derzhanskaia, Roland Schwarzkopf, Markus Mühling, Jörg Schlötterer, Christin Seifert
Abstract:
We present Marcel, a lightweight and open-source conversational agent designed to support prospective students with admission-related inquiries. The system aims to provide fast and personalized responses, while reducing workload of university staff. We employ retrieval-augmented generation to ground answers in university resources and to provide users with verifiable, contextually relevant information. To improve retrieval quality, we introduce an FAQ retriever that maps user questions to knowledge-base entries, allowing administrators to steer retrieval, and improving over standard dense/hybrid retrieval strategies. The system is engineered for easy deployment in resource-constrained academic settings. We detail the system architecture, provide a technical evaluation of its components, and report insights from a real-world deployment.
中文:Marcel是一款轻量级开源对话助手,采用检索增强生成技术和FAQ检索器,为意向学生提供快速个性化的入学咨询支持,减轻大学工作人员负担,专为资源有限的学术环境设计部署。
English: Marcel is a lightweight, open-source conversational agent that uses retrieval-augmented generation and a specialized FAQ retriever to provide prospective students with fast, personalized admission support while easing university staff workload, designed for easy deployment in resource-limited academic environments.
Authors:Lingzhe Zhang, Tong Jia, Mengxi Jia, Yifan Wu, Aiwei Liu, Yong Yang, Zhonghai Wu, Xuming Hu, Philip S. Yu, Ying Li
Abstract:
As large language models (LLMs) grow increasingly sophisticated and pervasive, their application to various Artificial Intelligence for IT Operations (AIOps) tasks has garnered significant attention. However, a comprehensive understanding of the impact, potential, and limitations of LLMs in AIOps remains in its infancy. To address this gap, we conducted a detailed survey of LLM4AIOps, focusing on how LLMs can optimize processes and improve outcomes in this domain. We analyzed 183 research papers published between January 2020 and December 2024 to answer four key research questions (RQs). In RQ1, we examine the diverse failure data sources utilized, including advanced LLM-based processing techniques for legacy data and the incorporation of new data sources enabled by LLMs. RQ2 explores the evolution of AIOps tasks, highlighting the emergence of novel tasks and the publication trends across these tasks. RQ3 investigates the various LLM-based methods applied to address AIOps challenges. Finally, RQ4 reviews evaluation methodologies tailored to assess LLM-integrated AIOps approaches. Based on our findings, we discuss the state-of-the-art advancements and trends, identify gaps in existing research, and propose promising directions for future exploration.
中文摘要:随着大语言模型在智能运维领域的广泛应用,其全面影响与潜力仍待深入探究,本文通过分析183篇文献系统梳理了数据源、任务演进、方法策略及评估体系,并指出了未来研究方向。
English Summary: Large language models are increasingly applied in AIOps, yet their comprehensive impact and potential remain underexplored, prompting a detailed survey of 183 studies to analyze data sources, task evolution, methodologies, and evaluation techniques while identifying future research directions.
Authors:Yi Zhao, Zuchao Li, Hai Zhao
Abstract:
LLMs encounter significant challenges in resource consumption nowadays, especially with long contexts. Despite extensive efforts dedicate to enhancing inference efficiency, these methods primarily exploit internal sparsity within the models, without leveraging external information for optimization. We identify the high similarity of attention matrices across different-scale LLMs, which offers a novel perspective for optimization. We first conduct a comprehensive analysis of how to measure similarity, how to select mapping Layers and whether mapping is consistency. Based on these insights, we introduce the IAM framework, which achieves dual benefits of accelerated attention computation and reduced KV cache usage by performing attention mapping between small and large LLMs. Our experimental results demonstrate that IAM can accelerate prefill by 15% and reduce KV cache usage by 22.1% without appreciably sacrificing performance. Experiments on different series of models show the generalizability of IAM. Importantly, it is also orthogonal to many existing KV cache optimization methods, making it a versatile addition to the current toolkit for enhancing LLM efficiency.
中文总结:IAM框架通过将小规模模型的注意力矩阵映射至大规模模型,实现了15%的预填充加速和22.1%的KV缓存减少,且不影响模型性能。
English Summary: The IAM framework accelerates large language models by mapping attention matrices from smaller models, achieving 15% faster prefill and 22.1% KV cache reduction without performance loss.
Authors:Yi Zhao, Zuchao Li, Hai Zhao, Baoyuan Qi, Guoming Liu
Abstract:
Task-agnostic prompt compression leverages the redundancy in natural language to reduce computational overhead and enhance information density within prompts, especially in long-context scenarios. Existing methods predominantly rely on information entropy as the metric to compress lexical units, aiming to achieve minimal information loss. However, these approaches overlook two critical aspects: (i) the importance of attention-critical tokens at the algorithmic level, and (ii) shifts in information entropy during the compression process. Motivated by these challenges, we propose a dynamic attention-aware approach for task-agnostic prompt compression (DAC). This approach effectively integrates entropy and attention information, dynamically sensing entropy shifts during compression to achieve fine-grained prompt compression. Extensive experiments across various domains, including LongBench, GSM8K, and BBH, show that DAC consistently yields robust and substantial improvements across a diverse range of tasks and LLMs, offering compelling evidence of its efficacy.
Chinese: 提出的动态注意力感知压缩(DAC)方法通过整合熵与注意力信息,动态感知压缩过程中的熵变化,实现了细粒度的提示压缩,并在多种任务和语言模型中展现出稳健的性能提升。
English: The proposed dynamic attention-aware compression (DAC) method integrates entropy and attention information to dynamically address entropy shifts, achieving fine-grained prompt compression with robust performance improvements across multiple tasks and language models.
Authors:Che Liu, Jiazhen Pan, Weixiang Shen, Wenjia Bai, Daniel Rueckert, Rossella Arcucci
Abstract:
Vision-Language Models (VLMs) trained on web-scale corpora excel at natural image tasks and are increasingly repurposed for healthcare; however, their competence in medical tasks remains underexplored. We present a comprehensive evaluation of open-source general-purpose and medically specialised VLMs, ranging from 3B to 72B parameters, across eight benchmarks: MedXpert, OmniMedVQA, PMC-VQA, PathVQA, MMMU, SLAKE, and VQA-RAD. To observe model performance across different aspects, we first separate it into understanding and reasoning components. Three salient findings emerge. First, large general-purpose models already match or surpass medical-specific counterparts on several benchmarks, demonstrating strong zero-shot transfer from natural to medical images. Second, reasoning performance is consistently lower than understanding, highlighting a critical barrier to safe decision support. Third, performance varies widely across benchmarks, reflecting differences in task design, annotation quality, and knowledge demands. No model yet reaches the reliability threshold for clinical deployment, underscoring the need for stronger multimodal alignment and more rigorous, fine-grained evaluation protocols.
中文: 通用视觉语言模型在部分医学任务上已能媲美甚至超越专业模型,但其推理能力不足且临床可靠性尚未达标,亟需加强多模态对齐与精细化评估。
English: General-purpose vision-language models can match or exceed specialized medical models in certain tasks but still fall short in reasoning reliability and clinical applicability, highlighting the need for improved multimodal alignment and evaluation.
Authors:Yichen Li, Xiuying Wang, Wenchao Xu, Haozhao Wang, Yining Qi, Jiahua Dong, Ruixuan Li
Abstract:
Model-Heterogeneous Federated Learning (Hetero-FL) has attracted growing attention for its ability to aggregate knowledge from heterogeneous models while keeping private data locally. To better aggregate knowledge from clients, ensemble distillation, as a widely used and effective technique, is often employed after global aggregation to enhance the performance of the global model. However, simply combining Hetero-FL and ensemble distillation does not always yield promising results and can make the training process unstable. The reason is that existing methods primarily focus on logit distillation, which, while being model-agnostic with softmax predictions, fails to compensate for the knowledge bias arising from heterogeneous models. To tackle this challenge, we propose a stable and efficient Feature Distillation for model-heterogeneous Federated learning, dubbed FedFD, that can incorporate aligned feature information via orthogonal projection to integrate knowledge from heterogeneous models better. Specifically, a new feature-based ensemble federated knowledge distillation paradigm is proposed. The global model on the server needs to maintain a projection layer for each client-side model architecture to align the features separately. Orthogonal techniques are employed to re-parameterize the projection layer to mitigate knowledge bias from heterogeneous models and thus maximize the distilled knowledge. Extensive experiments show that FedFD achieves superior performance compared to state-of-the-art methods.
中文:FedFD提出了一种新颖的模型异构联邦学习特征蒸馏方法,通过正交投影对齐特征并减少知识偏差,在性能上超越了现有先进方法。
English: FedFD introduces a novel feature distillation method for model-heterogeneous federated learning, using orthogonal projection to align features and mitigate knowledge bias, achieving superior performance over existing techniques.
Authors:Shulin Huang, Linyi Yang, Yue Zhang
Abstract:
Large language models exhibit cultural biases and limited cross-cultural understanding capabilities, particularly when serving diverse global user populations. We propose MCEval, a novel multilingual evaluation framework that employs dynamic cultural question construction and enables causal analysis through Counterfactual Rephrasing and Confounder Rephrasing. Our comprehensive evaluation spans 13 cultures and 13 languages, systematically assessing both cultural awareness and cultural bias across different linguistic scenarios. The framework provides 39,897 cultural awareness instances and 17,940 cultural bias instances. Experimental results reveal performance disparities across different linguistic scenarios, demonstrating that optimal cultural performance is not only linked to training data distribution, but also is related to language-culture alignment. The evaluation results also expose the fairness issue, where approaches appearing successful in the English scenario create substantial disadvantages. MCEval represents the first comprehensive multilingual cultural evaluation framework that provides deeper insights into LLMs' cultural understanding.
中文: MCEval作为首个全面的多语言文化评估框架,通过动态文化问题构建和因果分析,系统评估了13种文化和语言中大语言模型的文化认知与偏见,揭示了与训练数据及语言文化对齐相关的性能差异及公平性问题。
English: MCEval is a novel multilingual evaluation framework that assesses cultural awareness and bias in large language models across 13 cultures and languages, revealing performance disparities linked to training data and language-culture alignment while exposing fairness issues in cross-cultural scenarios.
Authors:Kevin Rojas, Ye He, Chieh-Hsin Lai, Yuta Takida, Yuki Mitsufuji, Molei Tao
Abstract:
Classifier-Free Guidance (CFG) is a widely used technique for conditional generation and improving sample quality in continuous diffusion models, and recent works have extended it to discrete diffusion. This paper theoretically analyzes CFG in the context of masked discrete diffusion, focusing on the role of guidance schedules. Our analysis shows that high guidance early in sampling (when inputs are heavily masked) harms generation quality, while late-stage guidance has a larger effect. These findings provide a theoretical explanation for empirical observations in recent studies on guidance schedules. The analysis also reveals an imperfection of the current CFG implementations. These implementations can unintentionally cause imbalanced transitions, such as unmasking too rapidly during the early stages of generation, which degrades the quality of the resulting samples. To address this, we draw insight from the analysis and propose a novel classifier-free guidance mechanism empirically applicable to any discrete diffusion. Intuitively, our method smoothens the transport between the data distribution and the initial (masked/uniform) distribution, which results in improved sample quality. Remarkably, our method is achievable via a simple one-line code change. The efficacy of our method is empirically demonstrated with experiments on ImageNet (masked discrete diffusion) and QM9 (uniform discrete diffusion).
中文摘要:本文理论分析了掩码离散扩散中的无分类器引导机制,发现早期高强度引导会损害生成质量而后期引导效果更显著,并提出一种简单的一行代码修改方法以平滑过渡并提升样本质量。
English Summary: This paper theoretically analyzes classifier-free guidance in masked discrete diffusion, revealing that early high guidance harms quality while late-stage guidance is more impactful, and proposes a simple one-line code modification to smooth transitions and improve sample quality.
Authors:Jingjing Tang, Xin Wang, Zhe Zhang, Junichi Yamagishi, Geraint Wiggins, George Fazekas
Abstract:
Generating expressive audio performances from music scores requires models to capture both instrument acoustics and human interpretation. Traditional music performance synthesis pipelines follow a two-stage approach, first generating expressive performance MIDI from a score, then synthesising the MIDI into audio. However, the synthesis models often struggle to generalise across diverse MIDI sources, musical styles, and recording environments. To address these challenges, we propose MIDI-VALLE, a neural codec language model adapted from the VALLE framework, which was originally designed for zero-shot personalised text-to-speech (TTS) synthesis. For performance MIDI-to-audio synthesis, we improve the architecture to condition on a reference audio performance and its corresponding MIDI. Unlike previous TTS-based systems that rely on piano rolls, MIDI-VALLE encodes both MIDI and audio as discrete tokens, facilitating a more consistent and robust modelling of piano performances. Furthermore, the model's generalisation ability is enhanced by training on an extensive and diverse piano performance dataset. Evaluation results show that MIDI-VALLE significantly outperforms a state-of-the-art baseline, achieving over 75% lower Frechet Audio Distance on the ATEPP and Maestro datasets. In the listening test, MIDI-VALLE received 202 votes compared to 58 for the baseline, demonstrating improved synthesis quality and generalisation across diverse performance MIDI inputs.
中文: MIDI-VALLE通过将MIDI和音频编码为离散标记并基于多样化数据集训练,显著提升了钢琴演奏合成的表现力和泛化能力,在质量和适应性上均优于现有先进方法。
English: MIDI-VALLE, a neural codec language model adapted from VALLE, enhances expressive piano performance synthesis by encoding both MIDI and audio as discrete tokens and training on a diverse dataset, achieving superior audio quality and generalization over state-of-the-art methods.
Authors:Zhibo Zhang, Yuxi Li, Kailong Wang, Shuai Yuan, Ling Shi, Haoyu Wang
Abstract:
Large Language Models (LLMs) have achieved remarkable success across domains such as healthcare, education, and cybersecurity. However, this openness also introduces significant security risks, particularly through embedding space poisoning, which is a subtle attack vector where adversaries manipulate the internal semantic representations of input data to bypass safety alignment mechanisms. While previous research has investigated universal perturbation methods, the dynamics of LLM safety alignment at the embedding level remain insufficiently understood. Consequently, more targeted and accurate adversarial perturbation techniques, which pose significant threats, have not been adequately studied.
In this work, we propose ETTA (Embedding Transformation Toxicity Attenuation), a novel framework that identifies and attenuates toxicity-sensitive dimensions in embedding space via linear transformations. ETTA bypasses model refusal behaviors while preserving linguistic coherence, without requiring model fine-tuning or access to training data. Evaluated on five representative open-source LLMs using the AdvBench benchmark, ETTA achieves a high average attack success rate of 88.61%, outperforming the best baseline by 11.34%, and generalizes to safety-enhanced models (e.g., 77.39% ASR on instruction-tuned defenses). These results highlight a critical vulnerability in current alignment strategies and underscore the need for embedding-aware defenses.
中文: 大语言模型面临嵌入空间投毒的安全风险,ETTA框架通过线性变换有效规避安全机制并实现高攻击成功率,揭示了现有对齐策略的脆弱性。
English: Large Language Models face security risks from embedding space poisoning, addressed by the ETTA framework which effectively bypasses safety mechanisms with high success rates, revealing vulnerabilities in current alignment strategies.
Authors:Chuanyang Zheng, Jiankai Sun, Yihang Gao, Yuehao Wang, Peihao Wang, Jing Xiong, Liliang Ren, Hao Cheng, Janardhan Kulkarni, Yelong Shen, Atlas Wang, Mac Schwager, Anderson Schneider, Xiaodong Liu, Jianfeng Gao
Abstract:
The attention mechanism is a core component of the Transformer architecture. Various methods have been developed to compute attention scores, including multi-head attention (MHA), multi-query attention, group-query attention and so on. We further analyze the MHA and observe that its performance improves as the number of attention heads increases, provided the hidden size per head remains sufficiently large. Therefore, increasing both the head count and hidden size per head with minimal parameter overhead can lead to significant performance gains at a low cost. Motivated by this insight, we introduce Simulated Attention Score (SAS), which maintains a compact model size while simulating a larger number of attention heads and hidden feature dimension per head. This is achieved by projecting a low-dimensional head representation into a higher-dimensional space, effectively increasing attention capacity without increasing parameter count. Beyond the head representations, we further extend the simulation approach to feature dimension of the key and query embeddings, enhancing expressiveness by mimicking the behavior of a larger model while preserving the original model size. To control the parameter cost, we also propose Parameter-Efficient Attention Aggregation (PEAA). Comprehensive experiments on a variety of datasets and tasks demonstrate the effectiveness of the proposed SAS method, achieving significant improvements over different attention variants.
Chinese: 该研究提出模拟注意力分数(SAS)方法,通过高效投影和聚合技术模拟更多注意力头和更大隐藏维度,在不增加参数的情况下提升Transformer性能。
English: The study introduces Simulated Attention Score (SAS), a method that enhances Transformer performance by simulating more attention heads and larger hidden dimensions without increasing parameters, using efficient projections and aggregation techniques.
Authors:Tao Feng, Xianbing Zhao, Zhenhua Chen, Tien Tsin Wong, Hamid Rezatofighi, Gholamreza Haffari, Lizhen Qu
Abstract:
Recent advances in diffusion-based and autoregressive video generation models have achieved remarkable visual realism. However, these models typically lack accurate physical alignment, failing to replicate real-world dynamics in object motion. This limitation arises primarily from their reliance on learned statistical correlations rather than capturing mechanisms adhering to physical laws. To address this issue, we introduce a novel framework that integrates symbolic regression (SR) and trajectory-guided image-to-video (I2V) models for physics-grounded video forecasting. Our approach extracts motion trajectories from input videos, uses a retrieval-based pre-training mechanism to enhance symbolic regression, and discovers equations of motion to forecast physically accurate future trajectories. These trajectories then guide video generation without requiring fine-tuning of existing models. Evaluated on scenarios in Classical Mechanics, including spring-mass, pendulums, and projectile motions, our method successfully recovers ground-truth analytical equations and improves the physical alignment of generated videos over baseline methods.
中文摘要:本研究提出了一种基于物理的视频预测框架,通过结合符号回归与轨迹引导的图像转视频模型,从输入视频中推导运动方程,从而生成物理准确性更高的视频。
English Summary: This study introduces a physics-grounded video forecasting framework that combines symbolic regression with trajectory-guided image-to-video models to generate videos with improved physical accuracy by discovering motion equations from input videos.
Authors:Zhenyang Liu, Sixiao Zheng, Siyu Chen, Cairong Zhao, Longfei Liang, Xiangyang Xue, Yanwei Fu
Abstract:
Open-vocabulary 3D visual grounding aims to localize target objects based on free-form language queries, which is crucial for embodied AI applications such as autonomous navigation, robotics, and augmented reality. Learning 3D language fields through neural representations enables accurate understanding of 3D scenes from limited viewpoints and facilitates the localization of target objects in complex environments. However, existing language field methods struggle to accurately localize instances using spatial relations in language queries, such as ``the book on the chair.'' This limitation mainly arises from inadequate reasoning about spatial relations in both language queries and 3D scenes. In this work, we propose SpatialReasoner, a novel neural representation-based framework with large language model (LLM)-driven spatial reasoning that constructs a visual properties-enhanced hierarchical feature field for open-vocabulary 3D visual grounding. To enable spatial reasoning in language queries, SpatialReasoner fine-tunes an LLM to capture spatial relations and explicitly infer instructions for the target, anchor, and spatial relation. To enable spatial reasoning in 3D scenes, SpatialReasoner incorporates visual properties (opacity and color) to construct a hierarchical feature field. This field represents language and instance features using distilled CLIP features and masks extracted via the Segment Anything Model (SAM). The field is then queried using the inferred instructions in a hierarchical manner to localize the target 3D instance based on the spatial relation in the language query. Extensive experiments show that our framework can be seamlessly integrated into different neural representations, outperforming baseline models in 3D visual grounding while empowering their spatial reasoning capability.
中文: SpatialReasoner框架通过结合大语言模型驱动的空间推理与分层特征场,提升了开放词汇3D视觉定位能力,有效解决了空间关系理解不足的问题,并在复杂环境中实现了更精准的目标定位。
English: The proposed SpatialReasoner framework enhances open-vocabulary 3D visual grounding by integrating LLM-driven spatial reasoning with a hierarchical feature field, overcoming limitations in spatial relation understanding and achieving superior localization accuracy across diverse environments.
Authors:Zhenyang Liu, Yikai Wang, Kuanning Wang, Longfei Liang, Xiangyang Xue, Yanwei Fu
Abstract:
Visual imitation learning is effective for robots to learn versatile tasks. However, many existing methods rely on behavior cloning with supervised historical trajectories, limiting their 3D spatial and 4D spatiotemporal awareness. Consequently, these methods struggle to capture the 3D structures and 4D spatiotemporal relationships necessary for real-world deployment. In this work, we propose 4D Diffusion Policy (DP4), a novel visual imitation learning method that incorporates spatiotemporal awareness into diffusion-based policies. Unlike traditional approaches that rely on trajectory cloning, DP4 leverages a dynamic Gaussian world model to guide the learning of 3D spatial and 4D spatiotemporal perceptions from interactive environments. Our method constructs the current 3D scene from a single-view RGB-D observation and predicts the future 3D scene, optimizing trajectory generation by explicitly modeling both spatial and temporal dependencies. Extensive experiments across 17 simulation tasks with 173 variants and 3 real-world robotic tasks demonstrate that the 4D Diffusion Policy (DP4) outperforms baseline methods, improving the average simulation task success rate by 16.4% (Adroit), 14% (DexArt), and 6.45% (RLBench), and the average real-world robotic task success rate by 8.6%.
Chinese: 提出的4D扩散策略(DP4)通过动态高斯世界模型将时空感知融入视觉模仿学习,在仿真和真实机器人任务中均显著超越基线方法,实现了更高的成功率。
English: The proposed 4D Diffusion Policy (DP4) introduces spatiotemporal awareness into visual imitation learning through a dynamic Gaussian world model, significantly outperforming baseline methods in both simulation and real-world robotic tasks with improved success rates.
Authors:Yana Wei, Liang Zhao, Jianjian Sun, Kangheng Lin, Jisheng Yin, Jingcheng Hu, Yinmin Zhang, En Yu, Haoran Lv, Zejia Weng, Jia Wang, Chunrui Han, Yuang Peng, Qi Han, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Vishal M. Patel
Abstract:
The remarkable reasoning capability of large language models (LLMs) stems from cognitive behaviors that emerge through reinforcement with verifiable rewards. This work investigates how to transfer this principle to Multimodal LLMs (MLLMs) to unlock advanced visual reasoning. We introduce a two-stage paradigm built on Qwen2.5-VL-7B: a massive linguistic cold-start fine-tuning, followed by multimodal reinforcement learning (RL) spanning nearly 1,000 steps, surpassing all previous open-source efforts in scale. This pioneering work reveals three fundamental insights: 1) Behavior transfer emerges surprisingly early in cold start due to linguistic mental imagery. 2) Cold start broadly memorizes visual behaviors, while RL critically discerns and scales up effective patterns. 3) Transfer strategically favors high-utility behaviors such as visual reflection. Our resulting model, Open-Vision-Reasoner (OVR), achieves state-of-the-art performance on a suite of reasoning benchmarks, including 95.3% on MATH500, 51.8% on MathVision and 54.6% on MathVerse. We release our model, data, and training dynamics to catalyze the development of more capable, behavior-aligned multimodal reasoners.
中文: 本研究为多模态大语言模型引入了两阶段训练范式,通过语言微调与多模态强化学习相结合,开发出Open-Vision-Reasoner模型,该模型凭借早期行为迁移和有效视觉模式的策略性扩展,在多项推理基准测试中实现了最先进的性能表现。
English: This research introduces a two-stage training paradigm for Multimodal LLMs that combines linguistic fine-tuning with multimodal reinforcement learning, yielding the Open-Vision-Reasoner model which achieves state-of-the-art performance on reasoning benchmarks through early behavior transfer and strategic scaling of effective visual patterns.
Authors:Benjamin Li, Shuyang Shi, Lucia Romero, Huao Li, Yaqi Xie, Woojun Kim, Stefanos Nikolaidis, Michael Lewis, Katia Sycara, Simon Stepputtis
Abstract:
In collaborative tasks, being able to adapt to your teammates is a necessary requirement for success. When teammates are heterogeneous, such as in human-agent teams, agents need to be able to observe, recognize, and adapt to their human partners in real time. This becomes particularly challenging in tasks with time pressure and complex strategic spaces where the dynamics can change rapidly. In this work, we introduce TALENTS, a strategy-conditioned cooperator framework that learns to represent, categorize, and adapt to a range of partner strategies, enabling ad-hoc teamwork. Our approach utilizes a variational autoencoder to learn a latent strategy space from trajectory data. This latent space represents the underlying strategies that agents employ. Subsequently, the system identifies different types of strategy by clustering the data. Finally, a cooperator agent is trained to generate partners for each type of strategy, conditioned on these clusters. In order to adapt to previously unseen partners, we leverage a fixed-share regret minimization algorithm that infers and adjusts the estimated partner strategy dynamically. We assess our approach in a customized version of the Overcooked environment, posing a challenging cooperative cooking task that demands strong coordination across a wide range of possible strategies. Using an online user study, we show that our agent outperforms current baselines when working with unfamiliar human partners.
Chinese: TALENTS框架通过变分自编码器学习和分类策略,并利用遗憾最小化算法动态调整,使智能体能在协作任务中适应多样化的人类伙伴,在人类与智能体团队合作中表现优于现有基准方法。
English: The TALENTS framework enables agents to adapt to diverse human partners in collaborative tasks by learning and categorizing strategies through a variational autoencoder and dynamically adjusting via regret minimization, outperforming existing methods in human-agent teamwork.
Authors:Sho Inoue, Kun Zhou, Shuai Wang, Haizhou Li
Abstract:
We investigate hierarchical emotion distribution (ED) for achieving multi-level quantitative control of emotion rendering in text-to-speech synthesis (TTS). We introduce a novel multi-step hierarchical ED prediction module that quantifies emotion variance at the utterance, word, and phoneme levels. By predicting emotion variance in a multi-step manner, we leverage global emotional context to refine local emotional variations, thereby capturing the intrinsic hierarchical structure of speech emotion. Our approach is validated through its integration into a variance adaptor and an external module design compatible with various TTS systems. Both objective and subjective evaluations demonstrate that the proposed framework significantly enhances emotional expressiveness and enables precise control of emotion rendering across multiple speech granularities.
中文摘要:本研究提出了一种用于语音合成的分层情感分布框架,通过新颖的多步预测模块在语句、词语和音素层面实现多层次情感控制,显著提升了语音的情感表现力。
English Summary: This study introduces a hierarchical emotion distribution framework for text-to-speech synthesis, enabling multi-level emotion control across utterance, word, and phoneme levels through a novel prediction module that significantly enhances emotional expressiveness.
Authors:Weilun Feng, Chuanguang Yang, Haotong Qin, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Boyu Diao, Fuzhen Zhuang, Michele Magno, Yongjun Xu, Yingli Tian, Tingwen Huang
Abstract:
Diffusion models have demonstrated remarkable performance on vision generation tasks. However, the high computational complexity hinders its wide application on edge devices. Quantization has emerged as a promising technique for inference acceleration and memory reduction. However, existing quantization methods do not generalize well under extremely low-bit (2-4 bit) quantization. Directly applying these methods will cause severe performance degradation. We identify that the existing quantization framework suffers from the outlier-unfriendly quantizer design, suboptimal initialization, and optimization strategy. We present MPQ-DMv2, an improved \textbf{M}ixed \textbf{P}recision \textbf{Q}uantization framework for extremely low-bit \textbf{D}iffusion \textbf{M}odels. For the quantization perspective, the imbalanced distribution caused by salient outliers is quantization-unfriendly for uniform quantizer. We propose \textit{Flexible Z-Order Residual Mixed Quantization} that utilizes an efficient binary residual branch for flexible quant steps to handle salient error. For the optimization framework, we theoretically analyzed the convergence and optimality of the LoRA module and propose \textit{Object-Oriented Low-Rank Initialization} to use prior quantization error for informative initialization. We then propose \textit{Memory-based Temporal Relation Distillation} to construct an online time-aware pixel queue for long-term denoising temporal information distillation, which ensures the overall temporal consistency between quantized and full-precision model. Comprehensive experiments on various generation tasks show that our MPQ-DMv2 surpasses current SOTA methods by a great margin on different architectures, especially under extremely low-bit widths.
中文摘要:MPQ-DMv2提出了一种混合精度量化框架,通过灵活量化策略和优化方法解决了极低位宽下扩散模型量化性能下降的问题,在不同生成任务中大幅超越现有最优方法。
English Summary: MPQ-DMv2 introduces a mixed precision quantization framework that addresses outlier-unfriendly quantizer design and optimization challenges, significantly outperforming existing methods in extremely low-bit diffusion model quantization across various generation tasks.
Authors:Ziyu Zhu, Xilin Wang, Yixuan Li, Zhuofan Zhang, Xiaojian Ma, Yixin Chen, Baoxiong Jia, Wei Liang, Qian Yu, Zhidong Deng, Siyuan Huang, Qing Li
Abstract:
Embodied scene understanding requires not only comprehending visual-spatial information that has been observed but also determining where to explore next in the 3D physical world. Existing 3D Vision-Language (3D-VL) models primarily focus on grounding objects in static observations from 3D reconstruction, such as meshes and point clouds, but lack the ability to actively perceive and explore their environment. To address this limitation, we introduce \underline{\textbf{M}}ove \underline{\textbf{t}}o \underline{\textbf{U}}nderstand (\textbf{\model}), a unified framework that integrates active perception with \underline{\textbf{3D}} vision-language learning, enabling embodied agents to effectively explore and understand their environment. This is achieved by three key innovations: 1) Online query-based representation learning, enabling direct spatial memory construction from RGB-D frames, eliminating the need for explicit 3D reconstruction. 2) A unified objective for grounding and exploring, which represents unexplored locations as frontier queries and jointly optimizes object grounding and frontier selection. 3) End-to-end trajectory learning that combines \textbf{V}ision-\textbf{L}anguage-\textbf{E}xploration pre-training over a million diverse trajectories collected from both simulated and real-world RGB-D sequences. Extensive evaluations across various embodied navigation and question-answering benchmarks show that MTU3D outperforms state-of-the-art reinforcement learning and modular navigation approaches by 14\%, 23\%, 9\%, and 2\% in success rate on HM3D-OVON, GOAT-Bench, SG3D, and A-EQA, respectively. \model's versatility enables navigation using diverse input modalities, including categories, language descriptions, and reference images. These findings highlight the importance of bridging visual grounding and exploration for embodied intelligence.
中文: 针对现有3D视觉语言模型缺乏主动环境感知能力的问题,本文提出MTU3D统一框架,通过在线表征学习、探索与定位联合优化及端到端轨迹训练三大创新,将主动感知与3D视觉语言学习相结合,在多个具身智能基准测试中取得显著性能提升。
English: To overcome the limitations of existing 3D Vision-Language models in active environmental perception, the authors introduce MTU3D, a unified framework that integrates active exploration with 3D vision-language learning through innovations in online representation, unified grounding-exploration objectives, and end-to-end trajectory training, achieving superior performance on multiple benchmarks.
Authors:Boyang Wang, Yalun Wu, Hongcheng Guo, Zhoujun Li
Abstract:
As digital emotional support needs grow, Large Language Model companions offer promising authentic, always-available empathy, though rigorous evaluation lags behind model advancement. We present Heart-to-Heart Talk (H2HTalk), a benchmark assessing companions across personality development and empathetic interaction, balancing emotional intelligence with linguistic fluency. H2HTalk features 4,650 curated scenarios spanning dialogue, recollection, and itinerary planning that mirror real-world support conversations, substantially exceeding previous datasets in scale and diversity. We incorporate a Secure Attachment Persona (SAP) module implementing attachment-theory principles for safer interactions. Benchmarking 50 LLMs with our unified protocol reveals that long-horizon planning and memory retention remain key challenges, with models struggling when user needs are implicit or evolve mid-conversation. H2HTalk establishes the first comprehensive benchmark for emotionally intelligent companions. We release all materials to advance development of LLMs capable of providing meaningful and safe psychological support.
中文: 该摘要介绍了H2HTalk,这是一个评估大语言模型伴侣在共情互动和个性发展方面的综合基准,揭示了长期规划和记忆方面的挑战,并通过其安全依恋人格模块推动更安全的AI心理支持发展。
English: The abstract introduces H2HTalk, a comprehensive benchmark for evaluating Large Language Model companions on empathetic interactions and personality development, highlighting challenges in long-horizon planning and memory while promoting safer AI support through its Secure Attachment Persona module.
Authors:Yuchen Ma, Dennis Frauen, Jonas Schweisthal, Stefan Feuerriegel
Abstract:
Estimating treatment effects is crucial for personalized decision-making in medicine, but this task faces unique challenges in clinical practice. At training time, models for estimating treatment effects are typically trained on well-structured medical datasets that contain detailed patient information. However, at inference time, predictions are often made using textual descriptions (e.g., descriptions with self-reported symptoms), which are incomplete representations of the original patient information. In this work, we make three contributions. (1) We show that the discrepancy between the data available during training time and inference time can lead to biased estimates of treatment effects. We formalize this issue as an inference time text confounding problem, where confounders are fully observed during training time but only partially available through text at inference time. (2) To address this problem, we propose a novel framework for estimating treatment effects that explicitly accounts for inference time text confounding. Our framework leverages large language models together with a custom doubly robust learner to mitigate biases caused by the inference time text confounding. (3) Through a series of experiments, we demonstrate the effectiveness of our framework in real-world applications.
中文: 本研究揭示了当基于结构化数据训练的模型应用于文本描述时,推理阶段的文本混杂会导致治疗效果估计偏差,并提出了一种结合大语言模型与双重稳健学习器的新框架,在实际应用中有效缓解了这一问题。
English: This study identifies inference time text confounding as a key issue causing biased treatment effect estimates when models trained on structured data are applied to textual descriptions, and proposes a novel framework combining large language models with a doubly robust learner to effectively mitigate this bias in real-world applications.
Authors:Xiyuxing Zhang, Duc Vu, Chengyi Shen, Yuntao Wang, Yuanchun Shi, Justin Chan
Abstract:
There is growing industry interest in creating unobtrusive designs for electrooculography (EOG) sensing of eye gestures on glasses (e.g. JINS MEME and Apple eyewear). We present VergeIO, the first EOG-based glasses that enables depth-aware eye interaction using vergence with an optimized electrode layout and novel smart glass prototype. It can distinguish between four and six depth-based eye gestures with 83-98% accuracy using personalized models in a user study across 11 users and 1,320 gesture instances. It generalizes to unseen users with an accuracy of 80-98% without any calibration. To reduce false detections, we incorporate a motion artifact detection pipeline and a preamble-based activation scheme. The system uses dry sensors without any adhesives or gel, and operates in real time with 3 mW power consumption by the sensing front-end, making it suitable for always-on sensing.
中文摘要:VergeIO推出了首款基于眼电图的智能眼镜,通过优化电极布局实现深度感知的眼动交互,在无需用户校准的情况下能以83-98%的准确率识别深度手势,且仅需3毫瓦功耗即可实现实时持续感知。
English Summary: VergeIO introduces the first EOG-based smart glasses enabling depth-aware eye interaction through an optimized electrode design, achieving 83-98% accuracy for depth-based gestures with minimal power consumption and no user calibration required.
Authors:Mathieu Dubied, Amon Lahr, Melanie N. Zeilinger, Johannes Köhler
Abstract:
In this paper, we present a robust and adaptive model predictive control (MPC) framework for uncertain nonlinear systems affected by bounded disturbances and unmodeled nonlinearities. We use Gaussian Processes (GPs) to learn the uncertain dynamics based on noisy measurements, including those collected during system operation. As a key contribution, we derive robust predictions for GP models using contraction metrics, which are incorporated in the MPC formulation. The proposed design guarantees recursive feasibility, robust constraint satisfaction and convergence to a reference state, with high probability. We provide a numerical example of a planar quadrotor subject to difficult-to-model ground effects, which highlights significant improvements achieved through the proposed robust prediction method and through online learning.
中文: 本文提出了一种鲁棒自适应模型预测控制框架,利用高斯过程学习不确定动态并通过收缩度量实现鲁棒预测,以高概率保证递归可行性、约束满足及状态收敛。
English: This paper introduces a robust adaptive MPC framework using Gaussian Processes to learn uncertain dynamics and contraction metrics for robust predictions, ensuring recursive feasibility and constraint satisfaction with high probability.
Authors:Patrick Benito Eberhard, Johannes Köhler, Oliver Hüsser, Melanie N. Zeilinger, Andrea Carron
Abstract:
Time-varying coverage control addresses the challenge of coordinating multiple agents covering an environment where regions of interest change over time. This problem has broad applications, including the deployment of autonomous taxis and coordination in search and rescue operations. The achievement of effective coverage is complicated by the presence of time-varying density functions, nonlinear agent dynamics, and stringent system and safety constraints. In this paper, we present a distributed multi-agent control framework for time-varying coverage under nonlinear constrained dynamics. Our approach integrates a reference trajectory planner and a tracking model predictive control (MPC) scheme, which operate at different frequencies within a multi-rate framework. For periodic density functions, we demonstrate closed-loop convergence to an optimal configuration of trajectories and provide formal guarantees regarding constraint satisfaction, collision avoidance, and recursive feasibility. Additionally, we propose an efficient algorithm capable of handling nonperiodic density functions, making the approach suitable for practical applications. Finally, we validate our method through hardware experiments using a fleet of four miniature race cars.
中文: 本文提出了一种分布式多智能体控制框架,通过多速率轨迹规划与跟踪模型预测控制相结合,在非线性约束下实现时变覆盖控制,保证了安全性、收敛性及可行性,并利用微型赛车车队进行了硬件实验验证。
English: This paper introduces a distributed multi-agent control framework that employs a multi-rate system with trajectory planning and tracking MPC to achieve effective time-varying coverage under nonlinear constraints, ensuring safety and convergence while being validated through hardware experiments.
Authors:Zhenyang Liu, Yongchong Gu, Sixiao Zheng, Xiangyang Xue, Yanwei Fu
Abstract:
Recent advancements in vision-language models (VLMs) for common-sense reasoning have led to the development of vision-language-action (VLA) models, enabling robots to perform generalized manipulation. Although existing autoregressive VLA methods design a specific architecture like dual-system to leverage large-scale pretrained knowledge, they tend to capture static information, often neglecting the dynamic aspects vital for embodied tasks. To this end, we propose TriVLA, a unified Vision-Language-Action model with a triple-system architecture for general robot control. The vision-language module (System 2) interprets the environment through vision and language instructions. The dynamics perception module (System 3) inherently produces visual representations that encompass both current static information and predicted future dynamics, thereby providing valuable guidance for policy learning. TriVLA utilizes pre-trained VLM model and fine-tunes pre-trained video foundation model on robot datasets along with internet human manipulation data. The subsequent policy learning module (System 1) generates fluid motor actions in real time. Experimental evaluation demonstrates that TriVLA operates at approximately 36 Hz and surpasses state-of-the-art imitation learning baselines on standard simulation benchmarks as well as challenging real-world manipulation tasks.
中文: TriVLA提出了一种三重系统视觉-语言-动作模型,通过整合环境解析、动态感知和实时策略学习,在仿真与真实机器人操作任务中以36赫兹频率运行并实现卓越性能。
English: TriVLA introduces a triple-system vision-language-action model that integrates environmental interpretation, dynamic perception, and real-time policy learning, achieving superior performance at 36 Hz in both simulated and real-world robotic manipulation tasks.
Authors:Zhe Zhang, Wen-Chin Huang, Xin Wang, Xiaoxiao Miao, Junichi Yamagishi
Abstract:
Speaker anonymization aims to protect speaker identity while preserving content information and the intelligibility of speech. However, most speaker anonymization systems (SASs) are developed and evaluated using only English, resulting in degraded utility for other languages. This paper investigates language mismatch in SASs for Japanese and Mandarin speech. First, we fine-tune a self-supervised learning (SSL)-based content encoder with Japanese speech to verify effective language adaptation. Then, we propose fine-tuning a multilingual SSL model with Japanese speech and evaluating the SAS in Japanese and Mandarin. Downstream experiments show that fine-tuning an English-only SSL model with the target language enhances intelligibility while maintaining privacy and that multilingual SSL further extends SASs' utility across different languages. These findings highlight the importance of language adaptation and multilingual pre-training of SSLs for robust multilingual speaker anonymization.
中文摘要:研究表明,通过目标语言(如日语和普通话)对自监督学习模型进行微调,可在保护说话人隐私的同时,显著提升跨语言语音匿名化系统的实用性能。
English Summary: This study demonstrates that fine-tuning self-supervised learning models with target languages like Japanese and Mandarin significantly improves speaker anonymization system performance across languages while maintaining privacy protection.
Authors:Davide Salaorni, Vincenzo De Paola, Samuele Delpero, Giovanni Dispoto, Paolo Bonetti, Alessio Russo, Giuseppe Calcagno, Francesco Trovò, Matteo Papini, Alberto Maria Metelli, Marco Mussi, Marcello Restelli
Abstract:
In recent years, \emph{Reinforcement Learning} (RL) has made remarkable progress, achieving superhuman performance in a wide range of simulated environments. As research moves toward deploying RL in real-world applications, the field faces a new set of challenges inherent to real-world settings, such as large state-action spaces, non-stationarity, and partial observability. Despite their importance, these challenges are often underexplored in current benchmarks, which tend to focus on idealized, fully observable, and stationary environments, often neglecting to incorporate real-world complexities explicitly. In this paper, we introduce \texttt{Gym4ReaL}, a comprehensive suite of realistic environments designed to support the development and evaluation of RL algorithms that can operate in real-world scenarios. The suite includes a diverse set of tasks that expose algorithms to a variety of practical challenges. Our experimental results show that, in these settings, standard RL algorithms confirm their competitiveness against rule-based benchmarks, motivating the development of new methods to fully exploit the potential of RL to tackle the complexities of real-world tasks.
Chinese: 近年来强化学习在模拟环境中表现卓越,但面临现实世界中的大状态空间和部分可观测性等挑战,为此推出了Gym4ReaL环境套件,旨在测试和改进适用于实际应用的强化学习算法。
English: Recent advances in reinforcement learning have excelled in simulations but face real-world challenges like large state spaces and partial observability, which are addressed by the new Gym4ReaL environment suite designed to test and improve RL algorithms for practical applications.
Authors:Qian Zhao, Zhuo Sun, Bin Guo, Zhiwen Yu
Abstract:
Agent-assisted memory recall is one critical research problem in the field of human-computer interaction. In conventional methods, the agent can retrieve information from its equipped memory module to help the person recall incomplete or vague memories. The limited size of memory module hinders the acquisition of complete memories and impacts the memory recall performance in practice. Memory theories suggest that the person's relevant memory can be proactively activated through some effective cues. Inspired by this, we propose a novel strategy-guided agent-assisted memory recall method, allowing the agent to transform an original query into a cue-rich one via the judiciously designed strategy to help the person recall memories. To this end, there are two key challenges. (1) How to choose the appropriate recall strategy for diverse forgetting scenarios with distinct memory-recall characteristics? (2) How to obtain the high-quality responses leveraging recall strategies, given only abstract and sparsely annotated strategy patterns? To address the challenges, we propose a Recall Router framework. Specifically, we design a 5W Recall Map to classify memory queries into five typical scenarios and define fifteen recall strategy patterns across the corresponding scenarios. We then propose a hierarchical recall tree combined with the Monte Carlo Tree Search algorithm to optimize the selection of strategy and the generation of strategy responses. We construct an instruction tuning dataset and fine-tune multiple open-source large language models (LLMs) to develop MemoCue, an agent that excels in providing memory-inspired responses. Experiments on three representative datasets show that MemoCue surpasses LLM-based methods by 17.74% in recall inspiration. Further human evaluation highlights its advantages in memory-recall applications.
中文摘要:本文提出了一种策略引导的智能体辅助记忆回溯方法,通过分层回溯树和蒙特卡洛树搜索将原始查询转化为线索丰富的提示,在记忆回溯应用中显著优于现有基于大语言模型的方法。
English Summary: The paper introduces a strategy-guided agent-assisted memory recall method that transforms queries into cue-rich prompts using a hierarchical recall tree and Monte Carlo Tree Search, significantly outperforming existing LLM-based approaches in recall inspiration.
Authors:Jiayu Li, Ziyi Ye, Guohao Jian, Zhiqiang Guo, Weizhi Ma, Qingyao Ai, Min Zhang
Abstract:
Can a recommendation model be self-aware? This paper investigates the recommender's self-awareness by quantifying its uncertainty, which provides a label-free estimation of its performance. Such self-assessment can enable more informed understanding and decision-making before the recommender engages with any users. To this end, we propose an intuitive and effective method, probability-based List Distribution uncertainty (LiDu). LiDu measures uncertainty by determining the probability that a recommender will generate a certain ranking list based on the prediction distributions of individual items. We validate LiDu's ability to represent model self-awareness in two settings: (1) with a matrix factorization model on a synthetic dataset, and (2) with popular recommendation algorithms on real-world datasets. Experimental results show that LiDu is more correlated with recommendation performance than a series of label-free performance estimators. Additionally, LiDu provides valuable insights into the dynamic inner states of models throughout training and inference. This work establishes an empirical connection between recommendation uncertainty and performance, framing it as a step towards more transparent and self-evaluating recommender systems.
本文提出了LiDu方法,通过基于概率的列表分布不确定性量化推荐模型的自我认知能力,无需依赖标签即可实现模型性能的自我评估与系统透明化。
This paper introduces LiDu, a probability-based method that quantifies recommendation model uncertainty to enable self-assessment and enhance transparency without relying on labels.
Authors:Chenghao Wang, Eric Sihite, Kaushik Venkatesh Krishnamurthy, Shreyansh Pitroda, Adarsh Salagame, Alireza Ramezani, Morteza Gharib
Abstract:
There has been significant advancement in legged robot's agility where they can show impressive acrobatic maneuvers, such as parkour. These maneuvers rely heavily on posture manipulation. To expand the stability and locomotion plasticity, we use the multi-modal ability in our legged-aerial platform, the Husky Beta, to perform thruster-assisted walking. This robot has thrusters on each of its sagittal knee joints which can be used to stabilize its frontal dynamic as it walks. In this work, we perform a simulation study of quadruped narrow-path walking with Husky $β$, where the robot will utilize its thrusters to stably walk on a narrow path. The controller is designed based on a centroidal dynamics model with thruster and foot ground contact forces as inputs. These inputs are regulated using a QP solver to be used in a model predictive control framework. In addition to narrow-path walking, we also perform a lateral push-recovery simulation to study how the thrusters can be used to stabilize the frontal dynamics.
中文:Husky Beta腿式空中机器人通过在膝关节安装推进器,实现了稳定的窄路行走和侧向抗冲击恢复能力,其模型预测控制框架优化了推进器与地面接触力的协调作用。
English: The Husky Beta legged-aerial robot uses thrusters on its knee joints to enable stable narrow-path walking and lateral push-recovery, with a model predictive control framework optimizing thruster and ground contact forces.
Authors:Yikun Li, Ngoc Tan Bui, Ting Zhang, Martin Weyssow, Chengran Yang, Xin Zhou, Jinfeng Jiang, Junkai Chen, Huihui Huang, Huu Hung Nguyen, Chiok Yew Ho, Jie Tan, Ruiyin Li, Yide Yin, Han Wei Ang, Frank Liauw, Eng Lieh Ouh, Lwin Khin Shar, David Lo
Abstract:
Automated vulnerability detection research has made substantial progress, yet its real-world impact remains limited. Current vulnerability datasets suffer from issues including label inaccuracy rates of 20-71%, extensive duplication, and poor coverage of critical CWE types. These issues create a significant "generalization gap" where models achieve misleading self-testing performance (measured on held-out data from the same dataset for training) by exploiting spurious correlations rather than learning true vulnerability patterns. Our analysis reveals that many models experience substantial performance drops of up to 33% when evaluated on independent data, with some performing close to random guessing. To address these limitations, we present a three-part solution. First, we introduce a manually curated test dataset, BenchVul, covering the MITRE Top 25 Most Dangerous CWEs. Second, we construct a high-quality training dataset, TitanVul, comprising 38,863 functions by aggregating seven public sources and applying deduplication and validation using a novel multi-agent LLM framework. Third, we propose a Realistic Vulnerability Generation (RVG) framework, which synthesizes context-aware vulnerability examples for underrepresented but critical CWE types through simulated development workflows. Our evaluation shows the strengths of each component in closing the generalization gap. First, BenchVul shows the limitations of self-testing: models trained on existing datasets, such as BigVul and CVEfixes, experience performance drops on BenchVul (from 0.776 to 0.519 and from 0.713 to 0.607). Second, training models on TitanVul demonstrates improved generalization, with model performance increasing from 0.584 when evaluated on the same dataset to 0.767 when tested on BenchVul. Third, supplementing TitanVul with RVG-generated data yields further gains, increasing model performance by 14.0% to 0.874.
中文: 当前自动化漏洞检测模型因数据集缺陷存在显著的泛化差距,而我们的解决方案——结合精心构建的BenchVul测试集、高质量的TitanVul训练数据和真实漏洞生成框架——有效弥补了这一差距并提升了模型性能。
English: Current automated vulnerability detection models suffer from a significant generalization gap due to flawed datasets, but our solution—combining the curated BenchVul test set, high-quality TitanVul training data, and realistic vulnerability generation—effectively closes this gap and improves model performance.
Authors:Xingyang Li, Jie Jiang, Yu Feng, Yiming Gan, Jieru Zhao, Zihan Liu, Jingwen Leng, Minyi Guo
Abstract:
Rendering is critical in fields like 3D modeling, AR/VR, and autonomous driving, where high-quality, real-time output is essential. Point-based neural rendering (PBNR) offers a photorealistic and efficient alternative to conventional methods, yet it is still challenging to achieve real-time rendering on mobile platforms. We pinpoint two major bottlenecks in PBNR pipelines: LoD search and splatting. LoD search suffers from workload imbalance and irregular memory access, making it inefficient on off-the-shelf GPUs. Meanwhile, splatting introduces severe warp divergence across GPU threads due to its inherent sparsity.
To tackle these challenges, we propose SLTarch, an algorithm-architecture co-designed framework. At its core, SLTarch introduces SLTree, a dedicated subtree-based data structure, and LTcore, a specialized hardware architecture tailored for efficient LoD search. Additionally, we co-design a divergence-free splatting algorithm with our simple yet principled hardware augmentation, SPcore, to existing PBNR accelerators. Compared to a mobile GPU, SLTarch achieves 3.9$\times$ speedup and 98\% energy savings with negligible architecture overhead. Compared to existing accelerator designs, SLTarch achieves 1.8$\times$ speedup with 54\% energy savings.
中文: SLTarch是一种算法与架构协同设计的框架,通过专用数据结构和硬件架构解决了点基神经渲染的瓶颈,在移动平台上实现了显著的加速和节能效果。
English: SLTarch is an algorithm-architecture co-designed framework that overcomes bottlenecks in point-based neural rendering by introducing specialized data structures and hardware, achieving significant speedup and energy savings on mobile platforms.
Authors:Ruiyin Li, Yiran Zhang, Xiyu Zhou, Peng Liang, Weisong Sun, Jifeng Xuan, Zhi Jin, Yang Liu
Abstract:
Software architecture design is a critical, yet inherently complex and knowledge-intensive phase of software development. It requires deep domain expertise, development experience, architectural knowledge, careful trade-offs among competing quality attributes, and the ability to adapt to evolving requirements. Traditionally, this process is time-consuming and labor-intensive, and relies heavily on architects, often resulting in limited design alternatives, especially under the pressures of agile development. While Large Language Model (LLM)-based agents have shown promising performance across various SE tasks, their application to architecture design remains relatively scarce and requires more exploration, particularly in light of diverse domain knowledge and complex decision-making. To address the challenges, we proposed MAAD (Multi-Agent Architecture Design), an automated framework that employs a knowledge-driven Multi-Agent System (MAS) for architecture design. MAAD orchestrates four specialized agents (i.e., Analyst, Modeler, Designer and Evaluator) to collaboratively interpret requirements specifications and produce architectural blueprints enriched with quality attributes-based evaluation reports. We then evaluated MAAD through a case study and comparative experiments against MetaGPT, a state-of-the-art MAS baseline. Our results show that MAAD's superiority lies in generating comprehensive architectural components and delivering insightful and structured architecture evaluation reports. Feedback from industrial architects across 11 requirements specifications further reinforces MAAD's practical usability. We finally explored the performance of the MAAD framework with three LLMs (GPT-4o, DeepSeek-R1, and Llama 3.3) and found that GPT-4o exhibits better performance in producing architecture design, emphasizing the importance of LLM selection in MAS-driven architecture design.
中文: MAAD框架采用多智能体系统自动化软件架构设计,相比现有基线在生成全面架构组件和结构化评估报告方面展现出更优性能。
English: The MAAD framework employs a multi-agent system to automate software architecture design, demonstrating superior performance in generating comprehensive components and structured evaluation reports compared to existing baselines.
Authors:Katherine M. Collins, Kartik Chandra, Adrian Weller, Jonathan Ragan-Kelley, Joshua B. Tenenbaum
Abstract:
Why do we give the explanations we do? Recent work has suggested that we should think of explanation as a kind of cooperative social interaction, between a why-question-asker and an explainer. Here, we apply this perspective to consider the role that emotion plays in this social interaction. We develop a computational framework for modeling explainers who consider the emotional impact an explanation might have on a listener. We test our framework by using it to model human intuitions about how a doctor might explain to a patient why they have a disease, taking into account the patient's propensity for regret. Our model predicts human intuitions well, better than emotion-agnostic ablations, suggesting that people do indeed reason about emotion when giving explanations.
Chinese Summary: 该研究提出一个计算框架,模拟解释者如何考量解释对听者的情绪影响,并通过医患场景验证了人们在解释时会进行情绪推理,其表现优于忽略情绪的模型。
English Summary: The study proposes a computational framework that models how explainers consider the emotional impact on listeners, demonstrating through doctor-patient scenarios that people incorporate emotional reasoning into explanations, outperforming emotion-agnostic models.
Authors:Yifan Hao, Fangning Chao, Yaqian Hao, Zhaojun Cui, Huan Bai, Haiyu Zhang, Yankai Liu, Chao Deng, Junlan Feng
Abstract:
Mathematical reasoning is a cornerstone of artificial general intelligence and a primary benchmark for evaluating the capabilities of Large Language Models (LLMs). While state-of-the-art models show promise, they often falter when faced with complex problems that demand deep conceptual understanding and intricate, multi-step deliberation. To address this challenge, we introduce JT-Math-8B, a series of open-source models comprising base, instruct, and thinking versions, built upon a systematic, multi-stage optimization framework. Our pre-training corpus is a high-quality, 210B-token dataset curated through a dedicated data pipeline that uses model-based validation to ensure quality and diversity. The Instruct Model is optimized for direct, concise answers through Supervised Fine-Tuning (SFT) and a GRPO-based reinforcement learning (RL) method. The Thinking Model is trained for complex problem-solving using a Long Chain-of-Thought (Long CoT) approach, combining SFT with a novel, multi-stage RL curriculum that progressively increases task difficulty and context length up to 32K tokens. JT-Math-8B achieves state-of-the-art results among open-source models of similar size, surpassing prominent models like OpenAI's O1-mini and GPT-4o , and demonstrating superior performance on competition-level mathematics.
Chinese: JT-Math-8B系列开源模型通过多阶段优化框架,在数学推理上实现了同类模型的最优性能,在竞赛级数学任务中超越了GPT-4o等领先模型。
English: JT-Math-8B is a series of open-source models that achieve state-of-the-art performance in mathematical reasoning through a multi-stage optimization framework, surpassing leading models like GPT-4o in competition-level tasks.
Authors:Zhiyuan Chen, Yuecong Min, Jie Zhang, Bei Yan, Jiahao Wang, Xiaozhen Wang, Shiguang Shan
Abstract:
Multi-modal Large Language Models (MLLMs) have emerged as a powerful paradigm for integrating visual and textual information, supporting a wide range of multi-modal tasks. However, these models often suffer from hallucination, producing content that appears plausible but contradicts the input content or established world knowledge. This survey offers an in-depth review of hallucination evaluation benchmarks and detection methods across Image-to-Text (I2T) and Text-to-image (T2I) generation tasks. Specifically, we first propose a taxonomy of hallucination based on faithfulness and factuality, incorporating the common types of hallucinations observed in practice. Then we provide an overview of existing hallucination evaluation benchmarks for both T2I and I2T tasks, highlighting their construction process, evaluation objectives, and employed metrics. Furthermore, we summarize recent advances in hallucination detection methods, which aims to identify hallucinated content at the instance level and serve as a practical complement of benchmark-based evaluation. Finally, we highlight key limitations in current benchmarks and detection methods, and outline potential directions for future research.
中文: 本文系统综述了多模态大语言模型中的幻觉问题,提出了分类框架并分析了评估基准与检测方法,同时指出了当前研究的局限性与未来发展方向。
English: This survey comprehensively reviews hallucination issues in Multi-modal Large Language Models, proposing a taxonomy and analyzing evaluation benchmarks and detection methods while identifying current limitations and future research directions.
Authors:Shuqing Li, Anson Y. Lam, Yun Peng, Wenxuan Wang, Michael R. Lyu
Abstract:
Graphical user interface (UI) software has undergone a fundamental transformation from traditional two-dimensional (2D) desktop/web/mobile interfaces to spatial three-dimensional (3D) environments. While existing work has made remarkable success in automated 2D software generation, such as HTML/CSS and mobile app interface code synthesis, the generation of 3D software still remains under-explored. Current methods for 3D software generation usually generate the 3D environments as a whole and cannot modify or control specific elements in the software. Furthermore, these methods struggle to handle the complex spatial and semantic constraints inherent in the real world. To address the challenges, we present Scenethesis, a novel requirement-sensitive 3D software synthesis approach that maintains formal traceability between user specifications and generated 3D software. Scenethesis is built upon ScenethesisLang, a domain-specific language that serves as a granular constraint-aware intermediate representation (IR) to bridge natural language requirements and executable 3D software. It serves both as a comprehensive scene description language enabling fine-grained modification of 3D software elements and as a formal constraint-expressive specification language capable of expressing complex spatial constraints. By decomposing 3D software synthesis into stages operating on ScenethesisLang, Scenethesis enables independent verification, targeted modification, and systematic constraint satisfaction. Our evaluation demonstrates that Scenethesis accurately captures over 80% of user requirements and satisfies more than 90% of hard constraints while handling over 100 constraints simultaneously. Furthermore, Scenethesis achieves a 42.8% improvement in BLIP-2 visual evaluation scores compared to the state-of-the-art method.
中文: 本文提出Scenethesis,一种通过领域特定语言连接用户需求与可执行代码的3D软件生成新方法,实现了细粒度控制和约束满足,相比现有方法有显著提升。
English: The paper introduces Scenethesis, a novel approach for generating 3D software that bridges user requirements and executable code through a domain-specific language, enabling fine-grained control and constraint satisfaction with significant improvements over existing methods.
Authors:Guichen Huang, Ruoyu Wang, Xiangjun Gao, Che Sun, Yuwei Wu, Shenghua Gao, Yunde Jia
Abstract:
3D Gaussian Splatting achieves high-fidelity novel view synthesis, but its application to online long-sequence scenarios is still limited. Existing methods either rely on slow per-scene optimization or fail to provide efficient incremental updates, hindering continuous performance. In this paper, we propose LongSplat, an online real-time 3D Gaussian reconstruction framework designed for long-sequence image input. The core idea is a streaming update mechanism that incrementally integrates current-view observations while selectively compressing redundant historical Gaussians. Crucial to this mechanism is our Gaussian-Image Representation (GIR), a representation that encodes 3D Gaussian parameters into a structured, image-like 2D format. GIR simultaneously enables efficient fusion of current-view and historical Gaussians and identity-aware redundancy compression. These functions enable online reconstruction and adapt the model to long sequences without overwhelming memory or computational costs. Furthermore, we leverage an existing image compression method to guide the generation of more compact and higher-quality 3D Gaussians. Extensive evaluations demonstrate that LongSplat achieves state-of-the-art efficiency-quality trade-offs in real-time novel view synthesis, delivering real-time reconstruction while reducing Gaussian counts by 44\% compared to existing per-pixel Gaussian prediction methods.
中文: LongSplat提出了一种实时3D高斯重建框架,通过流式更新机制和高斯-图像表示法,能在处理长序列时实现高效在线重建,并将高斯数量减少44%,达到最优的效能-质量平衡。
English: LongSplat introduces a real-time 3D Gaussian reconstruction framework with a streaming update mechanism and Gaussian-Image Representation, enabling efficient online processing of long sequences while reducing Gaussian counts by 44% for superior efficiency-quality balance.
Authors:An Wang, Rulin Zhou, Mengya Xu, Yiru Ye, Longfei Gou, Yiting Chang, Hao Chen, Chwee Ming Lim, Jiankun Wang, Hongliang Ren
Abstract:
Visualizing subtle vascular motions in endoscopic surgery is crucial for surgical precision and decision-making, yet remains challenging due to the complex and dynamic nature of surgical scenes. To address this, we introduce EndoControlMag, a training-free, Lagrangian-based framework with mask-conditioned vascular motion magnification tailored to endoscopic environments. Our approach features two key modules: a Periodic Reference Resetting (PRR) scheme that divides videos into short overlapping clips with dynamically updated reference frames to prevent error accumulation while maintaining temporal coherence, and a Hierarchical Tissue-aware Magnification (HTM) framework with dual-mode mask dilation. HTM first tracks vessel cores using a pretrained visual tracking model to maintain accurate localization despite occlusions and view changes. It then applies one of two adaptive softening strategies to surrounding tissues: motion-based softening that modulates magnification strength proportional to observed tissue displacement, or distance-based exponential decay that simulates biomechanical force attenuation. This dual-mode approach accommodates diverse surgical scenarios-motion-based softening excels with complex tissue deformations while distance-based softening provides stability during unreliable optical flow conditions. We evaluate EndoControlMag on our EndoVMM24 dataset spanning four different surgery types and various challenging scenarios, including occlusions, instrument disturbance, view changes, and vessel deformations. Quantitative metrics, visual assessments, and expert surgeon evaluations demonstrate that EndoControlMag significantly outperforms existing methods in both magnification accuracy and visual quality while maintaining robustness across challenging surgical conditions. The code, dataset, and video results are available at https://szupc.github.io/EndoControlMag/.
Chinese: 本文提出EndoControlMag框架,通过自适应掩码调节放大和周期性参考帧重置技术,在无需训练的情况下增强内窥镜手术中细微血管运动的可视化效果,在多种复杂手术场景下均展现出卓越的精度和鲁棒性。
English: This paper introduces EndoControlMag, a training-free framework for magnifying subtle vascular motions in endoscopic surgery through adaptive mask-conditioned magnification and periodic reference resetting, demonstrating superior accuracy and robustness across diverse surgical scenarios.
Authors:Mengya Xu, Rulin Zhou, An Wang, Chaoyang Lyu, Zhen Li, Ning Zhong, Hongliang Ren
Abstract:
Intraoperative bleeding during Endoscopic Submucosal Dissection (ESD) poses significant risks, demanding precise, real-time localization and continuous monitoring of the bleeding source for effective hemostatic intervention. In particular, endoscopists have to repeatedly flush to clear blood, allowing only milliseconds to identify bleeding sources, an inefficient process that prolongs operations and elevates patient risks. However, current Artificial Intelligence (AI) methods primarily focus on bleeding region segmentation, overlooking the critical need for accurate bleeding source detection and temporal tracking in the challenging ESD environment, which is marked by frequent visual obstructions and dynamic scene changes. This gap is widened by the lack of specialized datasets, hindering the development of robust AI-assisted guidance systems. To address these challenges, we introduce BleedOrigin-Bench, the first comprehensive ESD bleeding source dataset, featuring 1,771 expert-annotated bleeding sources across 106,222 frames from 44 procedures, supplemented with 39,755 pseudo-labeled frames. This benchmark covers 8 anatomical sites and 6 challenging clinical scenarios. We also present BleedOrigin-Net, a novel dual-stage detection-tracking framework for the bleeding source localization in ESD procedures, addressing the complete workflow from bleeding onset detection to continuous spatial tracking. We compare with widely-used object detection models (YOLOv11/v12), multimodal large language models, and point tracking methods. Extensive evaluation demonstrates state-of-the-art performance, achieving 96.85% frame-level accuracy ($\pm\leq8$ frames) for bleeding onset detection, 70.24% pixel-level accuracy ($\leq100$ px) for initial source detection, and 96.11% pixel-level accuracy ($\leq100$ px) for point tracking.
Chinese: 本研究推出了首个全面的ESD手术出血源检测数据集BleedOrigin-Bench,并提出了双阶段检测-追踪框架BleedOrigin-Net,在存在视觉干扰的情况下实现了出血源定位与追踪的最优性能。
English: This study introduces BleedOrigin-Bench, the first comprehensive dataset for bleeding source detection in ESD procedures, and proposes BleedOrigin-Net, a dual-stage detection-tracking framework that achieves state-of-the-art performance in localizing and tracking bleeding sources despite visual challenges.
Authors:Shuiguang Deng, Di Yu, Changze Lv, Xin Du, Linshan Jiang, Xiaofan Zhao, Wentao Tong, Xiaoqing Zheng, Weijia Fang, Peng Zhao, Gang Pan, Schahram Dustdar, Albert Y. Zomaya
Abstract:
The convergence of artificial intelligence and edge computing has spurred growing interest in enabling intelligent services directly on resource-constrained devices. While traditional deep learning models require significant computational resources and centralized data management, the resulting latency, bandwidth consumption, and privacy concerns have exposed critical limitations in cloud-centric paradigms. Brain-inspired computing, particularly Spiking Neural Networks (SNNs), offers a promising alternative by emulating biological neuronal dynamics to achieve low-power, event-driven computation. This survey provides a comprehensive overview of Edge Intelligence based on SNNs (EdgeSNNs), examining their potential to address the challenges of on-device learning, inference, and security in edge scenarios. We present a systematic taxonomy of EdgeSNN foundations, encompassing neuron models, learning algorithms, and supporting hardware platforms. Three representative practical considerations of EdgeSNN are discussed in depth: on-device inference using lightweight SNN models, resource-aware training and updating under non-stationary data conditions, and secure and privacy-preserving issues. Furthermore, we highlight the limitations of evaluating EdgeSNNs on conventional hardware and introduce a dual-track benchmarking strategy to support fair comparisons and hardware-aware optimization. Through this study, we aim to bridge the gap between brain-inspired learning and practical edge deployment, offering insights into current advancements, open challenges, and future research directions. To the best of our knowledge, this is the first dedicated and comprehensive survey on EdgeSNNs, providing an essential reference for researchers and practitioners working at the intersection of neuromorphic computing and edge intelligence.
中文: 本综述探讨了基于脉冲神经网络的边缘智能(EdgeSNNs),将其作为受大脑启发的低功耗解决方案,用于解决设备端学习与推理中的资源限制和安全挑战,并引入双轨基准测试策略以实现公平评估。
English: This survey explores Edge Intelligence based on Spiking Neural Networks (EdgeSNNs) as a brain-inspired, low-power solution for on-device learning and inference, addressing challenges like resource constraints and security while introducing a dual-track benchmarking strategy for fair evaluation.
Authors:Dongming Jin, Weisong Sun, Jiangping Huang, Peng Liang, Jifeng Xuan, Yang Liu, Zhi Jin
Abstract:
Requirements development is a critical phase as it is responsible for providing a clear understanding of what stakeholders need. It involves collaboration among stakeholders to extract explicit requirements and address potential conflicts, which is time-consuming and labor-intensive. Recently, multi-agent systems for software development have attracted much attention. However, existing research provides limited support for requirements development and overlooks the injection of human knowledge into agents and the human-agent collaboration. % To address these issues, this paper proposes a knowledge-driven multi-agent framework for intelligent requirement development, named iReDev. iReDev features: iReDev consists of six knowledge-driven agents to support the entire requirements development. They collaboratively perform various tasks to produce a software requirements specification. iReDev focuses on integrating human knowledge for agents, enabling them to simulate real-world stakeholders. iReDev uses an event-driven communication mechanism based on an artifact pool. Agents continuously monitor the pool and autonomously trigger the next action based on its changes, enabling iReDev to handle new requirements quickly. iReDev introduces a human-in-the-loop mechanism to support human-agent collaboration, ensuring that the generated artifacts align with the expectations of stakeholders. We evaluated the generated artifacts and results show that iReDev outperforms existing baselines in multiple aspects. We further envision three key directions and hope this work can facilitate the development of intelligent requirements development.
中文: 本文提出iReDev知识驱动多智能体框架,通过事件驱动机制整合人类知识并实现人机协同,显著提升了智能需求开发的效率与质量,验证了其优于现有基准方法的性能。
English: This paper introduces iReDev, a knowledge-driven multi-agent framework that enhances intelligent requirements development by integrating human knowledge and enabling human-agent collaboration through an event-driven mechanism, demonstrating superior performance over existing baselines.
Authors:Lance Ying, Katherine M. Collins, Prafull Sharma, Cedric Colas, Kaiya Ivy Zhao, Adrian Weller, Zenna Tavares, Phillip Isola, Samuel J. Gershman, Jacob D. Andreas, Thomas L. Griffiths, Francois Chollet, Kelsey R. Allen, Joshua B. Tenenbaum
Abstract:
Human intelligence exhibits a remarkable capacity for rapid adaptation and effective problem-solving in novel and unfamiliar contexts. We argue that this profound adaptability is fundamentally linked to the efficient construction and refinement of internal representations of the environment, commonly referred to as world models, and we refer to this adaptation mechanism as world model induction. However, current understanding and evaluation of world models in artificial intelligence (AI) remains narrow, often focusing on static representations learned from training on massive corpora of data, instead of the efficiency and efficacy in learning these representations through interaction and exploration within a novel environment. In this Perspective, we provide a view of world model induction drawing on decades of research in cognitive science on how humans learn and adapt so efficiently; we then call for a new evaluation framework for assessing adaptive world models in AI. Concretely, we propose a new benchmarking paradigm based on suites of carefully designed games with genuine, deep and continually refreshing novelty in the underlying game structures -- we refer to this class of games as novel games. We detail key desiderata for constructing these games and propose appropriate metrics to explicitly challenge and evaluate the agent's ability for rapid world model induction. We hope that this new evaluation framework will inspire future evaluation efforts on world models in AI and provide a crucial step towards developing AI systems capable of human-like rapid adaptation and robust generalization -- a critical component of artificial general intelligence.
中文: 摘要提出人类适应能力源于高效的世界模型归纳,并呼吁通过新颖游戏构建新的AI评估框架来测试快速模型学习能力,以推动通用人工智能发展。
English: The abstract proposes that human adaptability stems from efficient world model induction and calls for a new AI evaluation framework using novel games to assess rapid model learning, aiming to advance artificial general intelligence.
Authors:Lionel Wong, Katherine M. Collins, Lance Ying, Cedegao E. Zhang, Adrian Weller, Tobias Gerstenberg, Timothy O'Donnell, Alexander K. Lew, Jacob D. Andreas, Joshua B. Tenenbaum, Tyler Brooke-Wilson
Abstract:
When faced with novel situations, people are able to marshal relevant considerations from a wide range of background knowledge and put these to use in inferences and predictions. What permits us to draw in globally relevant information and reason over it coherently? Here, we explore the hypothesis that people use a combination of distributed and symbolic representations to construct bespoke mental models tailored to novel situations. We propose a computational implementation of this idea -- a ``Model Synthesis Architecture'' (MSA) -- using language models to implement global relevance-based retrieval and model synthesis and probabilistic programs to implement bespoke, coherent world models. We evaluate our MSA as a model of human judgments on a novel reasoning dataset. The dataset -- built around a `Model Olympics` domain of sports vignettes -- tests models' capacity for human-like, open-ended reasoning by requiring (i) judgments about novel causal structures described in language; (ii) drawing on large bodies of background knowledge; and (iii) doing both in light of observations that introduce arbitrary novel variables. Our MSA approach captures human judgments better than language model-only baselines, under both direct and chain-of-thought generations from the LM that supports model synthesis. These results suggest that MSAs can be implemented in a way that mirrors people's ability to deliver locally coherent reasoning over globally relevant variables, offering a path to understanding and replicating human reasoning in open-ended domains.
中文: 人们通过分布式与符号化表征的结合为新颖情境构建定制化心理模型,这一计算实现的模型合成架构在模拟人类开放式推理方面优于纯语言模型基线。
English: People construct tailored mental models for novel situations using distributed and symbolic representations, which is computationally implemented through a Model Synthesis Architecture that outperforms language model-only baselines in capturing human reasoning.
Authors:Jiaxuan Xie, Chengwu Liu, Ye Yuan, Siqi Li, Zhiping Xiao, Ming Zhang
Abstract:
Efficient and accurate autoformalization methods, which leverage large-scale datasets of extensive natural language mathematical problems to construct formal language datasets, are key to advancing formal mathematical reasoning. In this paper, we propose an autoformalization pipeline based on large language models with error feedback, achieving a fully automatic and training-free formalization approach. Using this pipeline, we curate an Olympiad-level dataset aligning natural language problems with Lean formalizations. The dataset comprises $3,922$ mathematical problems in natural language and $9,787$ in Lean, of which $64.46\%$ were assessed as at least above-average quality, making it suitable as a benchmark for automated theorem provers. Additionally, we investigate the formalization and reasoning capabilities of various LLMs and empirically demonstrate that few-shot learning, error feedback, and increasing sampling numbers enhance the autoformalization process. Experiments of three automated theorem provers on the \dataset\ dataset also highlight its challenging nature and its value as a benchmark for formal reasoning tasks.
Chinese: 本文提出了一种基于大语言模型和错误反馈的自动形式化流程,构建了一个奥林匹克数学级别的自然语言与Lean形式化数据集,该数据集可作为自动定理证明器的挑战性基准,并证明少量样本学习和错误反馈能提升形式化效果。
English: This paper introduces an autoformalization pipeline using large language models with error feedback to create a high-quality Olympiad-level dataset of natural language and Lean formalizations, which serves as a challenging benchmark for automated theorem provers and demonstrates that few-shot learning and error feedback improve the process.
Authors:Shixun Wu, Jinwen Pan, Jinyang Liu, Jiannan Tian, Ziwei Qiu, Jiajun Huang, Kai Zhao, Xin Liang, Sheng Di, Zizhong Chen, Franck Cappello
Abstract:
As high-performance computing architectures evolve, more scientific computing workflows are being deployed on advanced computing platforms such as GPUs. These workflows can produce raw data at extremely high throughputs, requiring urgent high-ratio and low-latency error-bounded data compression solutions. In this paper, we propose cuSZ-Hi, an optimized high-ratio GPU-based scientific error-bounded lossy compressor with a flexible, domain-irrelevant, and fully open-source framework design. Our novel contributions are: 1) We maximally optimize the parallelized interpolation-based data prediction scheme on GPUs, enabling the full functionalities of interpolation-based scientific data prediction that are adaptive to diverse data characteristics; 2) We thoroughly explore and investigate lossless data encoding techniques, then craft and incorporate the best-fit lossless encoding pipelines for maximizing the compression ratio of cuSZ-Hi; 3) We systematically evaluate cuSZ-Hi on benchmarking datasets together with representative baselines. Compared to existing state-of-the-art scientific lossy compressors, with comparative or better throughput than existing high-ratio scientific error-bounded lossy compressors on GPUs, cuSZ-Hi can achieve up to 249% compression ratio improvement under the same error bound, and up to 215% compression ratio improvement under the same decompression data PSNR.
中文:本文提出的cuSZ-Hi是一种优化的基于GPU的误差有损压缩器,相比现有方法,在科学数据压缩中显著提升了压缩比并保持了高吞吐量。
English: This paper introduces cuSZ-Hi, an optimized GPU-based error-bounded lossy compressor that significantly enhances compression ratios and maintains high throughput for scientific data compared to existing methods.
Authors:Yuhu Bai, Jiangning Zhang, Yunkang Cao, Guangyuan Lu, Qingdong He, Xiangtai Li, Guanzhong Tian
Abstract:
With the advent of vision-language models (e.g., CLIP) in zero- and few-shot settings, CLIP has been widely applied to zero-shot anomaly detection (ZSAD) in recent research, where the rare classes are essential and expected in many applications. This study introduces \textbf{FiSeCLIP} for ZSAD with training-free \textbf{CLIP}, combining the feature matching with the cross-modal alignment. Testing with the entire dataset is impractical, while batch-based testing better aligns with real industrial needs, and images within a batch can serve as mutual reference points. Accordingly, FiSeCLIP utilizes other images in the same batch as reference information for the current image. However, the lack of labels for these references can introduce ambiguity, we apply text information to \textbf{fi}lter out noisy features. In addition, we further explore CLIP's inherent potential to restore its local \textbf{se}mantic correlation, adapting it for fine-grained anomaly detection tasks to enable a more accurate filtering process. Our approach exhibits superior performance for both anomaly classification and segmentation on anomaly detection benchmarks, building a stronger baseline for the direction, e.g., on MVTec-AD, FiSeCLIP outperforms the SOTA AdaCLIP by +4.6\%$\uparrow$/+5.7\%$\uparrow$ in segmentation metrics AU-ROC/$F_1$-max.
中文: 本研究提出FiSeCLIP方法,通过结合特征匹配与跨模态对齐,利用批次内图像作为参考并采用文本引导的特征过滤机制,在零样本异常检测任务中实现了优越的性能表现。
English: This study introduces FiSeCLIP, a training-free method that enhances zero-shot anomaly detection by combining feature matching with cross-modal alignment and leveraging batch-based references with text-guided filtering for improved performance.
Authors:Haoye Chai, Yuan Yuan, Yong Li
Abstract:
Accurate modeling and simulation of mobile networks are essential for enabling intelligent and cost-effective network optimization. In this paper, we propose MobiWorld, a generative world model designed to support high-fidelity and flexible environment simulation for mobile network planning and optimization. Unlike traditional predictive models constrained by limited generalization capabilities, MobiWorld exhibits strong universality by integrating heterogeneous data sources, including sensors, mobile devices, and base stations, as well as multimodal data types such as sequences and images. It is capable of generating both network element-level observations (e.g., traffic load, user distribution) and system-level performance indicators (e.g., throughput, energy consumption) to support a wide range of planning and optimization tasks. Built upon advanced diffusion models, MobiWorld offers powerful controllable generation capabilities by modeling the joint distribution between mobile network data and diverse conditional factors including spatio temporal contexts, user behaviors, and optimization policies. This enables accurate simulation of dynamic network states under varying policy configurations, providing optimization agents with precise environmental feedback and facilitating effective decision-making without relying on costly real-network interactions. We demonstrate the effectiveness of MobiWorld in a collaborative energy-saving scenario, where an agent uses observations and rewards generated by MobiWorld to optimize base station sleep and user offloading policies. Experimental results show that MobiWorld exhibits strong controllable generation performance and outperforms traditional methods in energy optimization.
中文: MobiWorld是一种生成式世界模型,通过整合异构数据和扩散模型,为移动网络提供高保真、可控的仿真环境,以支持高效规划与优化,并在节能场景中展现出优于传统方法的性能。
English: MobiWorld is a generative world model that integrates heterogeneous data and diffusion models to enable high-fidelity, controllable simulation of mobile networks for efficient planning and optimization, as demonstrated by its superior performance in energy-saving scenarios.
Authors:Florian Kofler, Marcel Rosier, Mehdi Astaraki, Hendrik Möller, Ilhem Isra Mekki, Josef A. Buchner, Anton Schmick, Arianna Pfiffer, Eva Oswald, Lucas Zimmer, Ezequiel de la Rosa, Sarthak Pati, Julian Canisius, Arianna Piffer, Ujjwal Baid, Mahyar Valizadeh, Akis Linardos, Jan C. Peeken, Surprosanna Shit, Felix Steinbauer, Daniel Rueckert, Rolf Heckemann, Spyridon Bakas, Jan Kirschke, Constantin von See, Ivan Ezhov, Marie Piraud, Benedikt Wiestler, Bjoern Menze
Abstract:
BrainLesion Suite is a versatile toolkit for building modular brain lesion image analysis pipelines in Python. Following Pythonic principles, BrainLesion Suite is designed to provide a 'brainless' development experience, minimizing cognitive effort and streamlining the creation of complex workflows for clinical and scientific practice. At its core is an adaptable preprocessing module that performs co-registration, atlas registration, and optional skull-stripping and defacing on arbitrary multi-modal input images. BrainLesion Suite leverages algorithms from the BraTS challenge to synthesize missing modalities, inpaint lesions, and generate pathology-specific tumor segmentations. BrainLesion Suite also enables quantifying segmentation model performance, with tools such as panoptica to compute lesion-wise metrics. Although BrainLesion Suite was originally developed for image analysis pipelines of brain lesions such as glioma, metastasis, and multiple sclerosis, it can be adapted for other biomedical image analysis applications. The individual BrainLesion Suite packages and tutorials are accessible on GitHub.
BrainLesion Suite 是一个 Python 工具包,通过模块化预处理、缺失模态合成和分割评估工具简化脑部病变分析流程的构建,并可扩展至其他生物医学影像应用。
BrainLesion Suite is a Python toolkit that simplifies building brain lesion analysis pipelines with modular preprocessing, missing modality synthesis, and segmentation evaluation tools, adaptable for broader biomedical imaging applications.
Authors:Sen Wang, Shao Zeng, Tianjun Gu, Zhizhong Zhang, Ruixin Zhang, Shouhong Ding, Jingyun Zhang, Jun Wang, Xin Tan, Yuan Xie, Lizhuang Ma
Abstract:
Low-level enhancement and high-level visual understanding in low-light vision have traditionally been treated separately. Low-light enhancement improves image quality for downstream tasks, but existing methods rely on physical or geometric priors, limiting generalization. Evaluation mainly focuses on visual quality rather than downstream performance. Low-light visual understanding, constrained by scarce labeled data, primarily uses task-specific domain adaptation, which lacks scalability. To address these challenges, we build a generalized bridge between low-light enhancement and low-light understanding, which we term Generalized Enhancement For Understanding (GEFU). This paradigm improves both generalization and scalability. To address the diverse causes of low-light degradation, we leverage pretrained generative diffusion models to optimize images, achieving zero-shot generalization performance. Building on this, we propose Semantically Consistent Unsupervised Fine-tuning (SCUF). Specifically, to overcome text prompt limitations, we introduce an illumination-aware image prompt to explicitly guide image generation and propose a cycle-attention adapter to maximize its semantic potential. To mitigate semantic degradation in unsupervised training, we propose caption and reflectance consistency to learn high-level semantics and image-level spatial semantics. Extensive experiments demonstrate that our proposed method outperforms current state-of-the-art methods in traditional image quality and GEFU tasks including classification, detection, and semantic segmentation.
中文摘要:提出的GEFU框架通过扩散模型实现零样本泛化,并结合SCUF方法引入光照感知提示和语义一致性机制,在图像质量及分类、检测、分割等下游任务中均超越现有最优方法。
English Summary: The proposed GEFU framework bridges low-light enhancement and understanding by leveraging diffusion models for zero-shot generalization and introducing SCUF with illumination-aware prompts and semantic consistency to outperform state-of-the-art methods in both image quality and downstream tasks.
Authors:Kentaro Seki, Shinnosuke Takamichi, Takaaki Saeki, Hiroshi Saruwatari
Abstract:
The construction of high-quality datasets is a cornerstone of modern text-to-speech (TTS) systems. However, the increasing scale of available data poses significant challenges, including storage constraints. To address these issues, we propose a TTS corpus construction method based on active learning. Unlike traditional feed-forward and model-agnostic corpus construction approaches, our method iteratively alternates between data collection and model training, thereby focusing on acquiring data that is more informative for model improvement. This approach enables the construction of a data-efficient corpus. Experimental results demonstrate that the corpus constructed using our method enables higher-quality speech synthesis than corpora of the same size.
中文摘要:本文提出的基于主动学习的TTS语料库构建方法,通过迭代整合数据采集与模型训练,能构建出比同等规模传统语料库更高效、合成质量更优的语音数据集。
English Summary: The proposed active learning-based method for TTS corpus construction iteratively combines data collection with model training to create more efficient datasets that yield higher-quality speech synthesis compared to same-sized traditional corpora.
Authors:Kui Jiang, Shiyu Liu, Junjun Jiang, Hongxun Yao, Xiaopeng Fan
Abstract:
Audio-driven talking head generation holds significant potential for film production. While existing 3D methods have advanced motion modeling and content synthesis, they often produce rendering artifacts, such as motion blur, temporal jitter, and local penetration, due to limitations in representing stable, fine-grained motion fields. Through systematic analysis, we reformulate talking head generation into a unified framework comprising three steps: video preprocessing, motion representation, and rendering reconstruction. This framework underpins our proposed M2DAO-Talker, which addresses current limitations via multi-granular motion decoupling and alternating optimization. Specifically, we devise a novel 2D portrait preprocessing pipeline to extract frame-wise deformation control conditions (motion region segmentation masks, and camera parameters) to facilitate motion representation. To ameliorate motion modeling, we elaborate a multi-granular motion decoupling strategy, which independently models non-rigid (oral and facial) and rigid (head) motions for improved reconstruction accuracy. Meanwhile, a motion consistency constraint is developed to ensure head-torso kinematic consistency, thereby mitigating penetration artifacts caused by motion aliasing. In addition, an alternating optimization strategy is designed to iteratively refine facial and oral motion parameters, enabling more realistic video generation. Experiments across multiple datasets show that M2DAO-Talker achieves state-of-the-art performance, with the 2.43 dB PSNR improvement in generation quality and 0.64 gain in user-evaluated video realness versus TalkingGaussian while with 150 FPS inference speed. Our project homepage is https://m2dao-talker.github.io/M2DAO-Talk.github.io.
中文摘要:本研究提出M2DAO-Talker框架,通过多粒度运动解耦和交替优化策略解决音频驱动说话头生成中的渲染伪影问题,在多个数据集上实现了最先进的生成质量和真实感表现。
English Summary: This study introduces M2DAO-Talker, a unified framework for audio-driven talking head generation that addresses rendering artifacts through multi-granular motion decoupling and alternating optimization, achieving state-of-the-art performance with significant improvements in generation quality and realness.
Authors:Benjamin Hodo, Tommaso Polonelli, Amirhossein Moallemi, Luca Benini, Michele Magno
Abstract:
We present EdgeCodec, an end-to-end neural compressor for barometric data collected from wind turbine blades. EdgeCodec leverages a heavily asymmetric autoencoder architecture, trained with a discriminator and enhanced by a Residual Vector Quantizer to maximize compression efficiency. It achieves compression rates between 2'560:1 and 10'240:1 while maintaining a reconstruction error below 3%, and operates in real time on the GAP9 microcontroller with bitrates ranging from 11.25 to 45 bits per second. Bitrates can be selected on a sample-by-sample basis, enabling on-the-fly adaptation to varying network conditions. In its highest compression mode, EdgeCodec reduces the energy consumption of wireless data transmission by up to 2.9x, significantly extending the operational lifetime of deployed sensor units.
中文:EdgeCodec是一种用于风力涡轮机叶片气压数据的端到端神经压缩器,可实现高达10,240:1的压缩比且误差低于3%,在微控制器上运行并将传输能耗降低2.9倍。
English: EdgeCodec is an end-to-end neural compressor for wind turbine barometric data, achieving compression ratios up to 10,240:1 with under 3% error and reducing transmission energy consumption by 2.9x on microcontrollers.
Authors:Yinuo Zhao, Jiale Yuan, Zhiyuan Xu, Xiaoshuai Hao, Xinyi Zhang, Kun Wu, Zhengping Che, Chi Harold Liu, Jian Tang
Abstract:
Recent advances in vision-language models (VLMs) have significantly improved performance in embodied tasks such as goal decomposition and visual comprehension. However, providing accurate rewards for robotic manipulation without fine-tuning VLMs remains challenging due to the absence of domain-specific robotic knowledge in pre-trained datasets and high computational costs that hinder real-time applicability. To address this, we propose $\mathrm{T}^2$-VLM, a novel training-free, temporally consistent framework that generates accurate rewards through tracking the status changes in VLM-derived subgoals. Specifically, our method first queries the VLM to establish spatially aware subgoals and an initial completion estimate before each round of interaction. We then employ a Bayesian tracking algorithm to update the goal completion status dynamically, using subgoal hidden states to generate structured rewards for reinforcement learning (RL) agents. This approach enhances long-horizon decision-making and improves failure recovery capabilities with RL. Extensive experiments indicate that $\mathrm{T}^2$-VLM achieves state-of-the-art performance in two robot manipulation benchmarks, demonstrating superior reward accuracy with reduced computation consumption. We believe our approach not only advances reward generation techniques but also contributes to the broader field of embodied AI. Project website: https://t2-vlm.github.io/.
中文摘要:本文提出的T²-VLM框架通过贝叶斯跟踪算法动态更新子目标状态来生成结构化奖励,无需微调视觉语言模型即可解决机器人操作任务中的奖励生成难题,在降低计算消耗的同时实现了最优性能。
English Summary: The proposed T²-VLM framework addresses the challenge of generating accurate robotic manipulation rewards without VLM fine-tuning by dynamically tracking subgoal completion through Bayesian updates, achieving state-of-the-art performance with reduced computational costs.
Authors:Han Li, Fei Liu, Zhenkun Wang, Qingfu Zhang
Abstract:
Vehicle routing problems (VRPs) are central to combinatorial optimization with significant practical implications. Recent advancements in neural combinatorial optimization (NCO) have demonstrated promising results by leveraging neural networks to solve VRPs, yet the exploration of model scaling within this domain remains underexplored. Inspired by the success of model scaling in large language models (LLMs), this study introduces a Large Routing Model with 1 billion parameters (LRM-1B), designed to address diverse VRP scenarios. We present a comprehensive evaluation of LRM-1B across multiple problem variants, distributions, and sizes, establishing state-of-the-art results. Our findings reveal that LRM-1B not only adapts to different VRP challenges but also showcases superior performance, outperforming existing models. Additionally, we explore the scaling behavior of neural routing models from 1M to 1B parameters. Our analysis confirms power-law between multiple model factors and performance, offering critical insights into the optimal configurations for foundation neural routing solvers.
中文摘要:本研究提出的LRM-1B十亿参数神经路由模型在多种车辆路径问题上实现了最优性能,同时揭示了模型参数与性能之间的幂律缩放关系。
English Summary: This study introduces LRM-1B, a billion-parameter neural routing model that achieves state-of-the-art performance across diverse vehicle routing problems while revealing power-law scaling relationships between model factors and performance.
Authors:Meng Xu, Frank Neumann, Aneta Neumann, Yew Soon Ong
Abstract:
Real-world optimization often demands diverse, high-quality solutions. Quality-Diversity (QD) optimization is a multifaceted approach in evolutionary algorithms that aims to generate a set of solutions that are both high-performing and diverse. QD algorithms have been successfully applied across various domains, providing robust solutions by exploring diverse behavioral niches. However, their application has primarily focused on static problems, with limited exploration in the context of dynamic combinatorial optimization problems. Furthermore, the theoretical understanding of QD algorithms remains underdeveloped, particularly when applied to learning heuristics instead of directly learning solutions in complex and dynamic combinatorial optimization domains, which introduces additional challenges. This paper introduces a novel QD framework for dynamic scheduling problems. We propose a map-building strategy that visualizes the solution space by linking heuristic genotypes to their behaviors, enabling their representation on a QD map. This map facilitates the discovery and maintenance of diverse scheduling heuristics. Additionally, we conduct experiments on both fixed and dynamically changing training instances to demonstrate how the map evolves and how the distribution of solutions unfolds over time. We also discuss potential future research directions that could enhance the learning process and broaden the applicability of QD algorithms to dynamic combinatorial optimization challenges.
中文: 本文针对动态调度问题提出了一种新颖的质量多样性框架,通过构建映射策略将启发式基因型与其行为关联,从而在静态和动态训练实例中实现多样化调度启发式的可视化与维护。
English: This paper introduces a novel Quality-Diversity framework for dynamic scheduling problems, employing a map-building strategy to visualize and maintain diverse scheduling heuristics across both static and dynamic training instances.
Authors:Wei Cui, Haoyu Wang, Wenkang Qin, Yijie Guo, Gang Han, Wen Zhao, Jiahang Cao, Zhang Zhang, Jiaru Zhong, Jingkai Sun, Pihai Sun, Shuai Shi, Botuo Jiang, Jiahao Ma, Jiaxu Wang, Hao Cheng, Zhichao Liu, Yang Wang, Zheng Zhu, Guan Huang, Jian Tang, Qiang Zhang
Abstract:
Humanoid robot technology is advancing rapidly, with manufacturers introducing diverse heterogeneous visual perception modules tailored to specific scenarios. Among various perception paradigms, occupancy-based representation has become widely recognized as particularly suitable for humanoid robots, as it provides both rich semantic and 3D geometric information essential for comprehensive environmental understanding. In this work, we present Humanoid Occupancy, a generalized multimodal occupancy perception system that integrates hardware and software components, data acquisition devices, and a dedicated annotation pipeline. Our framework employs advanced multi-modal fusion techniques to generate grid-based occupancy outputs encoding both occupancy status and semantic labels, thereby enabling holistic environmental understanding for downstream tasks such as task planning and navigation. To address the unique challenges of humanoid robots, we overcome issues such as kinematic interference and occlusion, and establish an effective sensor layout strategy. Furthermore, we have developed the first panoramic occupancy dataset specifically for humanoid robots, offering a valuable benchmark and resource for future research and development in this domain. The network architecture incorporates multi-modal feature fusion and temporal information integration to ensure robust perception. Overall, Humanoid Occupancy delivers effective environmental perception for humanoid robots and establishes a technical foundation for standardizing universal visual modules, paving the way for the widespread deployment of humanoid robots in complex real-world scenarios.
中文: Humanoid Occupancy系统提出了一种融合硬件、软件和专用数据集的多模态感知框架,通过占据表征为人形机器人提供鲁棒的环境理解能力,同时解决了运动干扰和遮挡等特有挑战。
English: The Humanoid Occupancy system introduces a multimodal perception framework integrating hardware, software, and a dedicated dataset to enable robust environmental understanding for humanoid robots through occupancy-based representations, addressing unique challenges like occlusion and kinematic interference.
Authors:Chia-Ming Lee, Bo-Cheng Qiu, Cheng-Jun Kang, Yi-Hsuan Wu, Jun-Lin Chen, Yu-Fan Lin, Yi-Shiuan Chou, Chih-Chung Hsu
Abstract:
Predicting online video popularity faces a critical challenge: prediction drift, where models trained on historical data rapidly degrade due to evolving viral trends and user behaviors. To address this temporal distribution shift, we propose an Anchored Multi-modal Clustering and Feature Generation (AMCFG) framework that discovers temporally-invariant patterns across data distributions. Our approach employs multi-modal clustering to reveal content structure, then leverages Large Language Models (LLMs) to generate semantic Anchor Features, such as audience demographics, content themes, and engagement patterns that transcend superficial trend variations. These semantic anchors, combined with cluster-derived statistical features, enable prediction based on stable principles rather than ephemeral signals. Experiments demonstrate that AMCFG significantly enhances both predictive accuracy and temporal robustness, achieving superior performance on out-of-distribution data and providing a viable solution for real-world video popularity prediction.
中文: AMCFG框架通过多模态聚类和大语言模型生成跨时间稳定的语义锚点特征,有效解决了在线视频流行度预测中的分布漂移问题,显著提升了预测准确性和时序鲁棒性。
English: The AMCFG framework tackles prediction drift in online video popularity by generating temporally-invariant semantic anchors through multi-modal clustering and LLMs, significantly improving accuracy and robustness against evolving trends.
Authors:Chia-Ming Lee, Bo-Cheng Qiu, Ting-Yao Chen, Ming-Han Sun, Fang-Ying Lin, Jung-Tse Tsai, I-An Tsai, Yu-Fan Lin, Chih-Chung Hsu
Abstract:
Multi-source CT-scan classification suffers from domain shifts that impair cross-source generalization. While preprocessing pipelines combining Spatial-Slice Feature Learning (SSFL++) and Kernel-Density-based Slice Sampling (KDS) have shown empirical success, the mechanisms underlying their domain robustness remain underexplored. This study analyzes how this input-space standardization manages the trade-off between local discriminability and cross-source generalization. The SSFL++ and KDS pipeline performs spatial and temporal standardization to reduce inter-source variance, effectively mapping disparate inputs into a consistent target space. This preemptive alignment mitigates domain shift and simplifies the learning task for network optimization. Experimental validation demonstrates consistent improvements across architectures, proving the benefits stem from the preprocessing itself. The approach's effectiveness was validated by securing first place in a competitive challenge, supporting input-space standardization as a robust and practical solution for multi-institutional medical imaging.
中文摘要:该研究表明,SSFL++与KDS预处理流程通过空间和时间标准化有效减少多源CT扫描中的域偏移,从而提升跨源泛化能力并简化网络优化任务。
English Summary: The study reveals that the SSFL++ and KDS preprocessing pipeline enhances multi-source CT-scan classification by standardizing spatial and temporal features to reduce domain shifts, thereby improving cross-source generalization and network optimization.
Authors:Shengyuan Wang, Jie Feng, Tianhui Liu, Dan Pei, Yong Li
Abstract:
Large language models (LLMs) possess extensive world knowledge, including geospatial knowledge, which has been successfully applied to various geospatial tasks such as mobility prediction and social indicator prediction. However, LLMs often generate inaccurate geospatial knowledge, leading to geospatial hallucinations (incorrect or inconsistent representations of geospatial information) that compromise their reliability. While the phenomenon of general knowledge hallucination in LLMs has been widely studied, the systematic evaluation and mitigation of geospatial hallucinations remain largely unexplored. To address this gap, we propose a comprehensive evaluation framework for geospatial hallucinations, leveraging structured geospatial knowledge graphs for controlled assessment. Through extensive evaluation across 20 advanced LLMs, we uncover the hallucinations in their geospatial knowledge. Building on these insights, we introduce a dynamic factuality aligning method based on Kahneman-Tversky Optimization (KTO) to mitigate geospatial hallucinations in LLMs, leading to a performance improvement of over 29.6% on the proposed benchmark. Extensive experimental results demonstrate the effectiveness of our benchmark and learning algorithm in enhancing the trustworthiness of LLMs in geospatial knowledge and reasoning tasks.
中文摘要:大语言模型存在严重的地理空间幻觉问题,本文提出评估框架和优化方法,将此类错误减少超过29.6%,有效提升了地理空间知识推理的可信度。
English summary: Large language models exhibit significant geospatial hallucinations despite their extensive knowledge, prompting the development of an evaluation framework and optimization method that reduces these errors by over 29.6% to enhance reliability.
Authors:Md Abrar Jahin, Shahriar Soudeep, M. F. Mridha, Muhammad Mostafa Monowar, Md. Abdul Hamid
Abstract:
Real-time particle transverse momentum ($p_T$) estimation in high-energy physics demands algorithms that are both efficient and accurate under strict hardware constraints. Static machine learning models degrade under high pileup and lack physics-aware optimization, while generic graph neural networks (GNNs) often neglect domain structure critical for robust $p_T$ regression. We propose a physics-informed GNN framework that systematically encodes detector geometry and physical observables through four distinct graph construction strategies that systematically encode detector geometry and physical observables: station-as-node, feature-as-node, bending angle-centric, and pseudorapidity ($η$)-centric representations. This framework integrates these tailored graph structures with a novel Message Passing Layer (MPL), featuring intra-message attention and gated updates, and domain-specific loss functions incorporating $p_{T}$-distribution priors. Our co-design methodology yields superior accuracy-efficiency trade-offs compared to existing baselines. Extensive experiments on the CMS Trigger Dataset validate the approach: a station-informed EdgeConv model achieves a state-of-the-art MAE of 0.8525 with $\ge55\%$ fewer parameters than deep learning baselines, especially TabNet, while an $η$-centric MPL configuration also demonstrates improved accuracy with comparable efficiency. These results establish the promise of physics-guided GNNs for deployment in resource-constrained trigger systems.
中文摘要:提出的物理信息图神经网络框架通过定制化图结构和领域特定损失函数,在实时粒子横向动量估计中相比现有方法实现了更优的准确性与效率平衡。
English Summary: The proposed physics-informed graph neural network framework, incorporating tailored graph structures and domain-specific loss functions, achieves superior accuracy and efficiency for real-time particle transverse momentum estimation compared to existing methods.
Authors:Yu-Fan Lin, Chia-Ming Lee, Chih-Chung Hsu
Abstract:
Shadows are a common factor degrading image quality. Single-image shadow removal (SR), particularly under challenging indirect illumination, is hampered by non-uniform content degradation and inherent ambiguity. Consequently, traditional methods often fail to simultaneously recover intra-shadow details and maintain sharp boundaries, resulting in inconsistent restoration and blurring that negatively affect both downstream applications and the overall viewing experience. To overcome these limitations, we propose the DenseSR, approaching the problem from a dense prediction perspective to emphasize restoration quality. This framework uniquely synergizes two key strategies: (1) deep scene understanding guided by geometric-semantic priors to resolve ambiguity and implicitly localize shadows, and (2) high-fidelity restoration via a novel Dense Fusion Block (DFB) in the decoder. The DFB employs adaptive component processing-using an Adaptive Content Smoothing Module (ACSM) for consistent appearance and a Texture-Boundary Recuperation Module (TBRM) for fine textures and sharp boundaries-thereby directly tackling the inconsistent restoration and blurring issues. These purposefully processed components are effectively fused, yielding an optimized feature representation preserving both consistency and fidelity. Extensive experimental results demonstrate the merits of our approach over existing methods. Our code can be available on https://github$.$com/VanLinLin/DenseSR
中文: DenseSR框架通过结合基于几何语义先验的深度场景理解与新型密集融合块,解决了单图像阴影去除中的模糊性问题,在恢复细节和保持清晰边界方面优于现有方法。
English: The DenseSR framework addresses single-image shadow removal by integrating deep scene understanding with geometric-semantic priors and a novel Dense Fusion Block to resolve ambiguity and restore details while maintaining sharp boundaries, outperforming existing methods.
Authors:Franco Oberti, Stefano Di Carlo, Alessandro Savino
Abstract:
The Controller Area Network (CAN) protocol, essential for automotive embedded systems, lacks inherent security features, making it vulnerable to cyber threats, especially with the rise of autonomous vehicles. Traditional security measures offer limited protection, such as payload encryption and message authentication. This paper presents a novel Intrusion Detection System (IDS) designed for the CAN environment, utilizing Hardware Performance Counters (HPCs) to detect anomalies indicative of cyber attacks. A RISC-V-based CAN receiver is simulated using the gem5 simulator, processing CAN frame payloads with AES-128 encryption as FreeRTOS tasks, which trigger distinct HPC responses. Key HPC features are optimized through data extraction and correlation analysis to enhance classification efficiency. Results indicate that this approach could significantly improve CAN security and address emerging challenges in automotive cybersecurity.
中文: 本文提出了一种基于硬件性能计数器的创新入侵检测系统,用于控制器局域网的网络安全防护,仿真结果表明该方法能有效提升汽车网络的安全性。
English: This paper introduces a novel Intrusion Detection System for the Controller Area Network that utilizes Hardware Performance Counters to detect cyber threats, demonstrating through simulations that it can significantly enhance automotive cybersecurity.
Authors:Daniëlle Schuman, Mark V. Seebode, Tobias Rohe, Maximilian Balthasar Mansky, Michael Schroedl-Baumann, Jonas Stein, Claudia Linnhoff-Popien, Florian Krellner
Abstract:
Exploiting the fact that samples drawn from a quantum annealer inherently follow a Boltzmann-like distribution, annealing-based Quantum Boltzmann Machines (QBMs) have gained increasing popularity in the quantum research community. While they harbor great promises for quantum speed-up, their usage currently stays a costly endeavor, as large amounts of QPU time are required to train them. This limits their applicability in the NISQ era. Following the idea of Noè et al. (2024), who tried to alleviate this cost by incorporating parallel quantum annealing into their unsupervised training of QBMs, this paper presents an improved version of parallel quantum annealing that we employ to train QBMs in a supervised setting. Saving qubits to encode the inputs, the latter setting allows us to test our approach on medical images from the MedMNIST data set (Yang et al., 2023), thereby moving closer to real-world applicability of the technology. Our experiments show that QBMs using our approach already achieve reasonable results, comparable to those of similarly-sized Convolutional Neural Networks (CNNs), with markedly smaller numbers of epochs than these classical models. Our parallel annealing technique leads to a speed-up of almost 70 % compared to regular annealing-based BM executions.
中文摘要:本文提出了一种改进的并行量子退火方法,用于监督式量子玻尔兹曼机训练,在医学图像数据上取得了与经典模型相当的结果,同时训练周期显著减少且速度提升近70%。
English Summary: This paper introduces an enhanced parallel quantum annealing method for training Quantum Boltzmann Machines in a supervised setting, achieving results comparable to classical models on medical image data with significantly fewer epochs and a 70% speed-up in training time.
Authors:Shuang Liang, Lili Chen, Wensheng Gan, Philip S. Yu, Shengjie Zhao
Abstract:
Compared to frequent pattern mining, sequential pattern mining emphasizes the temporal aspect and finds broad applications across various fields. However, numerous studies treat temporal events as single time points, neglecting their durations. Time-interval-related pattern (TIRP) mining is introduced to address this issue and has been applied to healthcare analytics, stock prediction, etc. Typically, mining all patterns is not only computationally challenging for accurate forecasting but also resource-intensive in terms of time and memory. Targeting the extraction of time-interval-related patterns based on specific criteria can improve data analysis efficiency and better align with customer preferences. Therefore, this paper proposes a novel algorithm called TaTIRP to discover Targeted Time-Interval Related Patterns. Additionally, we develop multiple pruning strategies to eliminate redundant extension operations, thereby enhancing performance on large-scale datasets. Finally, we conduct experiments on various real-world and synthetic datasets to validate the accuracy and efficiency of the proposed algorithm.
中文: 本文提出TaTIRP算法,通过剪枝策略高效挖掘目标时间间隔相关模式,解决了计算难题并提升数据分析效率,在多个数据集上进行了实验验证。
English: This paper introduces the TaTIRP algorithm to efficiently mine targeted time-interval related patterns using pruning strategies, addressing computational challenges and improving data analysis efficiency, with experimental validation on multiple datasets.
Authors:Alessio Caviglia, Filippo Marostica, Alessio Carpegna, Alessandro Savino, Stefano Di Carlo
Abstract:
Hardware accelerators are essential for achieving low-latency, energy-efficient inference in edge applications like image recognition. Spiking Neural Networks (SNNs) are particularly promising due to their event-driven and temporally sparse nature, making them well-suited for low-power Field Programmable Gate Array (FPGA)-based deployment. This paper explores using the open-source Spiker+ framework to generate optimized SNNs accelerators for handwritten digit recognition on the MNIST dataset. Spiker+ enables high-level specification of network topologies, neuron models, and quantization, automatically generating deployable HDL. We evaluate multiple configurations and analyze trade-offs relevant to edge computing constraints.
硬件加速器对于实现低延迟、高能效的边缘推理至关重要,其中脉冲神经网络因其事件驱动和时序稀疏特性特别适合FPGA部署,本文通过Spiker+框架自动生成针对MNIST手写数字识别的优化SNN加速器验证了这一优势。
Hardware accelerators are crucial for low-latency, energy-efficient edge inference, with Spiking Neural Networks (SNNs) being ideal for FPGA deployment due to their event-driven sparsity, as demonstrated by the Spiker+ framework's automated generation of optimized SNN accelerators for MNIST digit recognition.
Authors:Haoyang Li, Nana Hou, Yuchen Hu, Jixun Yao, Sabato Marco Siniscalchi, Eng Siong Chng
Abstract:
This work investigates speech enhancement (SE) from the perspective of language models (LMs). We propose a novel method that leverages Direct Preference Optimization (DPO) to improve the perceptual quality of enhanced speech. Using UTMOS, a neural MOS prediction model, as a proxy for human ratings, our approach guides optimization toward perceptually preferred outputs. This differs from existing LM-based SE methods that focus on maximizing the likelihood of clean speech tokens, which may misalign with human perception and degrade quality despite low prediction error. Experiments on the 2020 Deep Noise Suppression Challenge test sets demonstrate that applying DPO to a pretrained LM-based SE model yields consistent improvements across various speech quality metrics, with relative gains of up to 56%. To our knowledge, this is the first application of DPO to SE and the first to incorporate proxy perceptual feedback into LM-based SE training, pointing to a promising direction for perceptually aligned SE.
本研究提出了一种新颖的语音增强方法,通过直接偏好优化结合感知代理模型,将输出与人类偏好对齐而非传统似然最大化,实现了语音质量的显著提升。
This study introduces a novel speech enhancement method using Direct Preference Optimization guided by a proxy perceptual model, achieving significant quality improvements by aligning outputs with human preferences rather than traditional likelihood maximization.
Authors:Md Abrar Jahin, Taufikur Rahman Fuad, M. F. Mridha, Nafiz Fahad, Md. Jakir Hossen
Abstract:
Federated Learning (FL) faces inherent challenges in balancing model performance, privacy preservation, and communication efficiency, especially in non-IID decentralized environments. Recent approaches either sacrifice formal privacy guarantees, incur high overheads, or overlook quantum-enhanced expressivity. We introduce AdeptHEQ-FL, a unified hybrid classical-quantum FL framework that integrates (i) a hybrid CNN-PQC architecture for expressive decentralized learning, (ii) an adaptive accuracy-weighted aggregation scheme leveraging differentially private validation accuracies, (iii) selective homomorphic encryption (HE) for secure aggregation of sensitive model layers, and (iv) dynamic layer-wise adaptive freezing to minimize communication overhead while preserving quantum adaptability. We establish formal privacy guarantees, provide convergence analysis, and conduct extensive experiments on the CIFAR-10, SVHN, and Fashion-MNIST datasets. AdeptHEQ-FL achieves a $\approx 25.43\%$ and $\approx 14.17\%$ accuracy improvement over Standard-FedQNN and FHE-FedQNN, respectively, on the CIFAR-10 dataset. Additionally, it reduces communication overhead by freezing less important layers, demonstrating the efficiency and practicality of our privacy-preserving, resource-aware design for FL.
中文: AdeptHEQ-FL是一个融合经典与量子计算的联邦学习框架,通过差分隐私和选择性加密确保隐私安全,利用自适应层冻结减少通信开销,在基准数据集上显著提升了模型精度和效率。
English: AdeptHEQ-FL is a hybrid classical-quantum federated learning framework that enhances model accuracy, ensures privacy through differential privacy and selective encryption, and reduces communication overhead via adaptive layer freezing, outperforming existing methods on benchmark datasets.
Authors:Mohamed Elmoghany, Ryan Rossi, Seunghyun Yoon, Subhojyoti Mukherjee, Eslam Bakr, Puneet Mathur, Gang Wu, Viet Dac Lai, Nedim Lipka, Ruiyi Zhang, Varun Manjunatha, Chien Nguyen, Daksh Dangi, Abel Salinas, Mohammad Taesiri, Hongjie Chen, Xiaolei Huang, Joe Barrow, Nesreen Ahmed, Hoda Eldardiry, Namyong Park, Yu Wang, Jaemin Cho, Anh Totti Nguyen, Zhengzhong Tu, Thien Nguyen, Dinesh Manocha, Mohamed Elhoseiny, Franck Dernoncourt
Abstract:
Despite the significant progress that has been made in video generative models, existing state-of-the-art methods can only produce videos lasting 5-16 seconds, often labeled "long-form videos". Furthermore, videos exceeding 16 seconds struggle to maintain consistent character appearances and scene layouts throughout the narrative. In particular, multi-subject long videos still fail to preserve character consistency and motion coherence. While some methods can generate videos up to 150 seconds long, they often suffer from frame redundancy and low temporal diversity. Recent work has attempted to produce long-form videos featuring multiple characters, narrative coherence, and high-fidelity detail. We comprehensively studied 32 papers on video generation to identify key architectural components and training strategies that consistently yield these qualities. We also construct a comprehensive novel taxonomy of existing methods and present comparative tables that categorize papers by their architectural designs and performance characteristics.
中文: 当前视频生成模型局限于16秒内的短片,难以保持角色一致性和场景连贯性,为此我们系统分析了32篇论文,旨在找出能生成高质量长视频的有效架构组件和训练策略。
English: Current video generation models are limited to short clips under 16 seconds with challenges in maintaining character consistency and scene coherence, prompting a comprehensive analysis of 32 papers to identify effective architectural components and training strategies for producing high-quality long-form videos.
Authors:Jing Liang, Hongyao Tang, Yi Ma, Jinyi Liu, Yan Zheng, Shuyue Hu, Lei Bai, Jianye Hao
Abstract:
Reinforcement Learning (RL) has demonstrated its potential to improve the reasoning ability of Large Language Models (LLMs). One major limitation of most existing Reinforcement Finetuning (RFT) methods is that they are on-policy RL in nature, i.e., data generated during the past learning process is not fully utilized. This inevitably comes at a significant cost of compute and time, posing a stringent bottleneck on continuing economic and efficient scaling. To this end, we launch the renaissance of off-policy RL and propose Reincarnating Mix-policy Proximal Policy Gradient (ReMix), a general approach to enable on-policy RFT methods like PPO and GRPO to leverage off-policy data. ReMix consists of three major components: (1) Mix-policy proximal policy gradient with an increased Update-To-Data (UTD) ratio for efficient training; (2) KL-Convex policy constraint to balance the trade-off between stability and flexibility; (3) Policy reincarnation to achieve a seamless transition from efficient early-stage learning to steady asymptotic improvement. In our experiments, we train a series of ReMix models upon PPO, GRPO and 1.5B, 7B base models. ReMix shows an average Pass@1 accuracy of 52.10% (for 1.5B model) with 0.079M response rollouts, 350 training steps and achieves 63.27%/64.39% (for 7B model) with 0.007M/0.011M response rollouts, 50/75 training steps, on five math reasoning benchmarks (i.e., AIME'24, AMC'23, Minerva, OlympiadBench, and MATH500). Compared with 15 recent advanced models, ReMix shows SOTA-level performance with an over 30x to 450x reduction in training cost in terms of rollout data volume. In addition, we reveal insightful findings via multifaceted analysis, including the implicit preference for shorter responses due to the Whipping Effect of off-policy discrepancy, the collapse mode of self-reflection behavior under the presence of severe off-policyness, etc.
中文: 强化学习能提升大语言模型的推理能力,但现有方法因采用同策略学习而效率低下;为此提出的ReMix异策略方法,以极低的训练成本在数学推理基准上实现了领先性能。
English: Reinforcement Learning enhances Large Language Models' reasoning, but existing on-policy methods are inefficient, prompting the development of ReMix, an off-policy approach that significantly reduces training costs while achieving state-of-the-art performance on math benchmarks.
Authors:Daojie Peng, Jiahang Cao, Qiang Zhang, Jun Ma
Abstract:
Object navigation in open-world environments remains a formidable and pervasive challenge for robotic systems, particularly when it comes to executing long-horizon tasks that require both open-world object detection and high-level task planning. Traditional methods often struggle to integrate these components effectively, and this limits their capability to deal with complex, long-range navigation missions. In this paper, we propose LOVON, a novel framework that integrates large language models (LLMs) for hierarchical task planning with open-vocabulary visual detection models, tailored for effective long-range object navigation in dynamic, unstructured environments. To tackle real-world challenges including visual jittering, blind zones, and temporary target loss, we design dedicated solutions such as Laplacian Variance Filtering for visual stabilization. We also develop a functional execution logic for the robot that guarantees LOVON's capabilities in autonomous navigation, task adaptation, and robust task completion. Extensive evaluations demonstrate the successful completion of long-sequence tasks involving real-time detection, search, and navigation toward open-vocabulary dynamic targets. Furthermore, real-world experiments across different legged robots (Unitree Go2, B2, and H1-2) showcase the compatibility and appealing plug-and-play feature of LOVON.
中文: LOVON是一种创新框架,通过结合大语言模型与开放词汇视觉检测技术,使机器人能够在动态环境中实现长距离目标导航,并采用拉普拉斯方差滤波等专门方案解决视觉抖动和临时目标丢失等实际问题。
English: LOVON is a novel framework that integrates large language models with open-vocabulary visual detection to enable robots to perform long-range object navigation in dynamic environments, addressing challenges like visual jittering and temporary target loss through specialized solutions such as Laplacian Variance Filtering.
Authors:Zheng Lian, Licai Sun, Haoyu Chen, Zebang Cheng, Fan Zhang, Ziyu Jia, Ziyang Ma, Fei Ma, Xiaojiang Peng, Jianhua Tao
Abstract:
With the recent success of large language models, Explainable Multimodal Emotion Recognition (EMER), also known as Descriptive MER (DMER), has attracted growing attention from researchers. Unlike traditional discriminative methods that rely on predefined emotion taxonomies, EMER aims to describe a person's emotional state using free-form natural language, thereby enabling fine-grained and interpretable emotion representations. However, this free-form prediction paradigm introduces significant challenges in evaluation. Existing approaches either depend on ground-truth descriptions, which require extensive manual annotations and often fail to capture the full complexity of human emotions, or simplify the evaluation task by shifting focus from assessing descriptions to evaluating emotion labels. However, this simplification overlooks critical aspects such as emotional temporal dynamics, intensity, and uncertainty. To address these limitations, we propose EMER-Ranker, a novel evaluation strategy that reformulates the traditional ``prediction-ground truth'' comparison into the ``prediction-prediction'' comparison, eliminating the need for ground-truth descriptions. We then apply the Bradley-Terry algorithm to convert pairwise comparison outcomes into model-level rankings. Additionally, we explore the potential for automatic preference prediction and introduce EMER-Preference, the first preference dataset specifically designed for human emotions. Our work advances the field of EMER and lays the foundation for more intelligent human-computer interaction systems.
中文摘要:本文提出EmoPrefer,首次探索利用多模态大语言模型实现高效的人类情感偏好标注,通过构建首个情感偏好数据集和评估基准,解决了描述性多模态情感识别中的评估难题。
English Summary: This paper introduces EmoPrefer, a pioneering study that explores using multimodal large language models to efficiently annotate human emotion preferences, addressing challenges in Descriptive Multimodal Emotion Recognition by creating the first emotion preference dataset and benchmark.
Authors:Zheng Lian, Licai Sun, Lan Chen, Haoyu Chen, Zebang Cheng, Fan Zhang, Ziyu Jia, Ziyang Ma, Fei Ma, Xiaojiang Peng, Jianhua Tao
Abstract:
Descriptive Multimodal Emotion Recognition (DMER) has garnered increasing research attention. Unlike traditional discriminative paradigms that rely on predefined emotion taxonomies, DMER aims to describe human emotional state using free-form natural language, enabling finer-grained and more interpretable emotion representations. However, this free-form prediction paradigm introduces new challenges regarding its evaluation. Previous works depend on ground-truth descriptions, but emotions are inherently tied to diverse human behaviors, and generating a comprehensive and accurate description is inherently demanding. Other researchers reformulate this problem into a more tractable human preference learning task, but pairwise preference annotation involves substantial manual effort. This leads to a question: can we leverage multimodal LLMs (MLLMs) to achieve more cost-efficient preference annotation? To answer this, we propose EmoPrefer, a pioneering work exploring the potential of LLMs in decoding human emotion preferences. Specifically, we construct the first emotion preference dataset, EmoPrefer-Data, featuring high-quality preference annotations from experts. Additionally, we introduce EmoPrefer-Bench, which evaluates the performance of various MLLMs and prompting techniques in preference prediction, while also revealing new strategies to enhance their performance. To the best of our knowledge, this is the first work exploring the capabilities of LLMs in understanding human emotion preferences. Our work advances the field of DMER and lays the foundation for more intelligent human-computer interaction.
中文摘要:本文提出EmoPrefer,首次探索利用多模态大语言模型实现高效的人类情感偏好标注,通过构建首个情感偏好数据集和评估基准,解决了描述性多模态情感识别中的评估难题。
English Summary: This paper introduces EmoPrefer, a pioneering study that explores using multimodal large language models to efficiently annotate human emotion preferences, addressing challenges in Descriptive Multimodal Emotion Recognition by creating the first emotion preference dataset and benchmark.
Authors:Chia-Ming Lee, Bo-Cheng Qiu, Ting-Yao Chen, Ming-Han Sun, Fang-Ying Lin, Jung-Tse Tsai, I-An Tsai, Yu-Fan Lin, Chih-Chung Hsu
Abstract:
We present our solution for the Multi-Source COVID-19 Detection Challenge, which classifies chest CT scans from four distinct medical centers. To address multi-source variability, we employ the Spatial-Slice Feature Learning (SSFL) framework with Kernel-Density-based Slice Sampling (KDS). Our preprocessing pipeline combines lung region extraction, quality control, and adaptive slice sampling to select eight representative slices per scan. We compare EfficientNet and Swin Transformer architectures on the validation set. The EfficientNet model achieves an F1-score of 94.68%, compared to the Swin Transformer's 93.34%. The results demonstrate the effectiveness of our KDS-based pipeline on multi-source data and highlight the importance of dataset balance in multi-institutional medical imaging evaluation.
中文摘要:我们针对多源COVID-19检测挑战提出的解决方案,采用基于核密度切片采样的空间切片特征学习框架处理四家医疗中心的胸部CT扫描,通过EfficientNet模型取得94.68%的F1值,验证了该流程在多源数据上的有效性。
English Summary: Our solution for the Multi-Source COVID-19 Detection Challenge uses the SSFL framework with KDS sampling to classify chest CT scans from four medical centers, achieving a 94.68% F1-score with EfficientNet and demonstrating the pipeline's effectiveness on multi-source data.
Authors:Duc Thien Hua, Mohammadali Mohammadi, Hien Quoc Ngo, Michail Matthaiou
Abstract:
We investigate the integration of beyond diagonal reconfigurable intelligent surfaces (BDRISs) into cell free massive multiple input multiple output (CFmMIMO) systems to enhance simultaneous wireless information and power transfer (SWIPT). To simultaneously support two groups of users energy receivers (ERs) and information receivers (IRs) without sacrificing time frequency resources, a subset of access points (APs) is dedicated to serving ERs with the aid of a BDRIS, while the remaining APs focus on supporting IRs. A protective partial zero forcing precoding technique is implemented at the APs to manage the non coherent interference between the ERs and IRs. Subsequently, closed form expressions for the spectral efficiency of the IRs and the average sum of harvested energy at the ERs are leveraged to formulate a comprehensive optimization problem. This problem jointly optimizes the AP selection, AP power control, and scattering matrix design at the BDRIS, all based on long term statistical channel state information. This challenging problem is then effectively transformed into more tractable forms. To solve these sub problems, efficient algorithms are proposed, including a heuristic search for the scattering matrix design, as well as successive convex approximation and deep reinforcement learning methods for the joint AP mode selection and power control design. Numerical results show that a BDRIS with a group or fully connected architecture achieves significant energy harvesting gains over the conventional diagonal RIS, especially delivering up to a seven fold increase in the average sum of harvested energy when a heuristic based scattering matrix design is employed.
中文: 本研究将超对角可重构智能表面集成到无蜂窝大规模MIMO系统中,以增强同步无线信息和能量传输,通过提出联合优化的高效算法,相比传统对角智能表面实现了显著的能量收集增益。
English: This study integrates beyond diagonal reconfigurable intelligent surfaces into cell-free massive MIMO systems to enhance simultaneous wireless information and power transfer, proposing efficient algorithms for joint optimization that achieve significant energy harvesting gains over conventional diagonal RIS.
Authors:Xiaohan Li, Ziren Gong, Fabio Tosi, Matteo Poggi, Stefano Mattoccia, Dong Liu, Jun Wu
Abstract:
3D Gaussian Splatting (3DGS) has recently gained popularity in SLAM applications due to its fast rendering and high-fidelity representation. However, existing 3DGS-SLAM systems have predominantly focused on indoor environments and relied on active depth sensors, leaving a gap for large-scale outdoor applications. We present BGS-SLAM, the first binocular 3D Gaussian Splatting SLAM system designed for outdoor scenarios. Our approach uses only RGB stereo pairs without requiring LiDAR or active sensors. BGS-SLAM leverages depth estimates from pre-trained deep stereo networks to guide 3D Gaussian optimization with a multi-loss strategy enhancing both geometric consistency and visual quality. Experiments on multiple datasets demonstrate that BGS-SLAM achieves superior tracking accuracy and mapping performance compared to other 3DGS-based solutions in complex outdoor environments.
中文: BGS-SLAM是首个专为户外场景设计的双目3D高斯溅射SLAM系统,仅使用RGB立体图像无需主动传感器,通过多损失优化实现了卓越的跟踪与建图性能。
English: BGS-SLAM is the first binocular 3D Gaussian Splatting SLAM system designed for outdoor environments, utilizing only RGB stereo pairs without active sensors and achieving superior tracking and mapping performance through multi-loss optimization.
Authors:Ziren Gong, Xiaohan Li, Fabio Tosi, Jiawei Han, Stefano Mattoccia, Jianfei Cai, Matteo Poggi
Abstract:
We present Ov3R, a novel framework for open-vocabulary semantic 3D reconstruction from RGB video streams, designed to advance Spatial AI. The system features two key components: CLIP3R, a CLIP-informed 3D reconstruction module that predicts dense point maps from overlapping clips while embedding object-level semantics; and 2D-3D OVS, a 2D-3D open-vocabulary semantic module that lifts 2D features into 3D by learning fused descriptors integrating spatial, geometric, and semantic cues. Unlike prior methods, Ov3R incorporates CLIP semantics directly into the reconstruction process, enabling globally consistent geometry and fine-grained semantic alignment. Our framework achieves state-of-the-art performance in both dense 3D reconstruction and open-vocabulary 3D segmentation, marking a step forward toward real-time, semantics-aware Spatial AI.
中文: Ov3R是一种基于RGB视频的开放词汇语义三维重建创新框架,通过直接融入CLIP语义实现全局一致的几何结构和细粒度语义对齐,在密集重建和三维分割方面达到领先水平,推动了空间人工智能的发展。
English: Ov3R is an innovative framework for open-vocabulary semantic 3D reconstruction from RGB videos, integrating CLIP semantics directly to achieve globally consistent geometry and fine-grained semantic alignment, setting new standards in dense reconstruction and 3D segmentation for Spatial AI.
Authors:Maxim Henry, Adrien Deliège, Anthony Cioppa, Marc Van Droogenbroeck
Abstract:
Convolutional Neural Networks (CNN) are widely used in many computer vision tasks. Yet, their increasing size and complexity pose significant challenges for efficient deployment on resource-constrained platforms. Hence, network pruning has emerged as an effective way of reducing the size and computational requirements of neural networks by removing redundant or unimportant parameters. However, a fundamental challenge with pruning consists in optimally removing redundancies without degrading performance. Most existing pruning techniques overlook structural dependencies across feature maps within a layer, resulting in suboptimal pruning decisions. In this work, we introduce LinDeps, a novel post-pruning method, i.e., a pruning method that can be applied on top of any pruning technique, which systematically identifies and removes redundant filters via linear dependency analysis. Particularly, LinDeps applies pivoted QR decomposition to feature maps to detect and prune linearly dependent filters. Then, a novel signal recovery mechanism adjusts the next layer's kernels to preserve compatibility and performance without requiring any fine-tuning. Our experiments on CIFAR-10 and ImageNet with VGG and ResNet backbones demonstrate that LinDeps improves compression rates of existing pruning techniques while preserving performances, leading to a new state of the art in CNN pruning. We also benchmark LinDeps in low-resource setups where no retraining can be performed, which shows significant pruning improvements and inference speedups over a state-of-the-art method. LinDeps therefore constitutes an essential add-on for any current or future pruning technique.
Chinese: LinDeps是一种新颖的后剪枝方法,通过线性依赖分析去除卷积神经网络中的冗余滤波器,在无需微调的情况下提升压缩率并保持性能,已在CIFAR-10和ImageNet数据集上得到验证。
English: LinDeps is a novel post-pruning method that uses linear dependency analysis to remove redundant filters in CNNs, enhancing compression rates and maintaining performance without fine-tuning, as validated on CIFAR-10 and ImageNet datasets.
Authors:Linye Wei, Jiajun Tang, Fan Fei, Boxin Shi, Runsheng Wang, Meng Li
Abstract:
3D Gaussian Splatting (3DGS) enables high-quality rendering of 3D scenes and is getting increasing adoption in domains like autonomous driving and embodied intelligence. However, 3DGS still faces major efficiency challenges when faced with high frame rate requirements and resource-constrained edge deployment. To enable efficient 3DGS, in this paper, we propose LS-Gaussian, an algorithm/hardware co-design framework for lightweight streaming 3D rendering. LS-Gaussian is motivated by the core observation that 3DGS suffers from substantial computation redundancy and stalls. On one hand, in practical scenarios, high-frame-rate 3DGS is often applied in settings where a camera observes and renders the same scene continuously but from slightly different viewpoints. Therefore, instead of rendering each frame separately, LS-Gaussian proposes a viewpoint transformation algorithm that leverages inter-frame continuity for efficient sparse rendering. On the other hand, as different tiles within an image are rendered in parallel but have imbalanced workloads, frequent hardware stalls also slow down the rendering process. LS-Gaussian predicts the workload for each tile based on viewpoint transformation to enable more balanced parallel computation and co-designs a customized 3DGS accelerator to support the workload-aware mapping in real-time. Experimental results demonstrate that LS-Gaussian achieves 5.41x speedup over the edge GPU baseline on average and up to 17.3x speedup with the customized accelerator, while incurring only minimal visual quality degradation.
中文: LS-Gaussian是一种算法与硬件协同设计的框架,通过视角变换算法和定制化加速器解决3D高斯泼溅的计算冗余与负载不均问题,在保持视觉质量的同时实现了显著加速效果。
English: LS-Gaussian is a co-designed algorithm and hardware framework that addresses 3D Gaussian Splatting's computational redundancy and workload imbalance through viewpoint transformation and a customized accelerator, achieving significant speedup with minimal quality loss.
Authors:Lakpa Tamang, Mohamed Reda Bouadjenek, Richard Dazeley, Sunil Aryal
Abstract:
In the field of Machine Learning (ML) and data-driven applications, one of the significant challenge is the change in data distribution between the training and deployment stages, commonly known as distribution shift. This paper outlines different mechanisms for handling two main types of distribution shifts: (i) Covariate shift: where the value of features or covariates change between train and test data, and (ii) Concept/Semantic-shift: where model experiences shift in the concept learned during training due to emergence of novel classes in the test phase. We sum up our contributions in three folds. First, we formalize distribution shifts, recite on how the conventional method fails to handle them adequately and urge for a model that can simultaneously perform better in all types of distribution shifts. Second, we discuss why handling distribution shifts is important and provide an extensive review of the methods and techniques that have been developed to detect, measure, and mitigate the effects of these shifts. Third, we discuss the current state of distribution shift handling mechanisms and propose future research directions in this area. Overall, we provide a retrospective synopsis of the literature in the distribution shift, focusing on OOD data that had been overlooked in the existing surveys.
中文: 本文针对机器学习中的分布偏移问题,系统阐述了其类型,综述了检测与缓解方法,并提出了提升模型鲁棒性的未来研究方向。
English: This paper addresses the challenge of distribution shift in machine learning by formalizing its types, reviewing detection and mitigation methods, and proposing future research directions to improve model robustness.
Authors:Ziren Gong, Xiaohan Li, Fabio Tosi, Youmin Zhang, Stefano Mattoccia, Jun Wu, Matteo Poggi
Abstract:
This paper presents DINO-SLAM, a DINO-informed design strategy to enhance neural implicit (Neural Radiance Field -- NeRF) and explicit representations (3D Gaussian Splatting -- 3DGS) in SLAM systems through more comprehensive scene representations. Purposely, we rely on a Scene Structure Encoder (SSE) that enriches DINO features into Enhanced DINO ones (EDINO) to capture hierarchical scene elements and their structural relationships. Building upon it, we propose two foundational paradigms for NeRF and 3DGS SLAM systems integrating EDINO features. Our DINO-informed pipelines achieve superior performance on the Replica, ScanNet, and TUM compared to state-of-the-art methods.
中文: 本文提出DINO-SLAM方法,通过场景结构编码器增强DINO特征,改进神经隐式和显式SLAM系统的场景表示能力,在多个基准数据集上实现了领先性能。
English: This paper introduces DINO-SLAM, a method that enhances neural implicit and explicit SLAM representations by integrating enriched DINO features through a Scene Structure Encoder, achieving superior performance on benchmark datasets.
Authors:Altaf Allah Abbassi, Leuson Da Silva, Amin Nikanjam, Foutse Khomh
Abstract:
Large Language Models (LLMs) for code generation evolve rapidly through fine-tuning, merging, or new model releases. However, such updates can introduce regressions, not only in correctness but also in code quality and performance. To address this, we present ReCatcher, a regression testing framework for Python code generation. ReCatcher systematically compares two LLMs, typically a current model and a candidate update, across three dimensions: logical correctness, static code quality, and execution performance. We apply ReCatcher to assess regressions across three update scenarios, fine-tuning, merging, and model release, using CodeLlama, DeepSeek-Coder, and GPT-4o. Our evaluation shows that fine-tuning with cross-language datasets increases syntax errors by up to 12%. Merging with general-purpose models like Llama2 leads to regressions in correctness by up to 18%. GPT-4o introduces regressions of up to 50% in handling missing imports compared to GPT-3.5-turbo, while GPT-4o-mini suffers up to 80% performance degradation in execution time versus GPT-4o. Overall, logical correctness, performance, and error handling (e.g., syntax errors and missing imports) are the most regression-prone areas. Comparing ReCatcher with baseline solutions, it presents better and consistent accuracy across logical and performance aspects. ReCatcher highlights the importance of systematic regression evaluation before adopting new models, while assisting researchers and practitioners in making more informed update decisions.
中文摘要:ReCatcher是一个回归测试框架,可系统评估Python代码生成模型在逻辑正确性、代码质量和执行性能三个维度的表现,揭示了微调、模型合并和新版本发布中存在的显著回归问题,其评估效果优于基准方法。
English Summary: ReCatcher is a regression testing framework that systematically evaluates Python code generation models across logical correctness, code quality, and execution performance, revealing significant regressions in fine-tuning, model merging, and new releases while outperforming baseline methods.
Authors:Zongyi Li, Samuel Lanthaler, Catherine Deng, Michael Chen, Yixuan Wang, Kamyar Azizzadenesheli, Anima Anandkumar
Abstract:
Machine learning (ML) models have emerged as a promising approach for solving partial differential equations (PDEs) in science and engineering. Previous ML models typically cannot generalize outside the training data; for example, a trained ML model for the Navier-Stokes equations only works for a fixed Reynolds number ($Re$) on a pre-defined domain. To overcome these limitations, we propose a data augmentation scheme based on scale-consistency properties of PDEs and design a scale-informed neural operator that can model a wide range of scales. Our formulation leverages the facts: (i) PDEs can be rescaled, or more concretely, a given domain can be re-scaled to unit size, and the parameters and the boundary conditions of the PDE can be appropriately adjusted to represent the original solution, and (ii) the solution operators on a given domain are consistent on the sub-domains. We leverage these facts to create a scale-consistency loss that encourages matching the solutions evaluated on a given domain and the solution obtained on its sub-domain from the rescaled PDE. Since neural operators can fit to multiple scales and resolutions, they are the natural choice for incorporating scale-consistency loss during training of neural PDE solvers. We experiment with scale-consistency loss and the scale-informed neural operator model on the Burgers' equation, Darcy Flow, Helmholtz equation, and Navier-Stokes equations. With scale-consistency, the model trained on $Re$ of 1000 can generalize to $Re$ ranging from 250 to 10000, and reduces the error by 34% on average of all datasets compared to baselines.
Chinese: 为解决传统机器学习模型在求解偏微分方程时无法泛化到训练数据之外的问题,本研究基于方程的尺度一致性特性提出了一种数据增强方案,设计了尺度感知神经算子,显著提升了模型在不同尺度上的泛化能力,平均误差降低了34%。
English: To overcome the limitations of traditional machine learning models that fail to generalize beyond their training data for solving partial differential equations, this study introduces a scale-informed neural operator using a data augmentation scheme based on scale-consistency properties, which significantly improves generalization across various scales and reduces errors by 34% on average.
Authors:Linye Wei, Shuzhang Zhong, Songqiang Xu, Runsheng Wang, Ru Huang, Meng Li
Abstract:
Large language model (LLM)-based automatic speech recognition (ASR) has recently attracted a lot of attention due to its high recognition accuracy and enhanced multi-dialect support. However, the high decoding latency of LLMs challenges the real-time ASR requirements. Although speculative decoding has been explored for better decoding efficiency, they usually ignore the key characteristics of the ASR task and achieve limited speedup. To further reduce the real-time ASR latency, in this paper, we propose a novel speculative decoding framework specialized for ASR, dubbed SpecASR. SpecASR is developed based on our core observation that ASR decoding is audio-conditioned, which results in high output alignment between small and large ASR models, even given output mismatches in intermediate decoding steps. Therefore, SpecASR features an adaptive draft sequence generation process that dynamically modifies the draft sequence length to maximize the token acceptance length. SpecASR further proposes a draft sequence recycling strategy that reuses the previously generated draft sequence to reduce the draft ASR model latency. Moreover, a two-pass sparse token tree generation algorithm is also proposed to balance the latency of draft and target ASR models. With extensive experimental results, we demonstrate SpecASR achieves 3.04x-3.79x and 1.25x-1.84x speedup over the baseline autoregressive decoding and speculative decoding, respectively, without any loss in recognition accuracy.
中文:SpecASR是一种专为自动语音识别设计的新型推测解码框架,通过动态调整草稿序列长度和复用先前序列,在不损失识别准确性的情况下实现了显著加速。
English: SpecASR is a novel speculative decoding framework designed for automatic speech recognition that dynamically adjusts draft sequence length and reuses previous sequences to achieve significant speedup without compromising accuracy.
Authors:Tan Bui, Yan Naing Tun, Thanh Phuc Nguyen, Yindu Su, Ferdian Thung, Yikun Li, Han Wei Ang, Yide Yin, Frank Liauw, Lwin Khin Shar, Eng Lieh Ouh, Ting Zhang, David Lo
Abstract:
Code reuse is common in modern software development, but it can also spread vulnerabilities when developers unknowingly copy risky code. The code fragments that preserve the logic of known vulnerabilities are known as vulnerable code clones (VCCs). Detecting those VCCs is a critical but challenging task. Existing VCC detection tools often rely on syntactic similarity or produce coarse vulnerability predictions without clear explanations, limiting their practical utility. In this paper, we propose VulCoCo, a lightweight and scalable approach that combines embedding-based retrieval with large language model (LLM) validation. Starting from a set of known vulnerable functions, we retrieve syntactically or semantically similar candidate functions from a large corpus and use an LLM to assess whether the candidates retain the vulnerability. Given that there is a lack of reproducible vulnerable code clone benchmarks, we first construct a synthetic benchmark that spans various clone types.
Our experiments on the benchmark show that VulCoCo outperforms prior state-of-the-art methods in terms of Precision@k and mean average precision (MAP). In addition, we also demonstrate VulCoCo's effectiveness in real-world projects by submitting 400 pull requests (PRs) to 284 open-source projects. Among them, 75 PRs were merged, and 15 resulted in newly published CVEs. We also provide insights to inspire future work to further improve the precision of vulnerable code clone detection.
中文摘要:VulCoCo是一种结合嵌入检索与大语言模型验证的新方法,能有效检测易受攻击的代码克隆,在基准测试和实际应用中均表现优异。
English Summary: VulCoCo is a novel approach that enhances vulnerable code clone detection by combining embedding-based retrieval with large language model validation, demonstrating superior performance in benchmarks and real-world applications.
Authors:Chenqi Lin, Kang Yang, Tianshi Xu, Ling Liang, Yufei Wang, Zhaohui Chen, Runsheng Wang, Mingyu Gao, Meng Li
Abstract:
With the wide application of machine learning (ML), privacy concerns arise with user data as they may contain sensitive information. Privacy-preserving ML (PPML) based on cryptographic primitives has emerged as a promising solution in which an ML model is directly computed on the encrypted data to provide a formal privacy guarantee. However, PPML frameworks heavily rely on the oblivious transfer (OT) primitive to compute nonlinear functions. OT mainly involves the computation of single-point correlated OT (SPCOT) and learning parity with noise (LPN) operations. As OT is still computed extensively on general-purpose CPUs, it becomes the latency bottleneck of modern PPML frameworks.
In this paper, we propose a novel OT accelerator, dubbed Ironman, to significantly increase the efficiency of OT and the overall PPML framework. We observe that SPCOT is computation-bounded, and thus propose a hardware-friendly SPCOT algorithm with a customized accelerator to improve SPCOT computation throughput. In contrast, LPN is memory-bandwidth-bounded due to irregular memory access patterns. Hence, we further leverage the near-memory processing (NMP) architecture equipped with memory-side cache and index sorting to improve effective memory bandwidth. With extensive experiments, we demonstrate Ironman achieves a 39.2-237.4 times improvement in OT throughput across different NMP configurations compared to the full-thread CPU implementation. For different PPML frameworks, Ironman demonstrates a 2.1-3.4 times reduction in end-to-end latency for both CNN and Transformer models.
中文: 本文提出的Ironman加速器通过分别优化计算密集型SPCOT操作和内存带宽受限的LPN操作,显著提升了不经意传输的吞吐效率,从而大幅降低了隐私保护机器学习框架的整体延迟。
English: The proposed Ironman accelerator enhances privacy-preserving machine learning by optimizing both computation-bound SPCOT operations and memory-bandwidth-bound LPN operations, achieving significant improvements in OT throughput and overall framework latency.
Authors:Chenqi Lin, Kang Yang, Tianshi Xu, Ling Liang, Yufei Wang, Zhaohui Chen, Runsheng Wang, Mingyu Gao, Meng Li
Abstract:
With the wide application of machine learning (ML), privacy concerns arise with user data as they may contain sensitive information. Privacy-preserving ML (PPML) based on cryptographic primitives has emerged as a promising solution in which an ML model is directly computed on the encrypted data to provide a formal privacy guarantee. However, PPML frameworks heavily rely on the oblivious transfer (OT) primitive to compute nonlinear functions. OT mainly involves the computation of single-point correlated OT (SPCOT) and learning parity with noise (LPN) operations. As OT is still computed extensively on general-purpose CPUs, it becomes the latency bottleneck of modern PPML frameworks. In this paper, we propose a novel OT accelerator, dubbed Ironman, to significantly increase the efficiency of OT and the overall PPML framework. We observe that SPCOT is computation-bounded, and thus propose a hardware-friendly SPCOT algorithm with a customized accelerator to improve SPCOT computation throughput. In contrast, LPN is memory-bandwidth-bounded due to irregular memory access patterns. Hence, we further leverage the near-memory processing (NMP) architecture equipped with memory-side cache and index sorting to improve effective memory bandwidth. With extensive experiments, we demonstrate Ironman achieves a 39.2-237.4 times improvement in OT throughput across different NMP configurations compared to the full-thread CPU implementation. For different PPML frameworks, Ironman demonstrates a 2.1-3.4 times reduction in end-to-end latency for both CNN and Transformer models.
中文: 本文提出的Ironman加速器通过分别优化计算密集型SPCOT操作和内存带宽受限的LPN操作,显著提升了不经意传输的吞吐效率,从而大幅降低了隐私保护机器学习框架的整体延迟。
English: The proposed Ironman accelerator enhances privacy-preserving machine learning by optimizing both computation-bound SPCOT operations and memory-bandwidth-bound LPN operations, achieving significant improvements in OT throughput and overall framework latency.
Authors:Jiahui Zhang, Yuelei Li, Anpei Chen, Muyu Xu, Kunhao Liu, Jianyuan Wang, Xiao-Xiao Long, Hanxue Liang, Zexiang Xu, Hao Su, Christian Theobalt, Christian Rupprecht, Andrea Vedaldi, Hanspeter Pfister, Shijian Lu, Fangneng Zhan
Abstract:
3D reconstruction and view synthesis are foundational problems in computer vision, graphics, and immersive technologies such as augmented reality (AR), virtual reality (VR), and digital twins. Traditional methods rely on computationally intensive iterative optimization in a complex chain, limiting their applicability in real-world scenarios. Recent advances in feed-forward approaches, driven by deep learning, have revolutionized this field by enabling fast and generalizable 3D reconstruction and view synthesis. This survey offers a comprehensive review of feed-forward techniques for 3D reconstruction and view synthesis, with a taxonomy according to the underlying representation architectures including point cloud, 3D Gaussian Splatting (3DGS), Neural Radiance Fields (NeRF), etc. We examine key tasks such as pose-free reconstruction, dynamic 3D reconstruction, and 3D-aware image and video synthesis, highlighting their applications in digital humans, SLAM, robotics, and beyond. In addition, we review commonly used datasets with detailed statistics, along with evaluation protocols for various downstream tasks. We conclude by discussing open research challenges and promising directions for future work, emphasizing the potential of feed-forward approaches to advance the state of the art in 3D vision.
中文: 本综述全面评述了基于前馈技术的三维重建与视图合成方法,按表示架构进行分类并探讨关键任务与应用,同时分析了当前挑战与未来研究方向。
English: This survey comprehensively reviews feed-forward techniques for 3D reconstruction and view synthesis, categorizing them by representation architectures and examining key tasks and applications, while also addressing current challenges and future directions.
Authors:Jiahui Zhang, Yuelei Li, Anpei Chen, Muyu Xu, Kunhao Liu, Jianyuan Wang, Xiao-Xiao Long, Hanxue Liang, Zexiang Xu, Hao Su, Christian Theobalt, Christian Rupprecht, Andrea Vedaldi, Kaichen Zhou, Paul Pu Liang, Shijian Lu, Fangneng Zhan
Abstract:
3D reconstruction and view synthesis are foundational problems in computer vision, graphics, and immersive technologies such as augmented reality (AR), virtual reality (VR), and digital twins. Traditional methods rely on computationally intensive iterative optimization in a complex chain, limiting their applicability in real-world scenarios. Recent advances in feed-forward approaches, driven by deep learning, have revolutionized this field by enabling fast and generalizable 3D reconstruction and view synthesis. This survey offers a comprehensive review of feed-forward techniques for 3D reconstruction and view synthesis, with a taxonomy according to the underlying representation architectures including point cloud, 3D Gaussian Splatting (3DGS), Neural Radiance Fields (NeRF), etc. We examine key tasks such as pose-free reconstruction, dynamic 3D reconstruction, and 3D-aware image and video synthesis, highlighting their applications in digital humans, SLAM, robotics, and beyond. In addition, we review commonly used datasets with detailed statistics, along with evaluation protocols for various downstream tasks. We conclude by discussing open research challenges and promising directions for future work, emphasizing the potential of feed-forward approaches to advance the state of the art in 3D vision.
中文: 本综述全面评述了基于前馈技术的三维重建与视图合成方法,按表示架构进行分类并探讨关键任务与应用,同时分析了当前挑战与未来研究方向。
English: This survey comprehensively reviews feed-forward techniques for 3D reconstruction and view synthesis, categorizing them by representation architectures and examining key tasks and applications, while also addressing current challenges and future directions.
Authors:Weizhi Zhang, Liangwei Yang, Zihe Song, Henrry Peng Zou, Ke Xu, Yuanjie Zhu, Philip S. Yu
Abstract:
Recommender systems (RecSys) are essential for online platforms, providing personalized suggestions to users within a vast sea of information. Self-supervised graph learning seeks to harness high-order collaborative filtering signals through unsupervised augmentation on the user-item bipartite graph, primarily leveraging a multi-task learning framework that includes both supervised recommendation loss and self-supervised contrastive loss. However, this separate design introduces additional graph convolution processes and creates inconsistencies in gradient directions due to disparate losses, resulting in prolonged training times and sub-optimal performance. In this study, we introduce a unified framework of Supervised Graph Contrastive Learning for recommendation (SGCL) to address these issues. SGCL uniquely combines the training of recommendation and unsupervised contrastive losses into a cohesive supervised contrastive learning loss, aligning both tasks within a single optimization direction for exceptionally fast training. Extensive experiments on three real-world datasets show that SGCL outperforms state-of-the-art methods, achieving superior accuracy and efficiency.
Chinese Summary: 本研究提出监督图对比学习(SGCL)统一框架,将推荐任务与对比学习损失结合,解决了自监督图推荐系统中训练效率低和性能受限的问题,实验证明其在精度和速度上均优于现有方法。
English Summary: This study introduces Supervised Graph Contrastive Learning (SGCL), a unified framework that integrates recommendation and contrastive learning losses to resolve training inefficiencies and performance limitations in self-supervised graph-based recommender systems, demonstrating superior accuracy and speed in experiments.
Authors:Carlos Arriaga, Gonzalo MartÃnez, Eneko Sendin, Javier Conde, Pedro Reviriego
Abstract:
The evaluation of large language models is a complex task, in which several approaches have been proposed. The most common is the use of automated benchmarks in which LLMs have to answer multiple-choice questions of different topics. However, this method has certain limitations, being the most concerning, the poor correlation with the humans. An alternative approach, is to have humans evaluate the LLMs. This poses scalability issues as there is a large and growing number of models to evaluate making it impractical (and costly) to run traditional studies based on recruiting a number of evaluators and having them rank the responses of the models. An alternative approach is the use of public arenas, such as the popular LM arena, on which any user can freely evaluate models on any question and rank the responses of two models. The results are then elaborated into a model ranking. An increasingly important aspect of LLMs is their energy consumption and, therefore, evaluating how energy awareness influences the decisions of humans in selecting a model is of interest. In this paper, we present GEA, the Generative Energy Arena, an arena that incorporates information on the energy consumption of the model in the evaluation process. Preliminary results obtained with GEA are also presented, showing that for most questions, when users are aware of the energy consumption, they favor smaller and more energy efficient models. This suggests that for most user interactions, the extra cost and energy incurred by the more complex and top-performing models do not provide an increase in the perceived quality of the responses that justifies their use.
中文: 本文介绍了GEA(生成式能源竞技场),它将能耗数据融入大语言模型评估中,初步结果表明用户在了解能耗信息后大多倾向于选择更小、更节能的模型,这说明对于多数交互场景,高性能模型带来的额外能耗成本并未带来与之匹配的感知质量提升。
English: This paper introduces GEA, a Generative Energy Arena that integrates energy consumption data into LLM evaluations, revealing that users typically prefer smaller, more energy-efficient models when aware of their environmental impact, suggesting the perceived quality of top models does not justify their higher energy costs for most interactions.
Authors:Luca Della Libera, Cem Subakan, Mirco Ravanelli
Abstract:
In speech processing pipelines, improving the quality and intelligibility of real-world recordings is crucial. While supervised regression is the primary method for speech enhancement, audio tokenization is emerging as a promising alternative for a smooth integration with other modalities. However, research on speech enhancement using discrete representations is still limited. Previous work has mainly focused on semantic tokens, which tend to discard key acoustic details such as speaker identity. Additionally, these studies typically employ non-autoregressive models, assuming conditional independence of outputs and overlooking the potential improvements offered by autoregressive modeling. To address these gaps we: 1) conduct a comprehensive study of the performance of acoustic tokens for speech enhancement, including the effect of bitrate and noise strength; 2) introduce a novel transducer-based autoregressive architecture specifically designed for this task. Experiments on VoiceBank and Libri1Mix datasets show that acoustic tokens outperform semantic tokens in terms of preserving speaker identity, and that our autoregressive approach can further improve performance. Nevertheless, we observe that discrete representations still fall short compared to continuous ones, highlighting the need for further research in this area.
中文: 本研究探索了利用声学标记进行语音增强,提出了一种新型自回归模型,在保留说话人身份方面优于语义标记,但仍落后于连续表示,凸显了该领域进一步研究的必要性。
English: This study explores speech enhancement using acoustic tokens, introducing a novel autoregressive model that outperforms semantic tokens in preserving speaker identity but still lags behind continuous representations, highlighting the need for further research.
Authors:Zihan Zhao, Pengfei Wang, Minfeng Xu, Shuangmin Chen, Shiqing Xin, Changhe Tu, Wenping Wang
Abstract:
Offset surfaces, defined as the Minkowski sum of a base surface and a rolling ball, play a crucial role in geometry processing, with applications ranging from coverage motion planning to brush modeling. While considerable progress has been made in computing constant-radius offset surfaces, computing variable-radius offset surfaces remains a challenging problem. In this paper, we present OffsetCrust, a novel framework that efficiently addresses the variable-radius offsetting problem by computing a power diagram. Let $R$ denote the radius function defined on the base surface $S$. The power diagram is constructed from contributing sites, consisting of carefully sampled base points on $S$ and their corresponding off-surface points, displaced along $R$-dependent directions. In the constant-radius case only, these displacement directions align exactly with the surface normals of $S$. Moreover, our method mitigates the misalignment issues commonly seen in crust-based approaches through a lightweight fine-tuning procedure. We validate the accuracy and efficiency of OffsetCrust through extensive experiments, and demonstrate its practical utility in applications such as reconstructing original boundary surfaces from medial axis transform (MAT) representations.
中文:OffsetCrust是一种新颖框架,通过构建基于采样基点和离面点的幂图高效计算变半径偏置曲面,解决了常见错位问题,并在几何处理中展现了实际应用价值。
English: OffsetCrust is a novel framework that efficiently computes variable-radius offset surfaces using a power diagram constructed from sampled base and off-surface points, addressing misalignment issues and demonstrating practical applications in geometry processing.
Authors:Jiaming Tian, Liyao Li, Wentao Ye, Haobo Wang, Lingxin Wang, Lihua Yu, Zujie Ren, Gang Chen, Junbo Zhao
Abstract:
Tables are fundamental in domains such as finance, healthcare, and public administration, yet real-world table tasks often involve noise, structural heterogeneity, and semantic complexity--issues underexplored in existing research that primarily targets clean academic datasets. This survey focuses on LLM-based Table Agents, which aim to automate table-centric workflows by integrating preprocessing, reasoning, and domain adaptation. We define five core competencies--C1: Table Structure Understanding, C2: Table and Query Semantic Understanding, C3: Table Retrieval and Compression, C4: Executable Reasoning with Traceability, and C5: Cross-Domain Generalization--to analyze and compare current approaches. In addition, a detailed examination of the Text-to-SQL Agent reveals a performance gap between academic benchmarks and real-world scenarios, especially for open-source models. Finally, we provide actionable insights to improve the robustness, generalization, and efficiency of LLM-based Table Agents in practical settings.
中文: 本综述聚焦基于大语言模型的表格智能体,通过定义五大核心能力分析现有方法,旨在解决实际场景中的噪声与结构异质性等难题,并为提升其实用性提供改进方向。
English: This survey explores LLM-based Table Agents designed to automate complex table workflows by addressing real-world challenges like noise and structural heterogeneity, defining five core competencies to evaluate current approaches and offering insights to enhance their practical robustness and efficiency.
Authors:Fabian Ritter-Gutierrez, Yi-Cheng Lin, Jui-Chiang Wei, Jeremy H. M. Wong, Nancy F. Chen, Hung-yi Lee
Abstract:
Evaluation of text-to-music systems is constrained by the cost and availability of collecting experts for assessment. AudioMOS 2025 Challenge track 1 is created to automatically predict music impression (MI) as well as text alignment (TA) between the prompt and the generated musical piece. This paper reports our winning system, which uses a dual-branch architecture with pre-trained MuQ and RoBERTa models as audio and text encoders. A cross-attention mechanism fuses the audio and text representations. For training, we reframe the MI and TA prediction as a classification task. To incorporate the ordinal nature of MOS scores, one-hot labels are converted to a soft distribution using a Gaussian kernel. On the official test set, a single model trained with this method achieves a system-level Spearman's Rank Correlation Coefficient (SRCC) of 0.991 for MI and 0.952 for TA, corresponding to a relative improvement of 21.21\% in MI SRCC and 31.47\% in TA SRCC over the challenge baseline.
中文: 该AudioMOS 2025挑战赛的获胜系统采用双分支架构和交叉注意力机制融合音频文本特征,通过将音乐印象与文本对齐预测重构为高斯软化标签的分类任务,显著提升了预测性能。
English: The winning system in the AudioMOS 2025 Challenge uses a dual-branch architecture with cross-attention to fuse audio and text representations, achieving significant improvements in music impression and text alignment prediction by reframing them as classification tasks with Gaussian-softened labels.
Authors:Huihui Huang, Ratnadira Widyasari, Ting Zhang, Ivana Clairine Irsan, Jieke Shi, Han Wei Ang, Frank Liauw, Eng Lieh Ouh, Lwin Khin Shar, Hong Jin Kang, David Lo
Abstract:
Issue-commit linking, which connects issues with commits that fix them, is crucial for software maintenance. Existing approaches have shown promise in automatically recovering these links. Evaluations of these techniques assess their ability to identify genuine links from plausible but false links. However, these evaluations overlook the fact that, in reality, when a repository has more commits, the presence of more plausible yet unrelated commits may interfere with the tool in differentiating the correct fix commits. To address this, we propose the Realistic Distribution Setting (RDS) and use it to construct a more realistic evaluation dataset that includes 20 open-source projects. By evaluating tools on this dataset, we observe that the performance of the state-of-the-art deep learning-based approach drops by more than half, while the traditional Information Retrieval method, VSM, outperforms it.
Inspired by these observations, we propose EasyLink, which utilizes a vector database as a modern Information Retrieval technique. To address the long-standing problem of the semantic gap between issues and commits, EasyLink leverages a large language model to rerank the commits retrieved from the database. Under our evaluation, EasyLink achieves an average Precision@1 of 75.91%, improving over the state-of-the-art by over four times. Additionally, this paper provides practical guidelines for advancing research in issue-commit link recovery.
中文: 现有问题-提交链接评估未考虑真实仓库规模中大量无关提交的干扰,为此提出的EasyLink方法结合向量数据库检索与大语言模型重排序,在评估中达到75.03%的Precision@1,性能超越现有最佳方法四倍以上。
English: Current issue-commit linking evaluations fail to account for realistic repository scale, where increased plausible but unrelated commits hinder tool performance, leading to the proposed EasyLink method that combines vector database retrieval with LLM reranking to achieve a 75.03% Precision@1, outperforming existing approaches by over fourfold.
Authors:Huihui Huang, Ratnadira Widyasari, Ting Zhang, Ivana Clairine Irsan, Jieke Shi, Han Wei Ang, Frank Liauw, Eng Lieh Ouh, Lwin Khin Shar, Hong Jin Kang, David Lo
Abstract:
Issue-commit linking, which connects issues with commits that fix them, is crucial for software maintenance. Existing approaches have shown promise in automatically recovering these links. Evaluations of these techniques assess their ability to identify genuine links from plausible but false links. However, these evaluations overlook the fact that, in reality, when a repository has more commits, the presence of more plausible yet unrelated commits may interfere with the tool in differentiating the correct fix commits. To address this, we propose the Realistic Distribution Setting (RDS) and use it to construct a more realistic evaluation dataset that includes 20 open-source projects. By evaluating tools on this dataset, we observe that the performance of the state-of-the-art deep learning-based approach drops by more than half, while the traditional Information Retrieval method, VSM, outperforms it. Inspired by these observations, we propose EasyLink, which utilizes a vector database as a modern Information Retrieval technique. To address the long-standing problem of the semantic gap between issues and commits, EasyLink leverages a large language model to rerank the commits retrieved from the database. Under our evaluation, EasyLink achieves an average Precision@1 of 75.03\%, improving over the state-of-the-art by over four times. Additionally, this paper provides practical guidelines for advancing research in issue-commit link recovery.
中文: 现有问题-提交链接评估未考虑真实仓库规模中大量无关提交的干扰,为此提出的EasyLink方法结合向量数据库检索与大语言模型重排序,在评估中达到75.03%的Precision@1,性能超越现有最佳方法四倍以上。
English: Current issue-commit linking evaluations fail to account for realistic repository scale, where increased plausible but unrelated commits hinder tool performance, leading to the proposed EasyLink method that combines vector database retrieval with LLM reranking to achieve a 75.03% Precision@1, outperforming existing approaches by over fourfold.
Authors:Zesong Yang, Bangbang Yang, Wenqi Dong, Chenxuan Cao, Liyuan Cui, Yuewen Ma, Zhaopeng Cui, Hujun Bao
Abstract:
Humans can naturally identify and mentally complete occluded objects in cluttered environments. However, imparting similar cognitive ability to robotics remains challenging even with advanced reconstruction techniques, which models scenes as undifferentiated wholes and fails to recognize complete object from partial observations. In this paper, we propose InstaScene, a new paradigm towards holistic 3D perception of complex scenes with a primary goal: decomposing arbitrary instances while ensuring complete reconstruction. To achieve precise decomposition, we develop a novel spatial contrastive learning by tracing rasterization of each instance across views, significantly enhancing semantic supervision in cluttered scenes. To overcome incompleteness from limited observations, we introduce in-situ generation that harnesses valuable observations and geometric cues, effectively guiding 3D generative models to reconstruct complete instances that seamlessly align with the real world. Experiments on scene decomposition and object completion across complex real-world and synthetic scenes demonstrate that our method achieves superior decomposition accuracy while producing geometrically faithful and visually intact objects.
中文: 本文提出InstaScene新范式,通过空间对比学习和原位生成技术,使机器人能够分解复杂场景中的被遮挡物体并实现完整重建,在真实与合成场景实验中展现出优越的分解精度与几何保真度。
English: This paper introduces InstaScene, a novel approach that enables robots to decompose complex scenes into individual objects and reconstruct them completely, overcoming challenges of occlusion and clutter through spatial contrastive learning and in-situ generation techniques.
Authors:Hao Xu, Arbind Agrahari Baniya, Sam Wells, Mohamed Reda Bouadjenek, Richard Dazeley, Sunil Aryal
Abstract:
Precise Event Spotting (PES) in sports videos requires frame-level recognition of fine-grained actions from single-camera footage. Existing PES models typically incorporate lightweight temporal modules such as Gate Shift Module (GSM) or Gate Shift Fuse (GSF) to enrich 2D CNN feature extractors with temporal context. However, these modules are limited in both temporal receptive field and spatial adaptability. We propose a Multi-Scale Attention Gate Shift Module (MSAGSM) that enhances GSM with multi-scale temporal dilations and multi-head spatial attention, enabling efficient modeling of both short- and long-term dependencies while focusing on salient regions. MSAGSM is a lightweight plug-and-play module that can be easily integrated with various 2D backbones. To further advance the field, we introduce the Table Tennis Australia (TTA) dataset-the first PES benchmark for table tennis-containing over 4800 precisely annotated events. Extensive experiments across five PES benchmarks demonstrate that MSAGSM consistently improves performance with minimal overhead, setting new state-of-the-art results.
中文: 提出的多尺度注意力门控移位模块通过融合多尺度时序扩张和空间注意力机制,有效提升了体育视频中精细动作的时序建模能力,在多个基准测试中取得最优性能,并发布了首个带精确标注的乒乓球事件检测数据集。
English: The proposed Multi-Scale Attention Gate Shift Module (MSAGSM) enhances temporal modeling in sports video analysis by incorporating multi-scale dilations and spatial attention, achieving state-of-the-art performance across benchmarks while introducing a new table tennis dataset with precise annotations.
Authors:Zhihao Li, Kun Li, Boyang Ma, Minghui Xu, Yue Zhang, Xiuzhen Cheng
Abstract:
The Model Context Protocol (MCP) has emerged as a widely adopted mechanism for connecting large language models to external tools and resources. While MCP promises seamless extensibility and rich integrations, it also introduces a substantially expanded attack surface: any plugin can inherit broad system privileges with minimal isolation or oversight. In this work, we conduct the first large-scale empirical analysis of MCP security risks. We develop an automated static analysis framework and systematically examine 2,562 real-world MCP applications spanning 23 functional categories. Our measurements reveal that network and system resource APIs dominate usage patterns, affecting 1,438 and 1,237 servers respectively, while file and memory resources are less frequent but still significant. We find that Developer Tools and API Development plugins are the most API-intensive, and that less popular plugins often contain disproportionately high-risk operations. Through concrete case studies, we demonstrate how insufficient privilege separation enables privilege escalation, misinformation propagation, and data tampering. Based on these findings, we propose a detailed taxonomy of MCP resource access, quantify security-relevant API usage, and identify open challenges for building safer MCP ecosystems, including dynamic permission models and automated trust assessment.
中文: 模型上下文协议(MCP)通过允许插件以最小监管获得广泛系统权限,显著扩大了安全风险,对2562个应用的分析显示网络和系统API使用占主导,这可能导致权限提升和数据篡改等安全问题。
English: The Model Context Protocol (MCP) significantly expands security risks by allowing plugins broad system access with minimal oversight, as demonstrated through analysis of 2,562 applications revealing prevalent network and system API usage that enables privilege escalation and data tampering.
Authors:Ruoxi Wang, Kun Li, Minghui Xu, Yue Zhang, Kaidi Xu, Chunchi Liu, Yinhao Xiao, Xiuzhen Cheng
Abstract:
Dynamic Symbolic Execution (DSE) is a key technique in program analysis, widely used in software testing, vulnerability discovery, and formal verification. In distributed AI systems, DSE plays a crucial role in identifying hard-to-detect bugs, especially those arising from complex network communication patterns. However, traditional approaches to symbolic execution are often hindered by scalability issues and inefficiencies, particularly in large-scale systems. This paper introduces LIFT (Large-language-model Integrated Functional-equivalent-IR Transformation), a novel framework that leverages Large Language Models (LLMs) to automate the optimization of Intermediate Representations (IRs) in symbolic execution. LIFT addresses the challenges of symbolic execution by providing a scalable, context-sensitive solution for IR transformation. The framework consists of two phases: IR Analysis and Optimization, where LLMs optimize time-intensive IR blocks, and Symbolic Execution and Validation, which includes benchmarking and semantic verification to ensure correctness and generalizability. Experiments on real-world binaries demonstrated significant performance improvements, including a 53.5\% reduction in execution time for bigtest and a 10.24\% reduction for random, along with reductions in IR statements, PUT instructions, and temporary variables. These results demonstrate that LLMs simplify IRs while maintaining functional correctness, enhancing symbolic execution in distributed AI systems.
中文: 本文提出的LIFT框架利用大语言模型优化符号执行中的中间表示,在分布式AI系统中显著提升了性能,不仅降低了执行时间、简化了中间表示,同时确保了功能正确性。
English: This paper introduces LIFT, a framework using Large Language Models to optimize Intermediate Representations in symbolic execution, which significantly improves performance in distributed AI systems by reducing execution time and simplifying IRs while preserving correctness.
Authors:Yifu Liu, Yin Zhu, Yingqi Gao, Zhiling Luo, Xiaoxia Li, Xiaorong Shi, Yuntao Hong, Jinyang Gao, Yu Li, Bolin Ding, Jingren Zhou
Abstract:
To leverage the advantages of LLM in addressing challenges in the Text-to-SQL task, we present XiYan-SQL, an innovative framework effectively generating and utilizing multiple SQL candidates. It consists of three components: 1) a Schema Filter module filtering and obtaining multiple relevant schemas; 2) a multi-generator ensemble approach generating multiple highquality and diverse SQL queries; 3) a selection model with a candidate reorganization strategy implemented to obtain the optimal SQL query. Specifically, for the multi-generator ensemble, we employ a multi-task fine-tuning strategy to enhance the capabilities of SQL generation models for the intrinsic alignment between SQL and text, and construct multiple generation models with distinct generation styles by fine-tuning across different SQL formats. The experimental results and comprehensive analysis demonstrate the effectiveness and robustness of our framework. Overall, XiYan-SQL achieves a new SOTA performance of 75.63% on the notable BIRD benchmark, surpassing all previous methods. It also attains SOTA performance on the Spider test set with an accuracy of 89.65%.
中文摘要:XiYan-SQL提出了一种创新框架,通过模式过滤、多生成器集成和候选重组策略来优化Text-to-SQL任务,在BIRD(75.63%)和Spider(89.65%)基准测试中均实现了最先进的性能。
English Summary: XiYan-SQL introduces a novel framework that enhances Text-to-SQL tasks by generating multiple SQL candidates through schema filtering, a multi-generator ensemble, and an optimal selection model, achieving state-of-the-art performance on both BIRD (75.63%) and Spider (89.65%) benchmarks.
Authors:Tang Qian, Yifan Zhu, Lu Chen, Xiangyu Ke, Jingwen Zhao, Tianyi Li, Yunjun Gao, Christian S. Jensen
Abstract:
Increasingly massive volumes of multi-modal data are being accumulated in many {real world} settings, including in health care and e-commerce. This development calls for effective general-purpose data management solutions for multi-modal data. Such a solution must facilitate user-friendly and accurate retrieval of any multi-modal data according to diverse application requirements. Further, such a solution must be capable of efficient and scalable retrieval.
To address this need, we present OneDB, a distributed multi-metric data similarity retrieval system. This system exploits the fact that data of diverse modalities, such as text, images, and video, can be represented as metric data. The system thus affords each data modality its own metric space with its own distance function and then uses a multi-metric model to unify multi-modal data. The system features several innovations: (i) an extended Spart SQL query interface; (ii) lightweight means of learning appropriate weights of different modalities when retrieving multi-modal data to enable accurate retrieval; (iii) smart search-space pruning strategies that improve efficiency; (iv) two-layered indexing of data to ensure load-balancing during distributed processing; and (v) end-to-end system parameter autotuning.
Experiments on three real-life datasets and two synthetic datasets offer evidence that the system is capable of state-of-the-art performance: (i) efficient and effective weight learning; (ii) retrieval accuracy improvements of 12.63\%--30.75\% over the state-of-the-art vector similarity search system at comparable efficiency; (iii) accelerated search by 2.5--5.75x over state-of-the-art single- or multi-metric solutions; (iv) demonstrated high scalability; and (v) parameter tuning that enables performance improvements of 15+%.
中文: OneDB作为一种分布式多度量数据相似性检索系统,通过为多模态数据建立独立度量空间实现统一管理,在检索精度、效率和可扩展性方面均显著优于现有先进系统。
English: OneDB is a distributed multi-metric data similarity retrieval system that unifies multi-modal data through individual metric spaces, achieving significant improvements in retrieval accuracy, efficiency, and scalability compared to state-of-the-art systems.
Authors:Yi-Cheng Lin, Jia-Hung Chen, Hung-yi Lee
Abstract:
Accurate audio quality estimation is essential for developing and evaluating audio generation, retrieval, and enhancement systems. Existing non-intrusive assessment models predict a single Mean Opinion Score (MOS) for speech, merging diverse perceptual factors and failing to generalize beyond speech. We propose MMMOS, a no-reference, multi-domain audio quality assessment system that estimates four orthogonal axes: Production Quality, Production Complexity, Content Enjoyment, and Content Usefulness across speech, music, and environmental sounds. MMMOS fuses frame-level embeddings from three pretrained encoders (WavLM, MuQ, and M2D) and evaluates three aggregation strategies with four loss functions. By ensembling the top eight models, MMMOS shows a 20-30% reduction in mean squared error and a 4-5% increase in Kendall's Ï versus baseline, gains first place in six of eight Production Complexity metrics, and ranks among the top three on 17 of 32 challenge metrics.
中文摘要:MMMOS作为一种无需参考音频的质量评估系统,通过评估四个独立感知维度并在多领域音频中应用,显著降低了预测误差并提升了评估指标排名,性能优于现有模型。
English Summary: MMMOS is a no-reference audio quality assessment system that evaluates four distinct perceptual dimensions across multiple audio domains, outperforming existing models by significantly reducing error and improving ranking correlation.
Authors:Zheng Jia, Shengbin Yue, Wei Chen, Siyuan Wang, Yidong Liu, Yun Song, Zhongyu Wei
Abstract:
The gap between static benchmarks and the dynamic nature of real-world legal practice poses a key barrier to advancing legal intelligence. To this end, we introduce J1-ENVS, the first interactive and dynamic legal environment tailored for LLM-based agents. Guided by legal experts, it comprises six representative scenarios from Chinese legal practices across three levels of environmental complexity. We further introduce J1-EVAL, a fine-grained evaluation framework, designed to assess both task performance and procedural compliance across varying levels of legal proficiency. Extensive experiments on 17 LLM agents reveal that, while many models demonstrate solid legal knowledge, they struggle with procedural execution in dynamic settings. Even the SOTA model, GPT-4o, falls short of 60% overall performance. These findings highlight persistent challenges in achieving dynamic legal intelligence and offer valuable insights to guide future research.
中文摘要:J1-ENVS交互式法律环境与J1-EVAL评估框架显示,尽管大语言模型具备扎实的法律知识,但在动态法律场景的程序执行中表现欠佳,即使顶尖模型整体效能也未达到60%。
English Summary: The J1-ENVS interactive legal environment and J1-EVAL evaluation framework reveal that while LLMs possess substantial legal knowledge, they significantly underperform in procedural execution within dynamic legal scenarios, with even top models failing to reach 60% overall effectiveness.
Authors:Xianquan Wang, Zhaocheng Du, Jieming Zhu, Chuhan Wu, Qinglin Jia, Zhenhua Dong
Abstract:
Feature interaction modeling is crucial for deep recommendation models. A common and effective approach is to construct explicit feature combinations to enhance model performance. However, in practice, only a small fraction of these combinations are truly informative. Thus it is essential to select useful feature combinations to reduce noise and manage memory consumption. While feature selection methods have been extensively studied, they are typically limited to selecting individual features. Extending these methods for high-order feature combination selection presents a significant challenge due to the exponential growth in time complexity when evaluating feature combinations one by one. In this paper, we propose $\textbf{TayFCS}$, a lightweight feature combination selection method that significantly improves model performance. Specifically, we propose the Taylor Expansion Scorer (TayScorer) module for field-wise Taylor expansion on the base model. Instead of evaluating all potential feature combinations' importance by repeatedly running experiments with feature adding and removal, this scorer only needs to approximate the importance based on their sub-components' gradients. This can be simply computed with one backward pass based on a trained recommendation model. To further reduce information redundancy among feature combinations and their sub-components, we introduce Logistic Regression Elimination (LRE), which estimates the corresponding information gain based on the model prediction performance. Experimental results on three benchmark datasets validate both the effectiveness and efficiency of our approach. Furthermore, online A/B test results demonstrate its practical applicability and commercial value.
中文: 本文提出TayFCS轻量级方法,通过泰勒展开和逻辑回归消除技术高效筛选高维特征组合,在提升推荐模型性能的同时有效降低噪声干扰与计算开销。
English: The paper introduces TayFCS, a lightweight method that uses Taylor expansion and logistic regression elimination to efficiently select informative high-order feature combinations, improving recommendation model performance while reducing noise and computational costs.
Authors:Yavuz Yarici, Kiran Kokilepersaud, Mohit Prabhushankar, Ghassan AlRegib
Abstract:
The high cost of annotating data makes self-supervised approaches, such as contrastive learning methods, appealing for Human Activity Recognition (HAR). Effective contrastive learning relies on selecting informative positive and negative samples. However, HAR sensor signals are subject to significant domain shifts caused by subject variability. These domain shifts hinder model generalization to unseen subjects by embedding subject-specific variations rather than activity-specific features. As a result, human activity recognition models trained with contrastive learning often struggle to generalize to new subjects. We introduce Subject-Invariant Contrastive Learning (SICL), a simple yet effective loss function to improve generalization in human activity recognition. SICL re-weights negative pairs drawn from the same subject to suppress subject-specific cues and emphasize activity-specific information. We evaluate our loss function on three public benchmarks: UTD-MHAD, MMAct, and DARai. We show that SICL improves performance by up to 11% over traditional contrastive learning methods. Additionally, we demonstrate the adaptability of our loss function across various settings, including multiple self-supervised methods, multimodal scenarios, and supervised learning frameworks.
Chinese: 针对人体活动识别中因受试者差异导致对比学习泛化能力受限的问题,提出的主体不变对比学习方法通过重新加权同一受试者的负样本对来抑制主体特异性特征、强化活动特异性信息,在多个基准测试中性能提升最高达11%。
English: Self-supervised contrastive learning for Human Activity Recognition faces generalization challenges due to domain shifts from subject variability, but the proposed Subject-Invariant Contrastive Learning (SICL) method addresses this by re-weighting same-subject negative pairs to enhance activity-specific feature learning, improving performance by up to 11% on benchmarks.
Authors:Weili Xu, Enxin Song, Wenhao Chai, Xuexiang Wen, Tian Ye, Gaoang Wang
Abstract:
The challenge of long video understanding lies in its high computational complexity and prohibitive memory cost, since the memory and computation required by transformer-based LLMs scale quadratically with input sequence length. We propose AuroraLong to address this challenge by replacing the LLM component in MLLMs with a linear RNN language model that handles input sequence of arbitrary length with constant-size hidden states. To further increase throughput and efficiency, we combine visual token merge with linear RNN models by reordering the visual tokens by their sizes in ascending order. Despite having only 2B parameters and being trained exclusively on public data, AuroraLong achieves performance comparable to Transformer-based models of similar size trained on private datasets across multiple video benchmarks. This demonstrates the potential of efficient, linear RNNs to democratize long video understanding by lowering its computational entry barrier. To our best knowledge, we are the first to use a linear RNN based LLM backbone in a LLaVA-like model for open-ended video understanding.
Chinese: AuroraLong采用线性RNN语言模型替代MLLMs中的LLM组件,通过恒定内存处理任意长度视频输入,仅用20亿参数和公开数据即达到与Transformer模型相媲美的性能,显著降低了长视频理解的计算门槛。
English: AuroraLong introduces a linear RNN language model to replace LLMs in MLLMs, enabling efficient long video understanding with constant memory usage and competitive performance despite minimal parameters and public data training.
Authors:Xunjian Yin, Sitao Cheng, Yuxi Xie, Xinyu Hu, Li Lin, Xinyi Wang, Liangming Pan, William Yang Wang, Xiaojun Wan
Abstract:
We introduce LEDOM, the first purely reverse language model, trained autoregressively on 435B tokens with 2B and 7B parameter variants, which processes sequences in reverse temporal order through previous token prediction. For the first time, we present the reverse language model as a potential foundational model across general tasks, accompanied by a set of intriguing examples and insights. Based on LEDOM, we further introduce a novel application: Reverse Reward, where LEDOM-guided reranking of forward language model outputs leads to substantial performance improvements on mathematical reasoning tasks. This approach leverages LEDOM's unique backward reasoning capability to refine generation quality through posterior evaluation. Our findings suggest that LEDOM exhibits unique characteristics with broad application potential. We will release all models, training code, and pre-training data to facilitate future research.
中文摘要:LEDOM是首个纯逆向语言模型,通过反向时序处理展现独特推理能力,其创新的逆向奖励应用能通过后验评估显著提升数学推理任务的生成质量。
English Summary: LEDOM is the first reverse language model trained autoregressively on 435B tokens, demonstrating unique backward reasoning capabilities that enable novel applications like Reverse Reward for improving mathematical reasoning through output reranking.
Authors:Aymen Rayane Khouas, Mohamed Reda Bouadjenek, Hakim Hacid, Sunil Aryal
Abstract:
Graph federated recommendation systems offer a privacy-preserving alternative to traditional centralized recommendation architectures, which often raise concerns about data security. While federated learning enables personalized recommendations without exposing raw user data, existing aggregation methods overlook the unique properties of user embeddings in this setting. Indeed, traditional aggregation methods fail to account for their complexity and the critical role of user similarity in recommendation effectiveness. Moreover, evolving user interactions require adaptive aggregation while preserving the influence of high-relevance anchor users (the primary users before expansion in graph-based frameworks). To address these limitations, we introduce Dist-FedAvg, a novel distance-based aggregation method designed to enhance personalization and aggregation efficiency in graph federated learning. Our method assigns higher aggregation weights to users with similar embeddings, while ensuring that anchor users retain significant influence in local updates. Empirical evaluations on multiple datasets demonstrate that Dist-FedAvg consistently outperforms baseline aggregation techniques, improving recommendation accuracy while maintaining seamless integration into existing federated learning frameworks.
中文摘要:Dist-FedAvg是一种基于距离的新型聚合方法,通过优先考虑相似用户嵌入并保留核心用户影响力,有效提升了图联邦推荐系统的个性化效果,在保持框架兼容性的同时显著优于现有聚合技术的推荐准确性。
English Summary: Dist-FedAvg is a novel distance-based aggregation method that enhances personalization in graph federated recommendation systems by prioritizing similar user embeddings and preserving anchor user influence, outperforming existing techniques in accuracy while maintaining framework compatibility.
Authors:MarÃa Grandury, Javier Aula-Blasco, Júlia Falcão, Clémentine Fourrier, Miguel González, Gonzalo MartÃnez, Gonzalo SantamarÃa, Rodrigo Agerri, Nuria Aldama, Luis Chiruzzo, Javier Conde, Helena Gómez, Marta Guerrero, Guido Ivetta, Natalia López, Flor Miriam Plaza-del-Arco, MarÃa Teresa MartÃn-Valdivia, Helena Montoro, Carmen Muñoz, Pedro Reviriego, Leire Rosado, Alejandro Vaca, MarÃa Estrella Vallecillo-RodrÃguez, Jorge Vallego, Irune Zubiaga
Abstract:
Leaderboards showcase the current capabilities and limitations of Large Language Models (LLMs). To motivate the development of LLMs that represent the linguistic and cultural diversity of the Spanish-speaking community, we present La Leaderboard, the first open-source leaderboard to evaluate generative LLMs in languages and language varieties of Spain and Latin America. La Leaderboard is a community-driven project that aims to establish an evaluation standard for everyone interested in developing LLMs for the Spanish-speaking community. This initial version combines 66 datasets in Basque, Catalan, Galician, and different Spanish varieties, showcasing the evaluation results of 50 models. To encourage community-driven development of leaderboards in other languages, we explain our methodology, including guidance on selecting the most suitable evaluation setup for each downstream task. In particular, we provide a rationale for using fewer few-shot examples than typically found in the literature, aiming to reduce environmental impact and facilitate access to reproducible results for a broader research community.
中文:La Leaderboard 是首个开源评估平台,旨在针对西班牙语地区的语言和文化多样性评估生成型大语言模型,整合了66个数据集和50个模型,以建立社区驱动的评估标准。
English: La Leaderboard is the first open-source platform designed to evaluate generative large language models across the linguistic and cultural diversity of Spanish-speaking regions, combining 66 datasets and 50 models to establish community-driven evaluation standards.
Authors:Tianyi Liu, Kejun Wu, Chen Cai, Yi Wang, Kim-Hui Yap, Lap-Pui Chau
Abstract:
Video signals are vulnerable in multimedia communication and storage systems, as even slight bitstream-domain corruption can lead to significant pixel-domain degradation. To recover faithful spatio-temporal content from corrupted inputs, bitstream-corrupted video recovery has recently emerged as a challenging and understudied task. However, existing methods require time-consuming and labor-intensive annotation of corrupted regions for each corrupted video frame, resulting in a large workload in practice. In addition, high-quality recovery remains difficult as part of the local residual information in corrupted frames may mislead feature completion and successive content recovery. In this paper, we propose the first blind bitstream-corrupted video recovery framework that integrates visual foundation models with a recovery model, which is adapted to different types of corruption and bitstream-level prompts. Within the framework, the proposed Detect Any Corruption (DAC) model leverages the rich priors of the visual foundation model while incorporating bitstream and corruption knowledge to enhance corruption localization and blind recovery. Additionally, we introduce a novel Corruption-aware Feature Completion (CFC) module, which adaptively processes residual contributions based on high-level corruption understanding. With VFM-guided hierarchical feature augmentation and high-level coordination in a mixture-of-residual-experts (MoRE) structure, our method suppresses artifacts and enhances informative residuals. Comprehensive evaluations show that the proposed method achieves outstanding performance in bitstream-corrupted video recovery without requiring a manually labeled mask sequence. The demonstrated effectiveness will help to realize improved user experience, wider application scenarios, and more reliable multimedia communication and storage systems.
中文: 本文提出了一种无需人工标注的盲比特流损坏视频恢复框架,通过结合视觉基础模型与自适应损坏检测及特征补全技术,有效抑制伪影并增强信息残留,实现了高质量的视频内容恢复。
English: This paper introduces a blind bitstream-corrupted video recovery framework that integrates visual foundation models with adaptive corruption detection and feature completion, achieving high-quality restoration without manual annotations while suppressing artifacts and enhancing residuals.
Authors:Jordan Vice, Naveed Akhtar, Yansong Gao, Richard Hartley, Ajmal Mian
Abstract:
Vision-Language Models (VLMs) are increasingly used as perceptual modules for visual content reasoning, including through captioning and DeepFake detection. In this work, we expose a critical vulnerability of VLMs when exposed to subtle, structured perturbations in the frequency domain. Specifically, we highlight how these feature transformations undermine authenticity/DeepFake detection and automated image captioning tasks. We design targeted image transformations, operating in the frequency domain to systematically adjust VLM outputs when exposed to frequency-perturbed real and synthetic images. We demonstrate that the perturbation injection method generalizes across five state-of-the-art VLMs which includes different-parameter Qwen2/2.5 and BLIP models. Experimenting across ten real and generated image datasets reveals that VLM judgments are sensitive to frequency-based cues and may not wholly align with semantic content. Crucially, we show that visually-imperceptible spatial frequency transformations expose the fragility of VLMs deployed for automated image captioning and authenticity detection tasks. Our findings under realistic, black-box constraints challenge the reliability of VLMs, underscoring the need for robust multimodal perception systems.
中文: 本研究揭示了视觉语言模型在频域扰动面前的脆弱性——通过不可见的频率变换可系统操控其图像描述和DeepFake检测结果,这对黑盒环境下部署的多模态感知系统可靠性提出了严峻挑战。
English: This study reveals that Vision-Language Models (VLMs) are vulnerable to subtle frequency-domain perturbations, which can systematically manipulate their outputs in image captioning and DeepFake detection tasks, exposing critical reliability issues under black-box conditions.
Authors:Huan-ang Gao, Jiayi Geng, Wenyue Hua, Mengkang Hu, Xinzhe Juan, Hongzhang Liu, Shilong Liu, Jiahao Qiu, Xuan Qi, Yiran Wu, Hongru Wang, Han Xiao, Yuhang Zhou, Shaokun Zhang, Jiayi Zhang, Jinyu Xiang, Yixiong Fang, Qiwen Zhao, Dongrui Liu, Qihan Ren, Cheng Qian, Zhenhailong Wang, Minda Hu, Huazheng Wang, Qingyun Wu, Heng Ji, Mengdi Wang
Abstract:
Large Language Models (LLMs) have demonstrated strong capabilities but remain fundamentally static, unable to adapt their internal parameters to novel tasks, evolving knowledge domains, or dynamic interaction contexts. As LLMs are increasingly deployed in open-ended, interactive environments, this static nature has become a critical bottleneck, necessitating agents that can adaptively reason, act, and evolve in real time. This paradigm shift -- from scaling static models to developing self-evolving agents -- has sparked growing interest in architectures and methods enabling continual learning and adaptation from data, interactions, and experiences. This survey provides the first systematic and comprehensive review of self-evolving agents, organized around three foundational dimensions -- what to evolve, when to evolve, and how to evolve. We examine evolutionary mechanisms across agent components (e.g., models, memory, tools, architecture), categorize adaptation methods by stages (e.g., intra-test-time, inter-test-time), and analyze the algorithmic and architectural designs that guide evolutionary adaptation (e.g., scalar rewards, textual feedback, single-agent and multi-agent systems). Additionally, we analyze evaluation metrics and benchmarks tailored for self-evolving agents, highlight applications in domains such as coding, education, and healthcare, and identify critical challenges and research directions in safety, scalability, and co-evolutionary dynamics. By providing a structured framework for understanding and designing self-evolving agents, this survey establishes a roadmap for advancing adaptive agentic systems in both research and real-world deployments, ultimately shedding lights to pave the way for the realization of Artificial Super Intelligence (ASI), where agents evolve autonomously, performing at or beyond human-level intelligence across a wide array of tasks.
中文: 大语言模型本质上是静态的,在动态环境中成为关键瓶颈,促使研究转向能够持续学习和适应的自进化智能体,本综述为此类智能体的发展和应用提供了系统框架。
English: Large Language Models are fundamentally static, creating a bottleneck in dynamic environments and sparking a paradigm shift toward self-evolving agents that adapt through continual learning, with this survey providing a comprehensive framework for their development and application.
Authors:Yuhan Wang, Siwei Yang, Bingchen Zhao, Letian Zhang, Qing Liu, Yuyin Zhou, Cihang Xie
Abstract:
Recent advancements in large multimodal models like GPT-4o have set a new standard for high-fidelity, instruction-guided image editing. However, the proprietary nature of these models and their training data creates a significant barrier for open-source research. To bridge this gap, we introduce GPT-IMAGE-EDIT-1.5M, a publicly available, large-scale image-editing corpus containing more than 1.5 million high-quality triplets (instruction, source image, edited image). We systematically construct this dataset by leveraging the versatile capabilities of GPT-4o to unify and refine three popular image-editing datasets: OmniEdit, HQ-Edit, and UltraEdit. Specifically, our methodology involves 1) regenerating output images to enhance visual quality and instruction alignment, and 2) selectively rewriting prompts to improve semantic clarity. To validate the efficacy of our dataset, we fine-tune advanced open-source models on GPT-IMAGE-EDIT-1.5M. The empirical results are exciting, e.g., the fine-tuned FluxKontext achieves highly competitive performance across a comprehensive suite of benchmarks, including 7.24 on GEdit-EN, 3.80 on ImgEdit-Full, and 8.78 on Complex-Edit, showing stronger instruction following and higher perceptual quality while maintaining identity. These scores markedly exceed all previously published open-source methods and substantially narrow the gap to leading proprietary models. We hope the full release of GPT-IMAGE-EDIT-1.5M can help to catalyze further open research in instruction-guided image editing.
中文: 本研究推出了GPT-IMAGE-EDIT-1.5M大规模公开数据集,通过GPT-4o整合优化现有图像编辑资源,使得微调后的开源模型(如FluxKontext)在多项基准测试中表现出色,显著缩小了与专有模型的差距。
English: The study introduces GPT-IMAGE-EDIT-1.5M, a large-scale public dataset created using GPT-4o to unify and enhance existing image-editing resources, enabling fine-tuned open-source models like FluxKontext to achieve competitive performance and close the gap with proprietary systems.
Authors:Mingfeng Zha, Tianyu Li, Guoqing Wang, Peng Wang, Yangyang Wu, Yang Yang, Heng Tao Shen
Abstract:
Audio-visual segmentation (AVS) aims to segment objects in videos based on audio cues. Existing AVS methods are primarily designed to enhance interaction efficiency but pay limited attention to modality representation discrepancies and imbalances. To overcome this, we propose the implicit counterfactual framework (ICF) to achieve unbiased cross-modal understanding. Due to the lack of semantics, heterogeneous representations may lead to erroneous matches, especially in complex scenes with ambiguous visual content or interference from multiple audio sources. We introduce the multi-granularity implicit text (MIT) involving video-, segment- and frame-level as the bridge to establish the modality-shared space, reducing modality gaps and providing prior guidance. Visual content carries more information and typically dominates, thereby marginalizing audio features in the decision-making. To mitigate knowledge preference, we propose the semantic counterfactual (SC) to learn orthogonal representations in the latent space, generating diverse counterfactual samples, thus avoiding biases introduced by complex functional designs and explicit modifications of text structures or attributes. We further formulate the collaborative distribution-aware contrastive learning (CDCL), incorporating factual-counterfactual and inter-modality contrasts to align representations, promoting cohesion and decoupling. Extensive experiments on three public datasets validate that the proposed method achieves state-of-the-art performance.
中文总结:提出的隐式反事实框架通过多粒度隐式文本作为模态共享桥梁,结合语义反事实学习和协作分布感知对比学习,有效解决了视听分割中的模态差异与不平衡问题,在三个公开数据集上实现了最优性能。
English Summary: The proposed implicit counterfactual framework (ICF) addresses modality discrepancies in audio-visual segmentation by introducing multi-granularity implicit text as a semantic bridge and semantic counterfactual learning to achieve unbiased cross-modal understanding, validated through state-of-the-art performance on three datasets.
Authors:Guangyan Chen, Meiling Wang, Te Cui, Yao Mu, Haoyang Lu, Zicai Peng, Mengxiao Hu, Tianxing Zhou, Mengyin Fu, Yi Yang, Yufeng Yue
Abstract:
Visual imitation learning (VIL) provides an efficient and intuitive strategy for robotic systems to acquire novel skills. Recent advancements in foundation models, particularly Vision Language Models (VLMs), have demonstrated remarkable capabilities in visual and linguistic reasoning for VIL tasks. Despite this progress, existing approaches primarily utilize these models for learning high-level plans from human demonstrations, relying on pre-defined motion primitives for executing physical interactions, which remains a major bottleneck for robotic systems. In this work, we present FMimic, a novel paradigm that harnesses foundation models to directly learn generalizable skills at even fine-grained action levels, using only a limited number of human videos. Extensive experiments demonstrate that our FMimic delivers strong performance with a single human video, and significantly outperforms all other methods with five videos. Furthermore, our method exhibits significant improvements of over 39% and 29% in RLBench multi-task experiments and real-world manipulation tasks, respectively, and exceeds baselines by more than 34% in high-precision tasks and 47% in long-horizon tasks.
Chinese: FMimic提出了一种利用基础模型直接从少量人类视频中学习可泛化的精细机器人技能的新范式,在多任务实验和实际应用中均表现出卓越性能。
English: FMimic introduces a novel paradigm leveraging foundation models to directly learn generalizable fine-grained robotic skills from limited human videos, achieving superior performance in multi-task experiments and real-world applications.
Authors:Yifan Zhang, Dianye Huang, Nassir Navab, Zhongliang Jiang
Abstract:
Medical ultrasound (US) imaging is widely used in clinical examinations due to its portability, real-time capability, and radiation-free nature. To address inter- and intra-operator variability, robotic ultrasound systems have gained increasing attention. However, their application in challenging intercostal imaging remains limited due to the lack of an effective scan path generation method within the constrained acoustic window. To overcome this challenge, we explore the potential of tactile cues for characterizing subcutaneous rib structures as an alternative signal for ultrasound segmentation-free bone surface point cloud extraction. Compared to 2D US images, 1D tactile-related signals offer higher processing efficiency and are less susceptible to acoustic noise and artifacts. By leveraging robotic tracking data, a sparse tactile point cloud is generated through a few scans along the rib, mimicking human palpation. To robustly map the scanning trajectory into the intercostal space, the sparse tactile bone location point cloud is first interpolated to form a denser representation. This refined point cloud is then registered to an image-based dense bone surface point cloud, enabling accurate scan path mapping for individual patients. Additionally, to ensure full coverage of the object of interest, we introduce an automated tilt angle adjustment method to visualize structures beneath the bone. To validate the proposed method, we conducted comprehensive experiments on four distinct phantoms. The final scanning waypoint mapping achieved Mean Nearest Neighbor Distance (MNND) and Hausdorff distance (HD) errors of 3.41 mm and 3.65 mm, respectively, while the reconstructed object beneath the bone had errors of 0.69 mm and 2.2 mm compared to the CT ground truth.
中文: 本研究提出一种机器人超声系统,利用触觉信号映射肋骨结构并生成肋间扫描路径,实现了精确的骨表面定位及被遮挡物体的可视化,误差显著降低。
English: This study introduces a robotic ultrasound system that uses tactile signals to map rib structures and generate precise scan paths for intercostal imaging, achieving accurate bone surface mapping and visualization of underlying objects with minimal error.
Authors:Zhi Zhou, Sirui Miao, Xiangyu Duan, Hao Yang, Min Zhang
Abstract:
Large language models (LLMs) have demonstrated remarkable abilities in various natural language processing areas, but they demand high computation resources which limits their deployment in real-world. Distillation is one technique to solve this problem through either knowledge distillation or task distillation. Both distillation approaches train small models to imitate specific features of LLMs, but they all neglect basic reading education for small models on generic texts that are \emph{unrelated} to downstream tasks. In this paper, we propose basic reading distillation (BRD) which educates a small model to imitate LLMs basic reading behaviors, such as named entity recognition, question raising and answering, on each sentence. After such basic education, we apply the small model on various tasks including language inference benchmarks and BIG-bench tasks. It shows that the small model can outperform or perform comparable to over 20x bigger LLMs. Analysis reveals that BRD effectively influences the probability distribution of the small model, and has orthogonality to either knowledge distillation or task distillation.
中文摘要:基础阅读蒸馏(BRD)通过训练小模型模仿大语言模型在通用文本上的基础阅读能力,使其在多种任务中表现优于或媲美超过自身20倍规模的大模型。
English Summary: Basic reading distillation (BRD) trains small models to mimic large language models' fundamental reading skills on generic texts, enabling them to outperform or match models over 20 times larger across various tasks.
Authors:Abdul Basit, Minghao Shao, Muhammad Haider Asif, Nouhaila Innan, Muhammad Kashif, Alberto Marchisio, Muhammad Shafique
Abstract:
The growing demand for robust quantum programming frameworks has unveiled a critical limitation: current large language model (LLM) based quantum code assistants heavily rely on remote APIs, introducing challenges related to privacy, latency, and excessive usage costs. Addressing this gap, we propose PennyCoder, a novel lightweight framework for quantum code generation, explicitly designed for local and embedded deployment to enable on-device quantum programming assistance without external API dependence. PennyCoder leverages a fine-tuned version of the LLaMA 3.1-8B model, adapted through parameter-efficient Low-Rank Adaptation (LoRA) techniques combined with domain-specific instruction tuning optimized for the specialized syntax and computational logic of quantum programming in PennyLane, including tasks in quantum machine learning and quantum reinforcement learning. Unlike prior work focused on cloud-based quantum code generation, our approach emphasizes device-native operability while maintaining high model efficacy. We rigorously evaluated PennyCoder over a comprehensive quantum programming dataset, achieving 44.3% accuracy with our fine-tuned model (compared to 33.7% for the base LLaMA 3.1-8B and 40.1% for the RAG-augmented baseline), demonstrating a significant improvement in functional correctness.
中文摘要:PennyCoder是一个基于优化版LLaMA 3.1-8B模型的轻量级量子代码生成框架,通过本地化部署解决了传统方案依赖远程API的问题,并在量子编程任务中实现了44.3%的准确率。
English Summary: PennyCoder is a lightweight framework using a fine-tuned LLaMA 3.1-8B model for local quantum code generation, eliminating API dependency while achieving 44.3% accuracy in quantum programming tasks.
Authors:Qijun Jiang, Xiaodan Shao, Rui Zhang
Abstract:
The six-dimensional movable antenna (6DMA) is a promising technology to fully exploit spatial variation in wireless channels by allowing flexible adjustment of three-dimensional (3D) positions and rotations of antennas at the transceiver. In this paper, we consider a 6DMA-equipped base station (BS) and aim to maximize the average sum logarithmic rate of all users served by the BS by jointly designing 6DMA surface positions and rotations based on statistical channel information (SCI). Different from prior works on 6DMA design which use alternating optimization to iteratively update surface positions and rotations, we propose a new sequential optimization method that first determines the optimal rotations and then identifies feasible positions to realize these rotations under practical antenna placement constraints. Simulation results show that our proposed optimization scheme significantly reduces the computational complexity of conventional alternating optimization (AO), while achieving communication performance comparable to the AO-based approach and superior to existing fixed-position/rotation antenna arrays.
中文摘要:本文针对六维可移动天线系统提出一种顺序优化方法,先确定最佳旋转角度再寻找可行位置,在显著降低计算复杂度的同时,实现了与传统交替优化方法相当且优于固定天线阵列的通信性能。
English Summary: The paper introduces a sequential optimization method for six-dimensional movable antenna (6DMA) systems that first determines optimal rotations and then feasible positions, significantly reducing computational complexity while maintaining comparable communication performance to conventional approaches.
Authors:Samuele Cornell, Christoph Boeddeker, Taejin Park, He Huang, Desh Raj, Matthew Wiesner, Yoshiki Masuyama, Xuankai Chang, Zhong-Qiu Wang, Stefano Squartini, Paola Garcia, Shinji Watanabe
Abstract:
The CHiME-7 and 8 distant speech recognition (DASR) challenges focus on multi-channel, generalizable, joint automatic speech recognition (ASR) and diarization of conversational speech. With participation from 9 teams submitting 32 diverse systems, these challenges have contributed to state-of-the-art research in the field. This paper outlines the challenges' design, evaluation metrics, datasets, and baseline systems while analyzing key trends from participant submissions. From this analysis it emerges that: 1) Most participants use end-to-end (e2e) ASR systems, whereas hybrid systems were prevalent in previous CHiME challenges. This transition is mainly due to the availability of robust large-scale pre-trained models, which lowers the data burden for e2e-ASR. 2) Despite recent advances in neural speech separation and enhancement (SSE), all teams still heavily rely on guided source separation, suggesting that current neural SSE techniques are still unable to reliably deal with complex scenarios and different recording setups. 3) All best systems employ diarization refinement via target-speaker diarization techniques. Accurate speaker counting in the first diarization pass is thus crucial to avoid compounding errors and CHiME-8 DASR participants especially focused on this part. 4) Downstream evaluation via meeting summarization can correlate weakly with transcription quality due to the remarkable effectiveness of large-language models in handling errors. On the NOTSOFAR-1 scenario, even systems with over 50\% time-constrained minimum permutation WER can perform roughly on par with the most effective ones (around 11\%). 5) Despite recent progress, accurately transcribing spontaneous speech in challenging acoustic environments remains difficult, even when using computationally intensive system ensembles.
中文:CHiME-7和8挑战赛推动了多通道语音识别与分割技术发展,显示研究转向基于预训练模型的端到端系统,但神经语音分离和在复杂环境中实现准确转录仍面临挑战。
English: The CHiME-7 and 8 challenges advanced multi-channel speech recognition and diarization, revealing a shift to end-to-end systems using pre-trained models while highlighting persistent difficulties with neural speech separation and transcription accuracy in complex environments.
Authors:Alberto Marchisio, Muhammad Shafique
Abstract:
The growing need for intelligent, adaptive, and energy-efficient autonomous systems across fields such as robotics, mobile agents (e.g., UAVs), and self-driving vehicles is driving interest in neuromorphic computing. By drawing inspiration from biological neural systems, neuromorphic approaches offer promising pathways to enhance the perception, decision-making, and responsiveness of autonomous platforms. This paper surveys recent progress in neuromorphic algorithms, specialized hardware, and cross-layer optimization strategies, with a focus on their deployment in real-world autonomous scenarios. Special attention is given to event-based dynamic vision sensors and their role in enabling fast, efficient perception. The discussion highlights new methods that improve energy efficiency, robustness, adaptability, and reliability through the integration of spiking neural networks into autonomous system architectures. We integrate perspectives from machine learning, robotics, neuroscience, and neuromorphic engineering to offer a comprehensive view of the state of the field. Finally, emerging trends and open challenges are explored, particularly in the areas of real-time decision-making, continual learning, and the development of secure, resilient autonomous systems.
中文: 神经形态计算通过模拟生物神经系统,为开发智能、自适应且节能的自主系统提供了有前景的途径,其最新进展集中在算法、硬件及事件动态视觉传感器等实际应用上,以提升感知、决策和系统可靠性。
English: Neuromorphic computing is gaining attention for its potential to create intelligent, adaptive, and energy-efficient autonomous systems by mimicking biological neural networks, with recent advances focusing on algorithms, hardware, and real-world applications like event-based vision sensors and spiking neural networks.
Authors:Alessandro Sebastiano Catinello, Giovanni Maria Farinella, Antonino Furnari
Abstract:
This work tackles the problem of Online detection of Take and Release (OTR) of an object in untrimmed egocentric videos. This task is challenging due to severe label imbalance, with temporally sparse positive annotations, and the need for precise temporal predictions. Furthermore, methods need to be computationally efficient in order to be deployed in real-world online settings. To address these challenges, we propose Mamba-OTR, a model based on the Mamba architecture. Mamba-OTR is designed to exploit temporal recurrence during inference while being trained on short video clips. To address label imbalance, our training pipeline incorporates the focal loss and a novel regularization scheme that aligns model predictions with the evaluation metric. Extensive experiments on EPIC-KITCHENS-100, the comparisons with transformer-based approach, and the evaluation of different training and test schemes demonstrate the superiority of Mamba-OTR in both accuracy and efficiency. These finding are particularly evident when evaluating full-length videos or high frame-rate sequences, even when trained on short video snippets for computational convenience. The proposed Mamba-OTR achieves a noteworthy mp-mAP of 45.48 when operating in a sliding-window fashion, and 43.35 in streaming mode, versus the 20.32 of a vanilla transformer and 25.16 of a vanilla Mamba, thus providing a strong baseline for OTR. We will publicly release the source code of Mamba-OTR to support future research.
中文: 本研究提出了基于Mamba架构的Mamba-OTR模型,用于未剪辑第一人称视频中的物体取放在线检测,通过焦点损失和新颖正则化方案有效解决了标签不平衡问题,在精度和效率上均优于现有方法。
English: This study introduces Mamba-OTR, a model based on the Mamba architecture for online detection of object take and release in untrimmed egocentric videos, effectively addressing challenges like label imbalance and computational efficiency while outperforming existing methods in accuracy and speed.
Authors:Jiaqi Han, Haotian Ye, Puheng Li, Minkai Xu, James Zou, Stefano Ermon
Abstract:
Diffusion-based generative models have become dominant generators of high-fidelity images and videos but remain limited by their computationally expensive inference procedures. Existing acceleration techniques either require extensive model retraining or compromise significantly on sample quality. This paper explores a general, training-free, and model-agnostic acceleration strategy via multi-core parallelism. Our framework views multi-core diffusion sampling as an ODE solver pipeline, where slower yet accurate solvers progressively rectify faster solvers through a theoretically justified inter-core communication mechanism. This motivates our multi-core training-free diffusion sampling accelerator, CHORDS, which is compatible with various diffusion samplers, model architectures, and modalities. Through extensive experiments, CHORDS significantly accelerates sampling across diverse large-scale image and video diffusion models, yielding up to 2.1x speedup with four cores, improving by 50% over baselines, and 2.9x speedup with eight cores, all without quality degradation. This advancement enables CHORDS to establish a solid foundation for real-time, high-fidelity diffusion generation.
Chinese: 本文提出了CHORDS,一种无需训练、模型无关的加速框架,通过多核并行技术显著提升扩散生成模型的采样速度,最高可达2.9倍加速且不损失生成质量。
English: This paper introduces CHORDS, a training-free, model-agnostic acceleration framework that uses multi-core parallelism to speed up diffusion-based generative models, achieving up to 2.9x faster sampling without compromising quality.
Authors:Antonio Finocchiaro, Alessandro Sebastiano Catinello, Michele Mazzamuto, Rosario Leonardi, Antonino Furnari, Giovanni Maria Farinella
Abstract:
Hand-object interaction detection remains an open challenge in real-time applications, where intuitive user experiences depend on fast and accurate detection of interactions with surrounding objects. We propose an efficient approach for detecting hand-objects interactions from streaming egocentric vision that operates in real time. Our approach consists of an action recognition module and an object detection module for identifying active objects upon confirmed interaction. Our Mamba model with EfficientNetV2 as backbone for action recognition achieves 38.52% p-AP on the ENIGMA-51 benchmark at 30fps, while our fine-tuned YOLOWorld reaches 85.13% AP for hand and object. We implement our models in a cascaded architecture where the action recognition and object detection modules operate sequentially. When the action recognition predicts a contact state, it activates the object detection module, which in turn performs inference on the relevant frame to detect and classify the active object.
中文摘要:该研究提出了一种实时手物交互检测方法,通过动作识别与目标检测的级联系统,分别在基准测试中实现了38.52% p-AP和85.13% AP的精度。
English Summary: The proposed method enables real-time hand-object interaction detection through a cascaded system of action recognition and object detection modules, achieving 38.52% p-AP and 85.13% AP respectively on benchmark tests.
Authors:Tianyu Song, Feng Li, Yuan Bi, Angelos Karlas, Amir Yousefi, Daniela Branzan, Zhongliang Jiang, Ulrich Eck, Nassir Navab
Abstract:
The advancement and maturity of large language models (LLMs) and robotics have unlocked vast potential for human-computer interaction, particularly in the field of robotic ultrasound. While existing research primarily focuses on either patient-robot or physician-robot interaction, the role of an intelligent virtual sonographer (IVS) bridging physician-robot-patient communication remains underexplored. This work introduces a conversational virtual agent in Extended Reality (XR) that facilitates real-time interaction between physicians, a robotic ultrasound system(RUS), and patients. The IVS agent communicates with physicians in a professional manner while offering empathetic explanations and reassurance to patients. Furthermore, it actively controls the RUS by executing physician commands and transparently relays these actions to the patient. By integrating LLM-powered dialogue with speech-to-text, text-to-speech, and robotic control, our system enhances the efficiency, clarity, and accessibility of robotic ultrasound acquisition. This work constitutes a first step toward understanding how IVS can bridge communication gaps in physician-robot-patient interaction, providing more control and therefore trust into physician-robot interaction while improving patient experience and acceptance of robotic ultrasound.
中文: 本研究通过扩展现实中的智能虚拟超声医生,利用大语言模型驱动的对话和机器人控制实现医-机-患三方实时交互,在提升机器人超声检查效率的同时改善了患者体验。
English: This research introduces an intelligent virtual sonographer in Extended Reality that bridges physician-robot-patient communication by enabling real-time interaction through LLM-powered dialogue and robotic control, enhancing both procedural efficiency and patient experience in robotic ultrasound.
Authors:Bevan Koopman, Guido Zuccon
Abstract:
Despite widespread debunking, many psychological myths remain deeply entrenched. This paper investigates whether Large Language Models (LLMs) mimic human behaviour of myth belief and explores methods to mitigate such tendencies. Using 50 popular psychological myths, we evaluate myth belief across multiple LLMs under different prompting strategies, including retrieval-augmented generation and swaying prompts. Results show that LLMs exhibit significantly lower myth belief rates than humans, though user prompting can influence responses. RAG proves effective in reducing myth belief and reveals latent debiasing potential within LLMs. Our findings contribute to the emerging field of Machine Psychology and highlight how cognitive science methods can inform the evaluation and development of LLM-based systems.
中文摘要:研究发现大型语言模型对心理学谬误的相信程度低于人类,但用户提示会影响其回答,而检索增强生成能有效减少谬误相信并揭示其潜在的去偏见能力。
English Summary: This study finds that large language models exhibit lower belief in psychological myths than humans, but their responses can be influenced by user prompts, with retrieval-augmented generation effectively reducing myth belief and revealing latent debiasing capabilities.
Authors:Antonio Finocchiaro, Giovanni Maria Farinella, Antonino Furnari
Abstract:
Calisthenics skill classification is the computer vision task of inferring the skill performed by an athlete from images, enabling automatic performance assessment and personalized analytics. Traditional methods for calisthenics skill recognition are based on pose estimation methods to determine the position of skeletal data from images, which is later fed to a classification algorithm to infer the performed skill. Despite the progress in human pose estimation algorithms, they still involve high computational costs, long inference times, and complex setups, which limit the applicability of such approaches in real-time applications or mobile devices. This work proposes a direct approach to calisthenics skill recognition, which leverages depth estimation and athlete patch retrieval to avoid the computationally expensive human pose estimation module. Using Depth Anything V2 for depth estimation and YOLOv10 for athlete localization, we segment the subject from the background rather than relying on traditional pose estimation techniques. This strategy increases efficiency, reduces inference time, and improves classification accuracy. Our approach significantly outperforms skeleton-based methods, achieving 38.3x faster inference with RGB image patches and improved classification accuracy with depth patches (0.837 vs. 0.815). Beyond these performance gains, the modular design of our pipeline allows for flexible replacement of components, enabling future enhancements and adaptation to real-world applications.
中文: 本研究提出了一种高效的自重训练动作识别方法,通过深度估计和运动员定位替代计算密集的姿态估计,实现了38.3倍的推理加速和更高准确率,同时模块化设计保持了实际应用的灵活性。
English: This study introduces an efficient calisthenics skill recognition method that replaces computationally intensive pose estimation with depth estimation and athlete localization, achieving 38.3x faster inference and higher accuracy while maintaining modular flexibility for real-world applications.
Authors:Antonio Finocchiaro, Giovanni Maria Farinella, Antonino Furnari
Abstract:
Calisthenics is a fast-growing bodyweight discipline that consists of different categories, one of which is focused on skills. Skills in calisthenics encompass both static and dynamic elements performed by athletes. The evaluation of static skills is based on their difficulty level and the duration of the hold. Automated tools able to recognize isometric skills from a video by segmenting them to estimate their duration would be desirable to assist athletes in their training and judges during competitions. Although the video understanding literature on action recognition through body pose analysis is rich, no previous work has specifically addressed the problem of calisthenics skill temporal video segmentation. This study aims to provide an initial step towards the implementation of automated tools within the field of Calisthenics. To advance knowledge in this context, we propose a dataset of video footage of static calisthenics skills performed by athletes. Each video is annotated with a temporal segmentation which determines the extent of each skill. We hence report the results of a baseline approach to address the problem of skill temporal segmentation on the proposed dataset. The results highlight the feasibility of the proposed problem, while there is still room for improvement.
中文摘要:本研究提出了一个用于从视频中自动分割静态健身技能时间段的数据库,初步验证了该方法的可行性,同时指出在识别这些等长运动方面仍有改进空间。
English Summary: This study introduces a dataset for automated temporal segmentation of static calisthenics skills from videos, demonstrating initial feasibility while acknowledging room for improvement in recognizing these isometric movements.
Authors:Munjung Kim, Marios Constantinides, Sanja Å ÄepanoviÄ, Yong-Yeol Ahn, Daniele Quercia
Abstract:
The rapid rise of AI is poised to disrupt the labor market. However, AI is not a monolith; its impact depends on both the nature of the innovation and the jobs it affects. While computational approaches are emerging, there is no consensus on how to systematically measure an innovation's disruptive potential. Here, we calculate the disruption index of 3,237 U.S. AI patents (2015-2022) and link them to job tasks to distinguish between "consolidating" AI innovations that reinforce existing structures and "disruptive" AI innovations that alter them. Our analysis reveals that consolidating AI primarily targets physical, routine, and solo tasks, common in manufacturing and construction in the Midwest and central states. By contrast, disruptive AI affects unpredictable and mental tasks, particularly in coastal science and technology sectors. Surprisingly, we also find that disruptive AI disproportionately affects areas already facing skilled labor shortages, suggesting disruptive AI technologies may accelerate change where workers are scarce rather than replacing a surplus. Ultimately, consolidating AI appears to extend current automation trends, while disruptive AI is set to transform complex mental work, with a notable exception for collaborative tasks.
中文: 人工智能创新分为巩固型(针对制造业常规体力任务)和颠覆型(改变科技领域复杂脑力工作),后者对已面临熟练劳动力短缺的地区影响尤为显著。
English: AI innovations are categorized into consolidating types that automate routine physical tasks in manufacturing and disruptive types that transform complex mental work in tech sectors, with the latter disproportionately affecting regions already experiencing skilled labor shortages.
Authors:Wenjie Tian, Xinfa Zhu, Haohe Liu, Zhixian Zhao, Zihao Chen, Chaofan Ding, Xinhan Di, Junjie Zheng, Lei Xie
Abstract:
While recent video-to-audio (V2A) models can generate realistic background audio from visual input, they largely overlook speech, an essential part of many video soundtracks. This paper proposes a new task, video-to-soundtrack (V2ST) generation, which aims to jointly produce synchronized background audio and speech within a unified framework. To tackle V2ST, we introduce DualDub, a unified framework built on a multimodal language model that integrates a multimodal encoder, a cross-modal aligner, and dual decoding heads for simultaneous background audio and speech generation. Specifically, our proposed cross-modal aligner employs causal and non-causal attention mechanisms to improve synchronization and acoustic harmony. Besides, to handle data scarcity, we design a curriculum learning strategy that progressively builds the multimodal capability. Finally, we introduce DualBench, the first benchmark for V2ST evaluation with a carefully curated test set and comprehensive metrics. Experimental results demonstrate that DualDub achieves state-of-the-art performance, generating high-quality and well-synchronized soundtracks with both speech and background audio.
中文: 本文提出DualDub统一框架,通过跨模态对齐和渐进式学习策略,实现了视频到音轨的同步背景音与语音生成,在全新基准测试中展现出最优性能。
English: This paper introduces DualDub, a unified framework for video-to-soundtrack generation that simultaneously produces synchronized background audio and speech, achieving state-of-the-art performance through cross-modal alignment and curriculum learning.
Authors:Fedor V. Fomin, Petr A. Golovach, Laure Morelle, Dimitrios M. Thilikos
Abstract:
We introduce a series of graph decompositions based on the modulator/target scheme of modification problems that enable several algorithmic applications that parametrically extend the algorithmic potential of planarity. In the core of our approach is a polynomial time algorithm for computing planar H-modulators. Given a graph class H, a planar H-modulator of a graph G is a set X \subseteq V(G) such that the ``torso'' of X is planar and all connected components of G - X belong to H. Here, the torso of X is obtained from G[X] if, for every connected component of G-X, we form a clique out of its neighborhood on G[X]. We introduce H-Planarity as the problem of deciding whether a graph G has a planar H-modulator. We prove that, if H is hereditary, CMSO-definable, and decidable in polynomial time, then H-Planarity is solvable in polynomial time. Further, we introduce two parametric extensions of H-Planarity by defining the notions of H-planar treedepth and H-planar treewidth, which generalize the concepts of elimination distance and tree decompositions to the class H. Combining this result with existing FPT algorithms for various H-modulator problems, we thereby obtain FPT algorithms parameterized by H-planar treedepth and H-planar treewidth for numerous graph classes H. By combining the well-known algorithmic properties of planar graphs and graphs of bounded treewidth, our methods for computing H-planar treedepth and H-planar treewidth lead to a variety of algorithmic applications. For instance, once we know that a given graph has bounded H-planar treedepth or bounded H-planar treewidth, we can derive additive approximation algorithms for graph coloring and polynomial-time algorithms for counting (weighted) perfect matchings. Furthermore, we design Efficient Polynomial-Time Approximation Schemes (EPTAS-es) for several problems, including Maximum Independent Set.
中文摘要:本文提出基于平面H调节器的图分解方法,扩展了平面性的算法应用,当参数化为H平面树深或树宽时,可为多种图问题提供多项式时间解和FPT算法。
English Summary: This paper introduces graph decompositions using planar H-modulators that extend planarity's algorithmic applications, enabling polynomial-time solutions for H-Planarity and FPT algorithms for various graph problems when parameterized by H-planar treedepth or treewidth.
Authors:Meihua Dang, Jiaqi Han, Minkai Xu, Kai Xu, Akash Srivastava, Stefano Ermon
Abstract:
Discrete diffusion models have emerged as a powerful paradigm for language modeling, rivaling auto-regressive models by training-time scaling. However, inference-time scaling in discrete diffusion models remains relatively under-explored. In this work, we study sampling-based approaches for achieving high-quality text generation from discrete diffusion models in reward-guided settings. We introduce a novel inference-time scaling approach based on particle Gibbs sampling for discrete diffusion models. The particle Gibbs sampling algorithm iteratively refines full diffusion trajectories using conditional Sequential Monte Carlo as its transition mechanism. This process ensures that the updated samples progressively improve and move closer to the reward-weighted target distribution. Unlike existing inference-time scaling methods, which are often limited to single diffusion trajectories, our approach leverages iterative refinement across multiple trajectories. Within this framework, we further analyze the trade-offs between four key axes for inference-time scaling under fixed compute budgets: particle Gibbs iterations, particle count, denoising steps, and reward estimation cost. Empirically, our method consistently outperforms prior inference-time strategies on reward-guided text generation tasks, achieving significant improvement in accuracy under varying compute budgets.
Chinese: 离散扩散模型虽在性能上媲美自回归模型,但推理控制研究不足,为此提出的PG-DLM算法通过轨迹级优化实现奖励引导生成,同时保持困惑度,实证表明其在多种任务和计算预算下均优于现有方法。
English: Discrete diffusion models have achieved performance comparable to autoregressive models but lack effective inference-time control, prompting the development of PG-DLM, a novel algorithm that enables trajectory-level refinement for reward-guided generation while maintaining perplexity, with empirical results showing superior performance across various tasks and compute budgets.
Authors:Meihua Dang, Jiaqi Han, Minkai Xu, Kai Xu, Akash Srivastava, Stefano Ermon
Abstract:
Discrete diffusion models have recently emerged as strong alternatives to autoregressive language models, matching their performance through large-scale training. However, inference-time control remains relatively underexplored. In this work, we study how to steer generation toward desired rewards without retraining the models. Prior methods typically resample or filter within a single denoising trajectory, optimizing rewards step-by-step without trajectory-level refinement. We introduce particle Gibbs sampling for diffusion language models (PG-DLM), a novel inference-time algorithm enabling trajectory-level refinement while preserving generation perplexity under reward optimization. PG-DLM constructs a Markov chain over full denoising trajectories and applies a conditional sequential Monte Carlo kernel to resample them. We derive theoretical guarantees for convergence, including asymptotic consistency and variance bounds. Within this framework, we further analyze trade-offs across four key axes for inference-time scaling under fixed budgets: iterations, samples, denoising steps, and reward estimation. Our analysis shows scaling iterations achieves the best reward-perplexity trade-off. Empirically, PG-DLM consistently outperforms prior methods using MDLM and LLaDA-8B as base models across a wide range of compute budgets for reward-guided generation tasks including toxicity and sentiment control as well as linguistic acceptability.
Chinese: 离散扩散模型虽在性能上媲美自回归模型,但推理控制研究不足,为此提出的PG-DLM算法通过轨迹级优化实现奖励引导生成,同时保持困惑度,实证表明其在多种任务和计算预算下均优于现有方法。
English: Discrete diffusion models have achieved performance comparable to autoregressive models but lack effective inference-time control, prompting the development of PG-DLM, a novel algorithm that enables trajectory-level refinement for reward-guided generation while maintaining perplexity, with empirical results showing superior performance across various tasks and compute budgets.
Authors:Binxu Li, Minkai Xu, Meihua Dang, Stefano Ermon
Abstract:
Diffusion models have achieved remarkable success in generating realistic and versatile images from text prompts. Inspired by the recent advancements of language models, there is an increasing interest in further improving the models by aligning with human preferences. However, we investigate alignment from a divergence minimization perspective and reveal that existing preference optimization methods are typically trapped in suboptimal mean-seeking optimization. In this paper, we introduce Divergence Minimization Preference Optimization (DMPO), a novel and principled method for aligning diffusion models by minimizing reverse KL divergence, which asymptotically enjoys the same optimization direction as original RL. We provide rigorous analysis to justify the effectiveness of DMPO and conduct comprehensive experiments to validate its empirical strength across both human evaluations and automatic metrics. Our extensive results show that diffusion models fine-tuned with DMPO can consistently outperform or match existing techniques, specifically outperforming all existing diffusion alignment baselines by at least 64.6% in PickScore across all evaluation datasets, demonstrating the method's superiority in aligning generative behavior with desired outputs. Overall, DMPO unlocks a robust and elegant pathway for preference alignment, bridging principled theory with practical performance in diffusion models.
中文: 本文提出的DMPO方法通过最小化反向KL散度来对齐扩散模型,在不同基础模型和测试集上均优于现有技术,在人类评估和自动指标中展现出卓越性能。
English: This paper introduces DMPO, a novel method that aligns diffusion models by minimizing reverse KL divergence, consistently outperforming existing techniques in human evaluations and automatic metrics across various models and test sets.
Authors:Binxu Li, Minkai Xu, Jiaqi Han, Meihua Dang, Stefano Ermon
Abstract:
Diffusion models have achieved remarkable success in generating realistic and versatile images from text prompts. Inspired by the recent advancements of language models, there is an increasing interest in further improving the models by aligning with human preferences. However, we investigate alignment from a divergence minimization perspective and reveal that existing preference optimization methods are typically trapped in suboptimal mean-seeking optimization. In this paper, we introduce Divergence Minimization Preference Optimization (DMPO), a novel and principled method for aligning diffusion models by minimizing reverse KL divergence, which asymptotically enjoys the same optimization direction as original RL. We provide rigorous analysis to justify the effectiveness of DMPO and conduct comprehensive experiments to validate its empirical strength across both human evaluations and automatic metrics. Our extensive results show that diffusion models fine-tuned with DMPO can consistently outperform or match existing techniques, specifically consistently outperforming all baseline models across different base models and test sets, achieving the best PickScore in every case, demonstrating the method's superiority in aligning generative behavior with desired outputs. Overall, DMPO unlocks a robust and elegant pathway for preference alignment, bridging principled theory with practical performance in diffusion models.
中文: 本文提出的DMPO方法通过最小化反向KL散度来对齐扩散模型,在不同基础模型和测试集上均优于现有技术,在人类评估和自动指标中展现出卓越性能。
English: This paper introduces DMPO, a novel method that aligns diffusion models by minimizing reverse KL divergence, consistently outperforming existing techniques in human evaluations and automatic metrics across various models and test sets.
Authors:Xuesong Li, Nassir Navab, Zhongliang Jiang
Abstract:
Image denoising is a fundamental task in computer vision, particularly in medical ultrasound (US) imaging, where speckle noise significantly degrades image quality. Although recent advancements in deep neural networks have led to substantial improvements in denoising for natural images, these methods cannot be directly applied to US speckle noise, as it is not purely random. Instead, US speckle arises from complex wave interference within the body microstructure, making it tissue-dependent. This dependency means that obtaining two independent noisy observations of the same scene, as required by pioneering Noise2Noise, is not feasible. Additionally, blind-spot networks also cannot handle US speckle noise due to its high spatial dependency. To address this challenge, we introduce Speckle2Self, a novel self-supervised algorithm for speckle reduction using only single noisy observations. The key insight is that applying a multi-scale perturbation (MSP) operation introduces tissue-dependent variations in the speckle pattern across different scales, while preserving the shared anatomical structure. This enables effective speckle suppression by modeling the clean image as a low-rank signal and isolating the sparse noise component. To demonstrate its effectiveness, Speckle2Self is comprehensively compared with conventional filter-based denoising algorithms and SOTA learning-based methods, using both realistic simulated US images and human carotid US images. Additionally, data from multiple US machines are employed to evaluate model generalization and adaptability to images from unseen domains. Project page: https://noseefood.github.io/us-speckle2self/
中文摘要:Speckle2Self是一种新型自监督算法,通过多尺度扰动操作从单幅噪声图像中分离组织依赖性斑点噪声与解剖结构,实现超声图像去噪。
English Summary: Speckle2Self is a self-supervised algorithm that uses multi-scale perturbation to reduce ultrasound speckle noise from single noisy observations by separating tissue-dependent noise from anatomical structures.
Authors:Matteo Tiezzi, Tommaso Apicella, Carlos Cardenas-Perez, Giovanni Fregonese, Stefano Dafarra, Pietro Morerio, Daniele Pucci, Alessio Del Bue
Abstract:
Evaluating and comparing the performance of autonomous Humanoid Robots is challenging, as success rate metrics are difficult to reproduce and fail to capture the complexity of robot movement trajectories, critical in Human-Robot Interaction and Collaboration (HRIC). To address these challenges, we propose a general evaluation framework that measures the quality of Imitation Learning (IL) methods by focusing on trajectory performance. We devise the Neural Meta Evaluator (NeME), a deep learning model trained to classify actions from robot joint trajectories. NeME serves as a meta-evaluator to compare the performance of robot control policies, enabling policy evaluation without requiring human involvement in the loop. We validate our framework on ergoCub, a humanoid robot, using teleoperation data and comparing IL methods tailored to the available platform. The experimental results indicate that our method is more aligned with the success rate obtained on the robot than baselines, offering a reproducible, systematic, and insightful means for comparing the performance of multimodal imitation learning approaches in complex HRI tasks.
Chinese: 本文提出了一种通用评估框架,通过神经元评估器(NeME)分析机器人关节轨迹来系统评估模仿学习方法,为仿人机器人在人机交互任务中的性能提供了一种可重复且深入的评估手段,替代了传统的成功率指标。
English: This paper introduces a general evaluation framework using a Neural Meta Evaluator (NeME) to systematically assess imitation learning methods by analyzing robot joint trajectories, providing a reproducible and insightful alternative to traditional success rate metrics for humanoid robot performance in HRI tasks.
Authors:Hongbao Li, Ziye Jia, Sijie He, Kun Guo, Qihui Wu
Abstract:
With the emergence of compute-intensive and delay-sensitive applications in vehicular networks, unmanned aerial vehicles (UAVs) have emerged as a promising complement for vehicular edge computing due to the high mobility and flexible deployment. However, the existing UAV-assisted offloading strategies are insufficient in coordinating heterogeneous computing resources and adapting to dynamic network conditions. Hence, this paper proposes a dual-layer UAV-assisted edge computing architecture based on partial offloading, composed of the relay capability of high-altitude UAVs and the computing support of low-altitude UAVs. The proposed architecture enables efficient integration and coordination of heterogeneous resources. A joint optimization problem is formulated to minimize the system delay and energy consumption while ensuring the task completion rate. To solve the high-dimensional decision problem, we reformulate the problem as a Markov decision process and propose a hierarchical offloading scheme based on the soft actor-critic algorithm. The method decouples global and local decisions, where the global decisions integrate offloading ratios and trajectory planning into continuous actions, while the local scheduling is handled via designing a priority-based mechanism. Simulations are conducted and demonstrate that the proposed approach outperforms several baselines in task completion rate, system efficiency, and convergence speed, showing strong robustness and applicability in dynamic vehicular environments.
中文: 本文提出了一种双层无人机辅助的边缘计算架构,利用高空无人机进行中继、低空无人机提供计算支持,并基于软演员-评论家算法设计了分层卸载方案,有效降低了系统延迟和能耗,在动态车联网环境中显著提升了任务完成率与系统鲁棒性。
English: This paper introduces a dual-layer UAV-assisted edge computing architecture that leverages high-altitude UAVs for relaying and low-altitude UAVs for computation, employing a hierarchical offloading scheme based on the soft actor-critic algorithm to minimize system delay and energy consumption while enhancing task completion rates and robustness in dynamic vehicular networks.
Authors:Sanyam Vyas, Alberto Caron, Chris Hicks, Pete Burnap, Vasilios Mavroudis
Abstract:
Deep Reinforcement Learning (DRL) systems are increasingly used in safety-critical applications, yet their security remains severely underexplored. This work investigates backdoor attacks, which implant hidden triggers that cause malicious actions only when specific inputs appear in the observation space. Existing DRL backdoor research focuses solely on training-time attacks requiring unrealistic access to the training pipeline. In contrast, we reveal critical vulnerabilities across the DRL supply chain where backdoors can be embedded with significantly reduced adversarial privileges. We introduce two novel attacks: (1) TrojanentRL, which exploits component-level flaws to implant a persistent backdoor that survives full model retraining; and (2) InfrectroRL, a post-training backdoor attack which requires no access to training, validation, nor test data. Empirical and analytical evaluations across six Atari environments show our attacks rival state-of-the-art training-time backdoor attacks while operating under much stricter adversarial constraints. We also demonstrate that InfrectroRL further evades two leading DRL backdoor defenses. These findings challenge the current research focus and highlight the urgent need for robust defenses.
中文摘要:本研究揭示了深度强化学习系统中的关键安全漏洞,提出了两种新型后门攻击方法——TrojanentRL和InfrectroRL,这些攻击在极低攻击权限下仍能实现与训练阶段攻击相当的效果,并能规避现有防御机制。
English Summary: This study exposes critical vulnerabilities in Deep Reinforcement Learning systems by introducing two novel backdoor attacks—TrojanentRL and InfrectroRL—that operate with minimal adversarial access while matching the effectiveness of existing training-time attacks and evading current defenses.
Authors:Jiaqi Han, Austin Wang, Minkai Xu, Wenda Chu, Meihua Dang, Yisong Yue, Stefano Ermon
Abstract:
Discrete diffusion models have demonstrated great promise in modeling various sequence data, ranging from human language to biological sequences. Inspired by the success of RL in language models, there is growing interest in further improving the models by alignment with a certain reward. In this work, we propose a novel preference optimization method for masked discrete diffusion models through a principled diffusion trajectory alignment. Instead of applying the reward on the final output and backpropagating the gradient to the entire discrete denoising process, we decompose the problem into a set of stepwise alignment objectives. This framework enables efficient diffusion optimization, is compatible with arbitrary reward functions, and importantly, guarantees an equivalent optimal solution under additive factorization of the trajectory reward. Experiments across multiple domains including DNA sequence design, protein inverse folding, and language modeling consistently demonstrate the superiority of our approach. Notably, it achieves an up to 12\% improvement over the most competitive RL-based baseline in terms of predicted activity on DNA sequence design, and further improves the GSM8K score from 78.6 to 80.7 on LLaDA-8B-Instruct for language modeling.
中文摘要:本研究提出一种离散扩散模型的离线偏好优化方法,通过将奖励分解为逐步对齐目标来实现轨迹对齐,在DNA序列设计、蛋白质折叠和语言建模等任务中均实现了性能突破。
English Summary: This study introduces an offline preference optimization method for discrete diffusion models that aligns trajectories by decomposing rewards into stepwise objectives, achieving superior performance across DNA sequence design, protein folding, and language modeling tasks.
Authors:Jiaqi Han, Austin Wang, Minkai Xu, Wenda Chu, Meihua Dang, Yisong Yue, Stefano Ermon
Abstract:
Discrete diffusion models have demonstrated great promise in modeling various sequence data, ranging from human language to biological sequences. Inspired by the success of RL in language models, there is growing interest in further improving the models by alignment with a certain reward. In this work, we propose an offline preference optimization method to approach trajectory alignment for discrete diffusion models. Instead of applying the reward on the final output and backpropagating the gradient to the entire denoising process, we decompose the problem into a set of stepwise alignment objectives by matching the per-step posterior. This framework enables efficient diffusion optimization, is compatible with arbitrary reward functions, and importantly, yields an equivalent optimal solution under additive factorization of the trajectory reward. Experiments across multiple domains including DNA sequence design, protein inverse folding, and language modeling consistently demonstrate the superiority of our approach. Notably, it achieves an up to 12\% improvement over the most competitive RL-based baseline in terms of predicted activity on DNA sequence design, and further improves the GSM8K score from 78.6 to 81.2 on LLaDA-8B-Instruct for language modeling.
中文摘要:本研究提出一种离散扩散模型的离线偏好优化方法,通过将奖励分解为逐步对齐目标来实现轨迹对齐,在DNA序列设计、蛋白质折叠和语言建模等任务中均实现了性能突破。
English Summary: This study introduces an offline preference optimization method for discrete diffusion models that aligns trajectories by decomposing rewards into stepwise objectives, achieving superior performance across DNA sequence design, protein folding, and language modeling tasks.
Authors:Xin You, Runze Yang, Chuyan Zhang, Zhongliang Jiang, Jie Yang, Nassir Navab
Abstract:
The temporal interpolation task for 4D medical imaging, plays a crucial role in clinical practice of respiratory motion modeling. Following the simplified linear-motion hypothesis, existing approaches adopt optical flow-based models to interpolate intermediate frames. However, realistic respiratory motions should be nonlinear and quasi-periodic with specific frequencies. Intuited by this property, we resolve the temporal interpolation task from the frequency perspective, and propose a Fourier basis-guided Diffusion model, termed FB-Diff. Specifically, due to the regular motion discipline of respiration, physiological motion priors are introduced to describe general characteristics of temporal data distributions. Then a Fourier motion operator is elaborately devised to extract Fourier bases by incorporating physiological motion priors and case-specific spectral information in the feature space of Variational Autoencoder. Well-learned Fourier bases can better simulate respiratory motions with motion patterns of specific frequencies. Conditioned on starting and ending frames, the diffusion model further leverages well-learned Fourier bases via the basis interaction operator, which promotes the temporal interpolation task in a generative manner. Extensive results demonstrate that FB-Diff achieves state-of-the-art (SOTA) perceptual performance with better temporal consistency while maintaining promising reconstruction metrics. Codes are available.
中文: 提出的FB-Diff模型采用傅里叶基引导的扩散方法,结合生理运动先验和频谱信息,在四维医学影像时间插值中实现了最先进的性能,确保了更好的时间一致性和感知质量。
English: The proposed FB-Diff model introduces a Fourier basis-guided diffusion approach that leverages physiological motion priors and spectral information to achieve state-of-the-art performance in 4D medical imaging temporal interpolation, ensuring better temporal consistency and perceptual quality.
Authors:Jiacheng Hao, Xiaoming Zhang, Wei Liu, Xiaoli Yin, Yuan Gao, Chunli Li, Ling Zhang, Le Lu, Yu Shi, Xu Han, Ke Yan
Abstract:
Focal liver lesions (FLL) are common clinical findings during physical examination. Early diagnosis and intervention of liver malignancies are crucial to improving patient survival. Although the current 3D segmentation paradigm can accurately detect lesions, it faces limitations in distinguishing between malignant and benign liver lesions, primarily due to its inability to differentiate subtle variations between different lesions. Furthermore, existing methods predominantly rely on specialized imaging modalities such as multi-phase contrast-enhanced CT and magnetic resonance imaging, whereas non-contrast CT (NCCT) is more prevalent in routine abdominal imaging. To address these limitations, we propose PLUS, a plug-and-play framework that enhances FLL analysis on NCCT images for arbitrary 3D segmentation models. In extensive experiments involving 8,651 patients, PLUS demonstrated a significant improvement with existing methods, improving the lesion-level F1 score by 5.66%, the malignant patient-level F1 score by 6.26%, and the benign patient-level F1 score by 4.03%. Our results demonstrate the potential of PLUS to improve malignant FLL screening using widely available NCCT imaging substantially.
中文: PLUS框架通过增强非对比CT图像上的肝脏局灶性病变分析,显著提高了恶性和良性病例在病灶及患者层面的分类准确率。
English: The proposed PLUS framework enhances focal liver lesion analysis on non-contrast CT images, significantly improving lesion and patient-level classification accuracy for malignant and benign cases.
Authors:Benjamin L. Davis, Mohamad Ali-Dib, Yujia Zheng, Zehao Jin, Kun Zhang, Andrea Valerio Macciò
Abstract:
The origins of the colors of Trans-Neptunian Objects (TNOs) represent a crucial unresolved question, central to understanding the history of our Solar System. Recent observational surveys have revealed correlations between the eccentricity and inclination of TNOs and their colors. This has rekindled the long-standing debate on whether these colors reflect the conditions of TNO formation or their subsequent collisional evolution. In this study, we address this question with 98.7% certainty, using a model-agnostic, data-driven approach based on causal graphs. First, as a sanity check, we demonstrate how our model can replicate the currently accepted paradigms of TNOs' dynamical history, blindly and without any orbital modeling or physics-based assumptions. In fact, our causal model (with no knowledge of the existence of Neptune) predicts the existence of an unknown perturbing body, i.e., Neptune. We then show how this model predicts, with high certainty, that the color of TNOs is the root cause of their inclination distribution, rather than the other way around. This strongly suggests that the colors of TNOs reflect an underlying dynamical property, most likely their formation location. Moreover, our causal model excludes formation scenarios that invoke substantial color modification by subsequent irradiation. We therefore conclude that the colors of TNOs are predominantly primordial.
中文: 该研究以98.7%的确定性证实,海王星外天体的颜色主要决定其轨道倾角分布,表明这些颜色具有原始性并反映了形成位置,而非后期演变结果。
English: The study establishes with 98.7% certainty that TNO colors primarily determine their orbital inclination, indicating their colors are primordial and reflect formation locations rather than later alterations.
Authors:Wenhao Shi, Zhiqiang Hu, Yi Bin, Yang Yang, See-Kiong Ng, Heng Tao Shen
Abstract:
Recent progress in large-scale reinforcement learning (RL) has notably enhanced the reasoning capabilities of large language models (LLMs), especially in mathematical domains. However, current multimodal LLMs (MLLMs) for mathematical reasoning often rely on one-to-one image-text pairs and single-solution supervision, overlooking the diversity of valid reasoning perspectives and internal reflections. In this work, we introduce MathV-DP, a novel dataset that captures multiple diverse solution trajectories for each image-question pair, fostering richer reasoning supervision. We further propose Qwen-VL-DP, a model built upon Qwen-VL, fine-tuned with supervised learning and enhanced via group relative policy optimization (GRPO), a rule-based RL approach that integrates correctness discrimination and diversity-aware reward functions. Our method emphasizes learning from varied reasoning perspectives and distinguishing between correct yet distinct solutions. Extensive experiments on the MathVista's minitest and Math-V benchmarks demonstrate that Qwen-VL-DP significantly outperforms prior base MLLMs in both accuracy and generative diversity, highlighting the importance of incorporating diverse perspectives and reflective reasoning in multimodal mathematical reasoning.
中文摘要:近期大规模强化学习的进展通过引入包含多样化解题路径的数据集MathV-DP和基于群组相对策略优化的Qwen-VL-DP模型,显著提升了多模态大语言模型的数学推理能力,在准确性和生成多样性上均超越先前模型。
English Summary: Recent advancements in large-scale reinforcement learning have improved multimodal LLMs' mathematical reasoning by introducing MathV-DP, a dataset with diverse solution trajectories, and Qwen-VL-DP, a model enhanced through group relative policy optimization that outperforms predecessors in accuracy and diversity.
Authors:Yuntian Gu, Wenrui Li, Heng Lin, Bo Zhan, Ruichen Li, Yifei Huang, Di He, Yantao Wu, Tao Xiang, Mingpu Qin, Liwei Wang, Dingshun Lv
Abstract:
The rapid development of neural quantum states (NQS) has established it as a promising framework for studying quantum many-body systems. In this work, by leveraging the cutting-edge transformer-based architectures and developing highly efficient optimization algorithms, we achieve the state-of-the-art results for the doped two-dimensional (2D) Hubbard model, arguably the minimum model for high-Tc superconductivity. Interestingly, we find different attention heads in the NQS ansatz can directly encode correlations at different scales, making it capable of capturing long-range correlations and entanglements in strongly correlated systems. With these advances, we establish the half-filled stripe in the ground state of 2D Hubbard model with the next nearest neighboring hoppings, consistent with experimental observations in cuprates. Our work establishes NQS as a powerful tool for solving challenging many-fermions systems.
中文: 本研究通过采用基于Transformer的架构和高效优化算法,在掺杂二维Hubbard模型中取得了最先进成果,揭示了注意力头如何编码多尺度关联,并证实了与铜氧化物实验一致的半填充条纹基态。
English: This study advances neural quantum states by employing transformer architectures and optimization algorithms to achieve state-of-the-art results for the doped 2D Hubbard model, revealing how attention heads encode multi-scale correlations and confirming the half-filled stripe ground state consistent with cuprate experiments.
Authors:Alberto Caron, Chris Hicks, Vasilios Mavroudis
Abstract:
In this work, we address the challenge of data-efficient exploration in reinforcement learning by examining existing principled, information-theoretic approaches to intrinsic motivation. Specifically, we focus on a class of exploration bonuses that targets epistemic uncertainty rather than the aleatoric noise inherent in the environment. We prove that these bonuses naturally signal epistemic information gains and converge to zero once the agent becomes sufficiently certain about the environment's dynamics and rewards, thereby aligning exploration with genuine knowledge gaps. Our analysis provides formal guarantees for IG-based approaches, which previously lacked theoretical grounding. To enable practical use, we also discuss tractable approximations via sparse variational Gaussian Processes, Deep Kernels and Deep Ensemble models. We then outline a general framework - Predictive Trajectory Sampling with Bayesian Exploration (PTS-BE) - which integrates model-based planning with information-theoretic bonuses to achieve sample-efficient deep exploration. We empirically demonstrate that PTS-BE substantially outperforms other baselines across a variety of environments characterized by sparse rewards and/or purely exploratory tasks.
中文摘要:本文提出了一种基于信息论的强化学习探索方法,通过针对认知不确定性的奖励机制,开发了结合模型规划与信息奖励的PTS-BE框架,在稀疏奖励环境中实现了显著优于基准方法的样本效率。
English Summary: This paper introduces a principled approach to reinforcement learning exploration using information-theoretic bonuses that target epistemic uncertainty, proposing the PTS-BE framework which integrates model-based planning with these bonuses to achieve superior sample efficiency in sparse-reward environments.
Authors:Ping Yu, Jack Lanchantin, Tianlu Wang, Weizhe Yuan, Olga Golovneva, Ilia Kulikov, Sainbayar Sukhbaatar, Jason Weston, Jing Xu
Abstract:
We propose CoT-Self-Instruct, a synthetic data generation method that instructs LLMs to first reason and plan via Chain-of-Thought (CoT) based on given seed tasks, and then generate a new synthetic example of similar quality and complexity. This is followed by a filtering step to select high-quality data using automatic metrics, which are then used for LLM training. In verifiable reasoning, our synthetic data significantly outperforms existing training datasets, such as s1k and OpenMathReasoning, when evaluated on MATH500, AMC23, AIME24, and GPQA-Diamond. For non-verifiable instruction-following tasks, our method surpasses the performance of both human and standard Self-Instruct training data on the AlpacaEval 2.0 and Arena-Hard benchmarks.
Chinese: CoT-Self-Instruct是一种基于思维链的合成数据生成方法,通过自动筛选高质量数据来训练大语言模型,在可验证推理和指令遵循任务上的表现均显著优于现有数据集。
English: CoT-Self-Instruct is a synthetic data generation method that uses Chain-of-Thought reasoning to create high-quality training examples for LLMs, significantly outperforming existing datasets in both verifiable reasoning and instruction-following tasks.
Authors:Jianghui Wang, Vinay Joshi, Saptarshi Majumder, Xu Chao, Bin Ding, Ziqiong Liu, Pratik Prabhanjan Brahma, Dong Li, Zicheng Liu, Emad Barsoum
Abstract:
The demand for AI-generated GPU kernels is rapidly growing, influenced by the need for scalable, hardware-optimized solutions in both industry and academia. As deep learning workloads grow in complexity and diversity, it is imperative to automate low-level kernel development to meet performance and productivity demands. Major cloud providers, semiconductor companies, and research institutions are now investing heavily in AI-driven code generation for GPUs, aiming to reduce manual optimization efforts while achieving near-expert performance on hardware like AMD MI300X. The Triton language, a Python-based DSL for GPU programming, has emerged as a popular target for such AI-generated kernels due to its balance of performance and ease-of-coding. In this work, we present an evaluation suite for Triton-based GPU kernels and GEAK (Generating Efficient AI-centric GPU Kernels)-a framework that leverages cutting-edge LLMs to generate performant Triton code specifically for AMD GPUs, including the AMD MI300X and MI250. GEAK leverages inference-time compute scaling to produce Triton-based GPU kernels using a reasoning loop adapted from Reflexion-style feedback mechanisms. On two evaluation benchmarks, GEAK significantly outperformed the baselines of directly prompting frontier LLMs as well as Reflexion-based generation pipelines by achieving correctness up to $63$% and execution speed up of up to $2.59$X. These results highlight the promise of GEAK-like agentic code generation for accelerating the adoption of diverse hardware platforms and democratizing access to expert-level kernel performance.
中文: 随着对AI生成GPU内核的需求增长,本研究提出了GEAK框架,利用先进大语言模型为AMD GPU生成高效的Triton代码,在正确性和执行速度上显著优于基线方法。
English: The demand for AI-generated GPU kernels is rising, and this work introduces GEAK, a framework using advanced LLMs to create efficient Triton code for AMD GPUs, which significantly outperforms baseline methods in correctness and speed.
Authors:Yicheng Guo, Chengkai Xu, Jiaqi Liu, Hao Zhang, Peng Hang, Jian Sun
Abstract:
Scientific testing techniques are essential for ensuring the safe operation of autonomous vehicles (AVs), with high-risk, highly interactive scenarios being a primary focus. To address the limitations of existing testing methods, such as their heavy reliance on high-quality test data, weak interaction capabilities, and low adversarial robustness, this paper proposes ExamPPO, an interactive adversarial testing framework that enables scenario-adaptive and intensity-controllable evaluation of autonomous vehicles. The framework models the Surrounding Vehicle (SV) as an intelligent examiner, equipped with a multi-head attention-enhanced policy network, enabling context-sensitive and sustained behavioral interventions. A scalar confrontation factor is introduced to modulate the intensity of adversarial behaviors, allowing continuous, fine-grained adjustment of test difficulty. Coupled with structured evaluation metrics, ExamPPO systematically probes AV's robustness across diverse scenarios and strategies. Extensive experiments across multiple scenarios and AV strategies demonstrate that ExamPPO can effectively modulate adversarial behavior, expose decision-making weaknesses in tested AVs, and generalize across heterogeneous environments, thereby offering a unified and reproducible solution for evaluating the safety and intelligence of autonomous decision-making systems.
中文摘要:本文提出ExamPPO交互式对抗测试框架,通过将周围车辆建模为智能考官并引入对抗因子调节测试强度,实现了自动驾驶汽车的场景自适应评估,能有效暴露决策缺陷并具备跨环境泛化能力。
English Summary: This paper introduces ExamPPO, an interactive adversarial testing framework that enables scenario-adaptive and intensity-controllable evaluation of autonomous vehicles by modeling surrounding vehicles as intelligent examiners, effectively exposing decision-making weaknesses while generalizing across diverse environments.
Authors:Ruiyang Hao, Haibao Yu, Jiaru Zhong, Chuanye Wang, Jiahao Wang, Yiming Kan, Wenxian Yang, Siqi Fan, Huilin Yin, Jianing Qiu, Yao Mu, Jiankai Sun, Li Chen, Walter Zimmer, Dandan Zhang, Shanghang Zhang, Mac Schwager, Ping Luo, Zaiqing Nie
Abstract:
With the rapid advancement of autonomous driving technology, vehicle-to-everything (V2X) communication has emerged as a key enabler for extending perception range and enhancing driving safety by providing visibility beyond the line of sight. However, integrating multi-source sensor data from both ego-vehicles and infrastructure under real-world constraints, such as limited communication bandwidth and dynamic environments, presents significant technical challenges. To facilitate research in this area, we organized the End-to-End Autonomous Driving through V2X Cooperation Challenge, which features two tracks: cooperative temporal perception and cooperative end-to-end planning. Built on the UniV2X framework and the V2X-Seq-SPD dataset, the challenge attracted participation from over 30 teams worldwide and established a unified benchmark for evaluating cooperative driving systems. This paper describes the design and outcomes of the challenge, highlights key research problems including bandwidth-aware fusion, robust multi-agent planning, and heterogeneous sensor integration, and analyzes emerging technical trends among top-performing solutions. By addressing practical constraints in communication and data fusion, the challenge contributes to the development of scalable and reliable V2X-cooperative autonomous driving systems.
中文: V2X合作挑战赛通过建立统一基准并聚焦带宽感知融合、多智能体规划等关键技术,推动自动驾驶系统在通信与数据融合的实际限制下向可扩展和可靠方向发展。
English: The V2X Cooperation Challenge advances autonomous driving by addressing real-world constraints like bandwidth limitations and dynamic environments through standardized benchmarks and innovative solutions in data fusion and multi-agent planning.
Authors:Shuo Zhang, Xiao Li, Jiayi Wu, Fan Yang, Xiang Li, Ming Gao
Abstract:
Session-based recommendation (SBR) is mainly based on anonymous user interaction sequences to recommend the items that the next user is most likely to click. Currently, the most popular and high-performing SBR methods primarily leverage graph neural networks (GNNs), which model session sequences as graph-structured data to effectively capture user intent. However, most GNNs-based SBR methods primarily focus on modeling the ID sequence information of session sequences, while neglecting the rich semantic information embedded within them. This limitation significantly hampers model's ability to accurately infer users' true intention. To address above challenge, this paper proposes a novel SBR approach called Integrating LLM-Derived Multi-Semantic Intent into Graph Model for Session-based Recommendation (LLM-DMsRec). The method utilizes a pre-trained GNN model to select the top-k items as candidate item sets and designs prompts along with a large language model (LLM) to infer multi-semantic intents from these candidate items. Specifically, we propose an alignment mechanism that effectively integrates the semantic intent inferred by the LLM with the structural intent captured by GNNs. Extensive experiments conducted on the Beauty and ML-1M datasets demonstrate that the proposed method can be seamlessly integrated into GNNs framework, significantly enhancing its recommendation performance.
中文摘要:本文提出LLM-DMsRec新方法,通过将大型语言模型推断的语义意图与图神经网络的结构模式相融合,有效提升基于会话推荐的用户意图理解能力和推荐性能。
English Summary: This paper introduces LLM-DMsRec, a novel session-based recommendation method that enhances graph neural networks by integrating large language model-derived semantic intents with structural patterns to better capture user intentions and improve recommendation accuracy.
Authors:Xin Hong, Longchao Da, Hua Wei
Abstract:
Urban flooding in arid regions poses severe risks to infrastructure and communities. Accurate, fine-scale mapping of flood extents and recovery trajectories is therefore essential for improving emergency response and resilience planning. However, arid environments often exhibit limited spectral contrast between water and adjacent surfaces, rapid hydrological dynamics, and highly heterogeneous urban land covers, which challenge traditional flood-mapping approaches. High-resolution, daily PlanetScope imagery provides the temporal and spatial detail needed. In this work, we introduce FM-LC, a hierarchical framework for Flood Mapping by Land Cover identification, for this challenging task. Through a three-stage process, it first uses an initial multi-class U-Net to segment imagery into water, vegetation, built area, and bare ground classes. We identify that this method has confusion between spectrally similar categories (e.g., water vs. vegetation). Second, by early checking, the class with the major misclassified area is flagged, and a lightweight binary expert segmentation model is trained to distinguish the flagged class from the rest. Third, a Bayesian smoothing step refines boundaries and removes spurious noise by leveraging nearby pixel information. We validate the framework on the April 2024 Dubai storm event, using pre- and post-rainfall PlanetScope composites. Experimental results demonstrate average F1-score improvements of up to 29% across all land-cover classes and notably sharper flood delineations, significantly outperforming conventional single-stage U-Net baselines.
中文: FM-LC框架通过多类分割与针对性二元模型及贝叶斯平滑相结合,显著提升了干旱城市区域的洪水测绘精度,在2024年迪拜暴雨事件验证中比传统方法F1分数提高29%。
English: The FM-LC framework enhances flood mapping in arid urban areas by combining multi-class segmentation with targeted binary models and Bayesian smoothing, achieving a 29% F1-score improvement over traditional methods in the 2024 Dubai storm analysis.
Authors:Suhwan Cho, Minhyeok Lee, Jungho Lee, Donghyeong Kim, Sangyoun Lee
Abstract:
Unsupervised video object segmentation (VOS) aims to detect the most prominent object in a video. Recently, two-stream approaches that leverage both RGB images and optical flow have gained significant attention, but their performance is fundamentally constrained by the scarcity of training data. To address this, we propose DepthFlow, a novel data generation method that synthesizes optical flow from single images. Our approach is driven by the key insight that VOS models depend more on structural information embedded in flow maps than on their geometric accuracy, and that this structure is highly correlated with depth. We first estimate a depth map from a source image and then convert it into a synthetic flow field that preserves essential structural cues. This process enables the transformation of large-scale image-mask pairs into image-flow-mask training pairs, dramatically expanding the data available for network training. By training a simple encoder-decoder architecture with our synthesized data, we achieve new state-of-the-art performance on all public VOS benchmarks, demonstrating a scalable and effective solution to the data scarcity problem.
中文摘要:DepthFlow通过利用深度信息从单张图像合成光流,解决了无监督视频对象分割中训练数据稀缺的问题,并在所有公开基准测试中实现了最先进的性能。
English Summary: DepthFlow addresses the scarcity of training data for unsupervised video object segmentation by synthesizing optical flow from single images using depth information, achieving state-of-the-art performance on benchmarks.
Authors:Suhwan Cho, Minhyeok Lee, Jungho Lee, Sunghun Yang, Sangyoun Lee
Abstract:
Video salient object detection (SOD) relies on motion cues to distinguish salient objects from backgrounds, but training such models is limited by scarce video datasets compared to abundant image datasets. Existing approaches that use spatial transformations to create video sequences from static images fail for motion-guided tasks, as these transformations produce unrealistic optical flows that lack semantic understanding of motion. We present TransFlow, which transfers motion knowledge from pre-trained video diffusion models to generate realistic training data for video SOD. Video diffusion models have learned rich semantic motion priors from large-scale video data, understanding how different objects naturally move in real scenes. TransFlow leverages this knowledge to generate semantically-aware optical flows from static images, where objects exhibit natural motion patterns while preserving spatial boundaries and temporal coherence. Our method achieves improved performance across multiple benchmarks, demonstrating effective motion knowledge transfer.
中文摘要:TransFlow通过从预训练视频扩散模型中迁移运动知识,生成具有语义感知光流的逼真训练数据,从而在多个基准测试中提升了视频显著性目标检测的性能。
English Summary: TransFlow transfers motion knowledge from pre-trained video diffusion models to generate realistic training data with semantically-aware optical flows, improving video salient object detection performance across multiple benchmarks.
Authors:Sijin Yu, Zijiao Chen, Wenxuan Wu, Shengxian Chen, Zhongliang Liu, Jingxin Nie, Xiaofen Xing, Xiangmin Xu, Xin Zhang
Abstract:
Reconstructing visual stimuli from human brain activity (e.g., fMRI) bridges neuroscience and computer vision by decoding neural representations. However, existing methods often overlook critical brain structure-function relationships, flattening spatial information and neglecting individual anatomical variations. To address these issues, we propose (1) a novel sphere tokenizer that explicitly models fMRI signals as spatially coherent 2D spherical data on the cortical surface; (2) integration of structural MRI (sMRI) data, enabling personalized encoding of individual anatomical variations; and (3) a positive-sample mixup strategy for efficiently leveraging multiple fMRI scans associated with the same visual stimulus. Collectively, these innovations enhance reconstruction accuracy, biological interpretability, and generalizability across individuals. Experiments demonstrate superior reconstruction performance compared to SOTA methods, highlighting the effectiveness and interpretability of our biologically informed approach.
中文摘要:本研究提出了一种生物学启发的方法,通过将fMRI信号建模于皮层表面的球面数据、整合个体解剖结构差异,并采用创新的训练策略,显著提升了从大脑活动重建视觉刺激的准确性和可解释性。
English Summary: This study introduces a biologically informed approach that enhances visual stimulus reconstruction from fMRI data by modeling brain activity on the cortical surface, integrating anatomical variations, and employing a novel training strategy to improve accuracy and interpretability.
Authors:Yue Lin, Xiaoxuan Zhang, Yang Liu, Dong Wang, Huchuan Lu
Abstract:
Like humans who rely on landmarks for orientation, autonomous robots depend on feature-rich environments for accurate localization. In this paper, we propose the GFM-Planner, a perception-aware trajectory planning framework based on the geometric feature metric, which enhances LiDAR localization accuracy by guiding the robot to avoid degraded areas. First, we derive the Geometric Feature Metric (GFM) from the fundamental LiDAR localization problem. Next, we design a 2D grid-based Metric Encoding Map (MEM) to efficiently store GFM values across the environment. A constant-time decoding algorithm is further proposed to retrieve GFM values for arbitrary poses from the MEM. Finally, we develop a perception-aware trajectory planning algorithm that improves LiDAR localization capabilities by guiding the robot in selecting trajectories through feature-rich areas. Both simulation and real-world experiments demonstrate that our approach enables the robot to actively select trajectories that significantly enhance LiDAR localization accuracy.
中文: GFM-Planner是一种基于几何特征度量的感知感知轨迹规划框架,通过引导机器人避开特征退化区域来增强LiDAR定位精度,仿真和实际实验均证明该方法能显著提升定位性能。
English: The GFM-Planner is a perception-aware trajectory planning framework that improves LiDAR localization by guiding robots through feature-rich areas using a geometric feature metric, with both simulations and real-world tests confirming its effectiveness in boosting localization accuracy.
Authors:Shahriar Golchin, Yanfei Chen, Rujun Han, Manan Gandhi, Tianli Yu, Swaroop Mishra, Mihai Surdeanu, Rishabh Agarwal, Chen-Yu Lee, Tomas Pfister
Abstract:
Long-context large language models (LLMs) are able to process inputs containing up to several million tokens. In the scope of in-context learning (ICL), this translates into using hundreds/thousands of demonstrations in the input prompt, enabling many-shot ICL. In practice, a fixed set of demonstrations is often selected at random in many-shot settings due to (1) high inference costs, (2) the benefits of caching and reusing computations, and (3) the similar performance offered by this strategy compared to others when scaled. In this work, we propose two straightforward strategies for demonstration selection in many-shot ICL that improve performance with minimal computational overhead. Our first method combines a small number of demonstrations, selected based on their similarity to each test sample, with a disproportionately larger set of random demonstrations that are cached. The second strategy improves the first by replacing random demonstrations with those selected using centroids derived from test sample representations via k-means clustering. Our experiments with Gemini Pro and Flash across several datasets indicate that our strategies consistently outperform random selection and surpass or match the most performant selection approach while supporting caching and reducing inference cost by up to an order of magnitude. We also show that adjusting the proportion of demonstrations selected based on different criteria can balance performance and inference cost in many-shot ICL.
中文: 本研究提出了两种针对多示例上下文学习的高效演示选择策略,通过结合基于相似性和缓存随机或聚类中心点的演示,在最小计算开销下提升性能,显著降低推理成本的同时达到或超越现有方法的效果。
English: This study introduces two efficient demonstration selection strategies for many-shot in-context learning that enhance performance with minimal computational overhead by combining similarity-based and cached random or centroid-based demonstrations, significantly reducing inference costs while matching or surpassing existing methods.
Authors:Rujun Han, Yanfei Chen, Zoey CuiZhu, Lesly Miculicich, Guan Sun, Yuanjun Bi, Weiming Wen, Hui Wan, Chunfeng Wen, Solène Maître, George Lee, Vishy Tirumalashetty, Emily Xue, Zizhao Zhang, Salem Haykal, Burak Gokturk, Tomas Pfister, Chen-Yu Lee
Abstract:
Deep research agents, powered by Large Language Models (LLMs), are rapidly advancing; yet, their performance often plateaus when generating complex, long-form research reports using generic test-time scaling algorithms. Drawing inspiration from the iterative nature of human research, which involves cycles of searching, reasoning, and revision, we propose the Test-Time Diffusion Deep Researcher (TTD-DR). This novel framework conceptualizes research report generation as a diffusion process. TTD-DR initiates this process with a preliminary draft, an updatable skeleton that serves as an evolving foundation to guide the research direction. The draft is then iteratively refined through a "denoising" process, which is dynamically informed by a retrieval mechanism that incorporates external information at each step. The core process is further enhanced by a self-evolutionary algorithm applied to each component of the agentic workflow, ensuring the generation of high-quality context for the diffusion process. This draft-centric design makes the report writing process more timely and coherent while reducing information loss during the iterative search process. We demonstrate that our TTD-DR achieves state-of-the-art results on a wide array of benchmarks that require intensive search and multi-hop reasoning, significantly outperforming existing deep research agents.
中文摘要:提出的测试时扩散深度研究框架将研究报告生成视为迭代扩散过程,通过动态检索和自我进化算法持续优化草稿,在需要深度搜索和多步推理的基准测试中实现了最先进的性能。
English Summary: The proposed Test-Time Diffusion Deep Researcher (TTD-DR) framework enhances long-form research report generation by treating it as an iterative diffusion process, where an evolving draft is refined through dynamic retrieval and self-evolutionary algorithms, achieving state-of-the-art performance on complex benchmarks.
Authors:Enes Sanli, Baris Sarper Tezcan, Aykut Erdem, Erkut Erdem
Abstract:
Recent progress in text-to-video (T2V) generation has enabled the synthesis of visually compelling and temporally coherent videos from natural language. However, these models often fall short in basic physical commonsense, producing outputs that violate intuitive expectations around causality, object behavior, and tool use. Addressing this gap, we present PhysVidBench, a benchmark designed to evaluate the physical reasoning capabilities of T2V systems. The benchmark includes 383 carefully curated prompts, emphasizing tool use, material properties, and procedural interactions, and domains where physical plausibility is crucial. For each prompt, we generate videos using diverse state-of-the-art models and adopt a three-stage evaluation pipeline: (1) formulate grounded physics questions from the prompt, (2) caption the generated video with a vision-language model, and (3) task a language model to answer several physics-involved questions using only the caption. This indirect strategy circumvents common hallucination issues in direct video-based evaluation. By highlighting affordances and tool-mediated actions, areas overlooked in current T2V evaluations, PhysVidBench provides a structured, interpretable framework for assessing physical commonsense in generative video models.
Chinese: 针对文本到视频模型在物理常识方面的不足,PhysVidBench基准通过383个提示和三级评估流程,为生成视频的物理合理性提供了结构化评估框架。
English: Recent text-to-video models struggle with physical commonsense, so PhysVidBench was created as a benchmark with 383 prompts to evaluate these systems through a three-stage pipeline that avoids direct video hallucination issues.
Authors:Yiqi Zhao, Xinyi Yu, Bardh Hoxha, Georgios Fainekos, Jyotirmoy V. Deshmukh, Lars Lindemann
Abstract:
Multi-agent systems (MASs) consisting of a number of autonomous agents that communicate, coordinate, and jointly sense the environment to achieve complex missions can be found in a variety of applications such as robotics, smart cities, and internet-of-things applications. Modeling and monitoring MAS requirements to guarantee overall mission objectives, safety, and reliability is an important problem. Such requirements implicitly require reasoning about diverse sensing and communication modalities between agents, analysis of the dependencies between agent tasks, and the spatial or virtual distance between agents. To capture such rich MAS requirements, we model agent interactions via multiple directed graphs, and introduce a new logic -- Spatio-Temporal Logic with Graph Operators (STL-GO). The key innovation in STL-GO are graph operators that enable us to reason about the number of agents along either the incoming or outgoing edges of the underlying interaction graph that satisfy a given property of interest; for example, the requirement that an agent should sense at least two neighboring agents whose task graphs indicate the ability to collaborate. We then propose novel distributed monitoring conditions for individual agents that use only local information to determine whether or not an STL-GO specification is satisfied. We compare the expressivity of STL-GO against existing spatio-temporal logic formalisms, and demonstrate the utility of STL-GO and our distributed monitors in a bike-sharing and a multi-drone case study.
中文摘要:本文提出了一种新颖的时空逻辑STL-GO,通过图运算符建模多智能体系统的复杂需求,并开发了基于局部信息的分布式监测方法验证规范,通过共享单车和多无人机案例验证了其有效性。
English Summary: This paper introduces STL-GO, a novel spatio-temporal logic with graph operators for modeling complex multi-agent system requirements, and proposes distributed monitoring methods using local information to verify specifications, validated through case studies.
Authors:Derek Li, Jiaming Zhou, Amirreza Kazemi, Qianyi Sun, Abbas Ghaddar, Mohammad Ali Alomrani, Liheng Ma, Yu Luo, Dong Li, Feng Wen, Jianye Hao, Mark Coates, Yingxue Zhang
Abstract:
The advancement of general-purpose artificial intelligence relies on large language models (LLMs) that excel across a wide range of tasks, from structured reasoning to creative generation. However, post-training methods like Supervised Fine-Tuning (SFT) often struggle with generalization, favoring memorization over transferable learning. In this work, we introduce Omni-Thinker, a unified reinforcement learning (RL) framework that enhances LLM performance across diverse tasks by combining rule-based verifiable rewards with generative preference signals via LLM-as-a-Judge evaluations. Our approach enables consistent optimization across task types and scales RL-based training to subjective domains. We further investigate training strategies, demonstrating that a curriculum-based progression that orders tasks from structured to open-ended improves performance and reduces forgetting. Experimental results across four domains reveal that curriculum learning improves performance by 5.2% over joint training and 9.1% over model merging. These results highlight the importance of task-aware sampling and hybrid supervision in scaling RL-based post-training for general-purpose LLMs.
中文:Omni-Thinker是一个统一的强化学习框架,通过混合奖励和基于反向迁移的调度策略,提升大语言模型在多任务中的性能,显著优于现有训练方法。
English: Omni-Thinker is a unified reinforcement learning framework that enhances large language models' performance across diverse tasks through hybrid rewards and backward-transfer-guided scheduling, achieving significant improvements over existing methods.
Authors:Derek Li, Jiaming Zhou, Leo Maxime Brunswic, Abbas Ghaddar, Qianyi Sun, Liheng Ma, Yu Luo, Dong Li, Mark Coates, Jianye Hao, Yingxue Zhang
Abstract:
The pursuit of general-purpose artificial intelligence depends on large language models (LLMs) that can handle both structured reasoning and open-ended generation. We present Omni-Thinker, a unified reinforcement learning (RL) framework that scales LLMs across diverse tasks by combining hybrid rewards with backward-transfer-guided scheduling. Hybrid rewards integrate rule-based verifiable signals with preference-based evaluations from an LLM-as-a-Judge, enabling learning in both deterministic and subjective domains. Our scheduler orders tasks according to accuracy backward transfer (BWT), reducing forgetting and improving multi-task performance. Experiments across four domains show gains of 6.2% over joint training and 12.4% over model merging. Moreover, we demonstrate that simple assumptions on accuracy transfer yield accurate predictions of curriculum outcomes, with entropy dynamics explaining deviations due to generative tasks. These findings underscore the importance of BWT-aware scheduling and hybrid supervision for scaling RL-based post-training toward general-purpose LLMs.
中文:Omni-Thinker是一个统一的强化学习框架,通过混合奖励和基于反向迁移的调度策略,提升大语言模型在多任务中的性能,显著优于现有训练方法。
English: Omni-Thinker is a unified reinforcement learning framework that enhances large language models' performance across diverse tasks through hybrid rewards and backward-transfer-guided scheduling, achieving significant improvements over existing methods.
Authors:Federico Mason, Tommaso Zugno, Matteo Drago, Marco Giordani, Mate Boban, Michele Zorzi
Abstract:
Predictive Quality of Service (PQoS) makes it possible to anticipate QoS changes, e.g., in wireless networks, and trigger appropriate countermeasures to avoid performance degradation. Hence, PQoS is extremely useful for automotive applications such as teleoperated driving, which poses strict constraints in terms of latency and reliability. A promising tool for PQoS is given by Reinforcement Learning (RL), a methodology that enables the design of decision-making strategies for stochastic optimization. In this manuscript, we present PRATA, a new simulation framework to enable PRedictive QoS based on AI for Teleoperated driving Applications. PRATA consists of a modular pipeline that includes (i) an end-to-end protocol stack to simulate the 5G Radio Access Network (RAN), (ii) a tool for generating automotive data, and (iii) an Artificial Intelligence (AI) unit to optimize PQoS decisions. To prove its utility, we use PRATA to design an RL unit, named RAN-AI, to optimize the segmentation level of teleoperated driving data in the event of resource saturation or channel degradation. Hence, we show that the RAN-AI entity efficiently balances the trade-off between QoS and Quality of Experience (QoE) that characterize teleoperated driving applications, almost doubling the system performance compared to baseline approaches. In addition, by varying the learning settings of the RAN-AI entity, we investigate the impact of the state space and the relative cost of acquiring network data that are necessary for the implementation of RL.
中文摘要:预测性服务质量(PQoS)通过强化学习优化远程驾驶数据的分段处理,在资源受限时主动调整网络策略,相比基准方法可近乎双倍提升系统性能。
English Summary: Predictive Quality of Service (PQoS) enables proactive network management for teleoperated driving by using Reinforcement Learning to optimize data segmentation, significantly improving system performance over baseline methods.
Authors:Song Mao, Lejun Cheng, Pinlong Cai, Guohang Yan, Ding Wang, Botian Shi
Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities in various applications. However, their use as writing assistants in specialized domains like finance, medicine, and law is often hampered by a lack of deep domain-specific knowledge and a tendency to hallucinate. Existing solutions, such as Retrieval-Augmented Generation (RAG), can suffer from inconsistency across multiple retrieval steps, while online search-based methods often degrade quality due to unreliable web content. To address these challenges, we introduce DeepWriter, a customizable, multimodal, long-form writing assistant that operates on a curated, offline knowledge base. DeepWriter leverages a novel pipeline that involves task decomposition, outline generation, multimodal retrieval, and section-by-section composition with reflection. By deeply mining information from a structured corpus and incorporating both textual and visual elements, DeepWriter generates coherent, factually grounded, and professional-grade documents. We also propose a hierarchical knowledge representation to enhance retrieval efficiency and accuracy. Our experiments on financial report generation demonstrate that DeepWriter produces high-quality, verifiable articles that surpasses existing baselines in factual accuracy and generated content quality.
Chinese: DeepWriter是一款可定制的多模态写作助手,它利用离线知识库和创新的流程,在专业领域生成连贯且事实准确的文档,在质量和可靠性上超越了现有方法。
English: DeepWriter is a customizable, multimodal writing assistant that uses an offline knowledge base and a novel pipeline to generate coherent, factually accurate documents in specialized domains, outperforming existing methods in quality and reliability.
Authors:Yanchen Guan, Haicheng Liao, Chengyue Wang, Xingcheng Liu, Jiaxun Zhang, Zhenning Li
Abstract:
Reliable anticipation of traffic accidents is essential for advancing autonomous driving systems. However, this objective is limited by two fundamental challenges: the scarcity of diverse, high-quality training data and the frequent absence of crucial object-level cues due to environmental disruptions or sensor deficiencies. To tackle these issues, we propose a comprehensive framework combining generative scene augmentation with adaptive temporal reasoning. Specifically, we develop a video generation pipeline that utilizes a world model guided by domain-informed prompts to create high-resolution, statistically consistent driving scenarios, particularly enriching the coverage of edge cases and complex interactions. In parallel, we construct a dynamic prediction model that encodes spatio-temporal relationships through strengthened graph convolutions and dilated temporal operators, effectively addressing data incompleteness and transient visual noise. Furthermore, we release a new benchmark dataset designed to better capture diverse real-world driving risks. Extensive experiments on public and newly released datasets confirm that our framework enhances both the accuracy and lead time of accident anticipation, offering a robust solution to current data and modeling limitations in safety-critical autonomous driving applications.
中文:我们的框架通过视频生成管道创建逼真驾驶场景,并采用动态预测模型提升事故预判的准确性和提前量,经新基准数据集验证,显著增强了自动驾驶安全性。
English: Our framework enhances autonomous driving safety by generating realistic driving scenarios with a video pipeline and employing a dynamic prediction model to improve accident anticipation accuracy and lead time, validated on new benchmark datasets.
Authors:Yanchen Guan, Haicheng Liao, Chengyue Wang, Bonan Wang, Jiaxun Zhang, Jia Hu, Zhenning Li
Abstract:
Developing precise and computationally efficient traffic accident anticipation system is crucial for contemporary autonomous driving technologies, enabling timely intervention and loss prevention. In this paper, we propose an accident anticipation framework employing a dual-branch architecture that effectively integrates visual information from dashcam videos with structured textual data derived from accident reports. Furthermore, we introduce a feature aggregation method that facilitates seamless integration of multimodal inputs through large models (GPT-4o, Long-CLIP), complemented by targeted prompt engineering strategies to produce actionable feedback and standardized accident archives. Comprehensive evaluations conducted on benchmark datasets (DAD, CCD, and A3D) validate the superior predictive accuracy, enhanced responsiveness, reduced computational overhead, and improved interpretability of our approach, thus establishing a new benchmark for state-of-the-art performance in traffic accident anticipation.
中文摘要:本文提出一种双分支交通事故预警框架,通过大模型实现行车记录视频与事故报告文本的多模态特征融合,在基准数据集上验证了该方法具有更优的预测性能与计算效率。
English Summary: This paper introduces a dual-branch accident anticipation framework that integrates dashcam video data with accident report text through multimodal feature aggregation using large models, achieving superior predictive performance and computational efficiency on benchmark datasets.
Authors:Yuzhou Ji, Ke Ma, Hong Cai, Anchun Zhang, Lizhuang Ma, Xin Tan
Abstract:
Dynamic driving scene reconstruction is of great importance in fields like digital twin system and autonomous driving simulation. However, unacceptable degradation occurs when the view deviates from the input trajectory, leading to corrupted background and vehicle models. To improve reconstruction quality on novel trajectory, existing methods are subject to various limitations including inconsistency, deformation, and time consumption. This paper proposes LidarPainter, a one-step diffusion model that recovers consistent driving views from sparse LiDAR condition and artifact-corrupted renderings in real-time, enabling high-fidelity lane shifts in driving scene reconstruction. Extensive experiments show that LidarPainter outperforms state-of-the-art methods in speed, quality and resource efficiency, specifically 7 x faster than StreetCrafter with only one fifth of GPU memory required. LidarPainter also supports stylized generation using text prompts such as "foggy" and "night", allowing for a diverse expansion of the existing asset library.
中文摘要:LidarPainter是一种实时扩散模型,能从稀疏激光雷达数据中重建高保真驾驶场景,在速度和效率上超越现有方法,并通过文本提示实现风格化生成。
English Summary: LidarPainter is a real-time diffusion model that reconstructs high-fidelity driving scenes from sparse LiDAR data, outperforming existing methods in speed and efficiency while enabling stylized generation through text prompts.
Authors:Longchao Da, Xiangrui Liu, Mithun Shivakoti, Thirulogasankar Pranav Kutralingam, Yezhou Yang, Hua Wei
Abstract:
Heatwaves pose a significant threat to public health, especially as global warming intensifies. However, current routing systems (e.g., online maps) fail to incorporate shade information due to the difficulty of estimating shades directly from noisy satellite imagery and the limited availability of training data for generative models. In this paper, we address these challenges through two main contributions. First, we build an extensive dataset covering diverse longitude-latitude regions, varying levels of building density, and different urban layouts. Leveraging Blender-based 3D simulations alongside building outlines, we capture building shadows under various solar zenith angles throughout the year and at different times of day. These simulated shadows are aligned with satellite images, providing a rich resource for learning shade patterns. Second, we propose the DeepShade, a diffusion-based model designed to learn and synthesize shade variations over time. It emphasizes the nuance of edge features by jointly considering RGB with the Canny edge layer, and incorporates contrastive learning to capture the temporal change rules of shade. Then, by conditioning on textual descriptions of known conditions (e.g., time of day, solar angles), our framework provides improved performance in generating shade images. We demonstrate the utility of our approach by using our shade predictions to calculate shade ratios for real-world route planning in Tempe, Arizona. We believe this work will benefit society by providing a reference for urban planning in extreme heat weather and its potential practical applications in the environment.
中文: 本文提出DeepShade这一基于扩散的模型,通过3D模拟和对比学习从卫星图像生成精确的阴影图像,在应对极端高温的城市规划和路径优化方面具有应用潜力。
English: This paper introduces DeepShade, a diffusion-based model that generates accurate shade images from satellite data using 3D simulations and contrastive learning, with potential applications in heatwave-responsive urban planning and route optimization.
Authors:Francesco Paissan, Gordon Wichern, Yoshiki Masuyama, Ryo Aihara, François G. Germain, Kohei Saijo, Jonathan Le Roux
Abstract:
Time-Frequency (TF) dual-path models are currently among the best performing audio source separation network architectures, achieving state-of-the-art performance in speech enhancement, music source separation, and cinematic audio source separation. While they are characterized by a relatively low parameter count, they still require a considerable number of operations, implying a higher execution time. This problem is exacerbated by the trend towards bigger models trained on large amounts of data to solve more general tasks, such as the recently introduced task-aware unified source separation (TUSS) model. TUSS, which aims to solve audio source separation tasks using a single, conditional model, is built upon TF-Locoformer, a TF dual-path model combining convolution and attention layers. The task definition comes in the form of a sequence of prompts that specify the number and type of sources to be extracted. In this paper, we analyze the design choices of TUSS with the goal of optimizing its performance-complexity trade-off. We derive two more efficient models, FasTUSS-8.3G and FasTUSS-11.7G that reduce the original model's operations by 81\% and 73\% with minor performance drops of 1.2~dB and 0.4~dB averaged over all benchmarks, respectively. Additionally, we investigate the impact of prompt conditioning to derive a causal TUSS model.
中文: 本文提出了两种优化模型FasTUSS-8.3G和FasTUSS-11.7G,在保持性能损失最小的情况下,分别将计算量减少了81%和73%,有效提升了时频双路径架构在音频源分离任务中的效率。
English: The paper introduces two optimized models, FasTUSS-8.3G and FasTUSS-11.7G, which significantly reduce computational operations by 81% and 73% respectively while maintaining minimal performance drops, improving the efficiency of the time-frequency dual-path architecture for audio source separation.
Authors:Jialong Mai, Xiaofen Xing, Yawei Li, Weidong Chen, Zhipeng Li, Jingyuan Xing, Xiangmin Xu
Abstract:
Recent research has focused on applying speech large language model (SLLM) to improve speech emotion recognition (SER). However, the inherently high frame rate in speech modality severely limits the signal processing and understanding capabilities of SLLM. For example, a SLLM with a 4K context window can only process 80 seconds of audio at 50Hz feature sampling rate before reaching its capacity limit. Input token compression methods used in SLLM overlook the continuity and inertia of emotions across multiple conversation turns. This paper proposes a Dynamic Parameter Memory (DPM) mechanism with contextual semantics and sentence-level emotion encoding, enabling processing of unlimited-length audio with limited context windows in SLLM. Specifically, DPM progressively encodes sentence-level information and emotions into a temporary LoRA module during inference to effectively "memorize" the contextual information. We trained an emotion SLLM as a backbone and incorporated our DPM into inference for emotion recognition in conversation (ERC). Experimental results on the IEMOCAP dataset show that DPM significantly improves the emotion recognition capabilities of SLLM when processing long audio sequences, achieving state-of-the-art performance.
中文: 本文提出动态参数记忆机制,通过编码句子级上下文和情感信息,使语音大语言模型能够处理无限长度音频,在长序列情感识别中实现了最优性能。
English: This paper introduces a Dynamic Parameter Memory (DPM) mechanism that enables speech large language models to process unlimited-length audio by encoding sentence-level contextual and emotional information, achieving state-of-the-art emotion recognition performance on long sequences.
Authors:Yoshiki Masuyama, François G. Germain, Gordon Wichern, Christopher Ick, Jonathan Le Roux
Abstract:
This paper presents a physics-informed neural network (PINN) for modeling first-order Ambisonic (FOA) room impulse responses (RIRs). PINNs have demonstrated promising performance in sound field interpolation by combining the powerful modeling capability of neural networks and the physical principles of sound propagation. In room acoustics, PINNs have typically been trained to represent the sound pressure measured by omnidirectional microphones where the wave equation or its frequency-domain counterpart, i.e., the Helmholtz equation, is leveraged. Meanwhile, FOA RIRs additionally provide spatial characteristics and are useful for immersive audio generation with a wide range of applications. In this paper, we extend the PINN framework to model FOA RIRs. We derive two physics-informed priors for FOA RIRs based on the correspondence between the particle velocity and the (X, Y, Z)-channels of FOA. These priors associate the predicted W-channel and other channels through their partial derivatives and impose the physically feasible relationship on the four channels. Our experiments confirm the effectiveness of the proposed method compared with a neural network without the physics-informed prior.
本文提出了一种物理信息神经网络,通过引入基于粒子速度与声道对应关系的物理先验来建模一阶Ambisonic室内脉冲响应,实验证明该方法优于无物理先验的神经网络模型。
This paper introduces a physics-informed neural network (PINN) to model first-order Ambisonic room impulse responses by incorporating physical priors that enforce relationships between the four channels, demonstrating improved performance over conventional neural networks.
Authors:Jiaxun Zhang, Haicheng Liao, Yumu Xie, Chengyue Wang, Yanchen Guan, Bin Rao, Zhenning Li
Abstract:
Accurate accident anticipation remains challenging when driver cognition and dynamic road conditions are underrepresented in predictive models. In this paper, we propose CAMERA (Context-Aware Multi-modal Enhanced Risk Anticipation), a multi-modal framework integrating dashcam video, textual annotations, and driver attention maps for robust accident anticipation. Unlike existing methods that rely on static or environment-centric thresholds, CAMERA employs an adaptive mechanism guided by scene complexity and gaze entropy, reducing false alarms while maintaining high recall in dynamic, multi-agent traffic scenarios. A hierarchical fusion pipeline with Bi-GRU (Bidirectional GRU) captures spatio-temporal dependencies, while a Geo-Context Vision-Language module translates 3D spatial relationships into interpretable, human-centric alerts. Evaluations on the DADA-2000 and benchmarks show that CAMERA achieves state-of-the-art performance, improving accuracy and lead time. These results demonstrate the effectiveness of modeling driver attention, contextual description, and adaptive risk thresholds to enable more reliable accident anticipation.
中文:CAMERA框架通过融合多模态数据和自适应风险阈值,在动态交通场景中显著提升了事故预测的准确性和预警时间。
English: The CAMERA framework enhances accident prediction by integrating multi-modal data and adaptive risk thresholds, achieving superior accuracy and lead time in dynamic traffic scenarios.
Authors:Sanidhya Vijayvargiya, Aditya Bharat Soni, Xuhui Zhou, Zora Zhiruo Wang, Nouha Dziri, Graham Neubig, Maarten Sap
Abstract:
Recent advances in AI agents capable of solving complex, everyday tasks, from scheduling to customer service, have enabled deployment in real-world settings, but their possibilities for unsafe behavior demands rigorous evaluation. While prior benchmarks have attempted to assess agent safety, most fall short by relying on simulated environments, narrow task domains, or unrealistic tool abstractions. We introduce OpenAgentSafety, a comprehensive and modular framework for evaluating agent behavior across eight critical risk categories. Unlike prior work, our framework evaluates agents that interact with real tools, including web browsers, code execution environments, file systems, bash shells, and messaging platforms; and supports over 350 multi-turn, multi-user tasks spanning both benign and adversarial user intents. OpenAgentSafety is designed for extensibility, allowing researchers to add tools, tasks, websites, and adversarial strategies with minimal effort. It combines rule-based analysis with LLM-as-judge assessments to detect both overt and subtle unsafe behaviors. Empirical analysis of five prominent LLMs in agentic scenarios reveals unsafe behavior in 51.2% of safety-vulnerable tasks with Claude-Sonnet-3.7, to 72.7% with o3-mini, highlighting critical safety vulnerabilities and the need for stronger safeguards before real-world deployment.
中文: 针对能够处理复杂任务的AI智能体,OpenAgentSafety这一模块化框架通过真实工具和多样化任务进行安全评估,发现主流大语言模型存在显著不安全行为,凸显了在实际部署前加强安全防护的必要性。
English: Recent AI agents capable of handling complex tasks require rigorous safety evaluation, leading to the development of OpenAgentSafety, a modular framework that tests agents using real tools and diverse tasks, revealing significant unsafe behaviors in prominent LLMs and underscoring the need for stronger safeguards.
Authors:Ruijie Lu, Yu Liu, Jiaxiang Tang, Junfeng Ni, Yuxiang Wang, Diwen Wan, Gang Zeng, Yixin Chen, Siyuan Huang
Abstract:
Generating articulated objects, such as laptops and microwaves, is a crucial yet challenging task with extensive applications in Embodied AI and AR/VR. Current image-to-3D methods primarily focus on surface geometry and texture, neglecting part decomposition and articulation modeling. Meanwhile, neural reconstruction approaches (e.g., NeRF or Gaussian Splatting) rely on dense multi-view or interaction data, limiting their scalability. In this paper, we introduce DreamArt, a novel framework for generating high-fidelity, interactable articulated assets from single-view images. DreamArt employs a three-stage pipeline: firstly, it reconstructs part-segmented and complete 3D object meshes through a combination of image-to-3D generation, mask-prompted 3D segmentation, and part amodal completion. Second, we fine-tune a video diffusion model to capture part-level articulation priors, leveraging movable part masks as prompt and amodal images to mitigate ambiguities caused by occlusion. Finally, DreamArt optimizes the articulation motion, represented by a dual quaternion, and conducts global texture refinement and repainting to ensure coherent, high-quality textures across all parts. Experimental results demonstrate that DreamArt effectively generates high-quality articulated objects, possessing accurate part shape, high appearance fidelity, and plausible articulation, thereby providing a scalable solution for articulated asset generation. Our project page is available at https://dream-art-0.github.io/DreamArt/.
中文: DreamArt是一种创新框架,通过三阶段流程从单视图图像生成高保真可交互的铰接式3D对象,包括部件分割网格重建、基于视频扩散的关节建模,以及结合运动优化和纹理修复的最终处理。
English: DreamArt is a novel framework that generates high-fidelity, articulated 3D objects from single-view images through a three-stage process involving part-segmented mesh reconstruction, articulation modeling via video diffusion, and motion optimization with texture refinement.
Authors:Yuhang Zhang, Jiaqi Liu, Chengkai Xu, Peng Hang, Jian Sun
Abstract:
A principal barrier to large-scale deployment of urban autonomous driving systems lies in the prevalence of complex scenarios and edge cases. Existing systems fail to effectively interpret semantic information within traffic contexts and discern intentions of other participants, consequently generating decisions misaligned with skilled drivers' reasoning patterns. We present LeAD, a dual-rate autonomous driving architecture integrating imitation learning-based end-to-end (E2E) frameworks with large language model (LLM) augmentation. The high-frequency E2E subsystem maintains real-time perception-planning-control cycles, while the low-frequency LLM module enhances scenario comprehension through multi-modal perception fusion with HD maps and derives optimal decisions via chain-of-thought (CoT) reasoning when baseline planners encounter capability limitations. Our experimental evaluation in the CARLA Simulator demonstrates LeAD's superior handling of unconventional scenarios, achieving 71 points on Leaderboard V1 benchmark, with a route completion of 93%.
中文:LeAD自动驾驶系统融合模仿学习与大语言模型,通过双速率架构提升复杂场景决策能力,在模拟测试中展现出处理非常规情况的卓越性能,完成率达93%。
English: The LeAD autonomous driving system combines imitation learning with large language models to enhance decision-making in complex urban scenarios, achieving superior performance in handling edge cases as demonstrated by its high scores in simulation tests.
Authors:Xiaomeng Hu, Pin-Yu Chen, Tsung-Yi Ho
Abstract:
As large language models (LLMs) become more integral to society and technology, ensuring their safety becomes essential. Jailbreak attacks exploit vulnerabilities to bypass safety guardrails, posing a significant threat. However, the mechanisms enabling these attacks are not well understood. In this paper, we reveal a universal phenomenon that occurs during jailbreak attacks: Attention Slipping. During this phenomenon, the model gradually reduces the attention it allocates to unsafe requests in a user query during the attack process, ultimately causing a jailbreak. We show Attention Slipping is consistent across various jailbreak methods, including gradient-based token replacement, prompt-level template refinement, and in-context learning. Additionally, we evaluate two defenses based on query perturbation, Token Highlighter and SmoothLLM, and find they indirectly mitigate Attention Slipping, with their effectiveness positively correlated with the degree of mitigation achieved. Inspired by this finding, we propose Attention Sharpening, a new defense that directly counters Attention Slipping by sharpening the attention score distribution using temperature scaling. Experiments on four leading LLMs (Gemma2-9B-It, Llama3.1-8B-It, Qwen2.5-7B-It, Mistral-7B-It v0.2) show that our method effectively resists various jailbreak attacks while maintaining performance on benign tasks on AlpacaEval. Importantly, Attention Sharpening introduces no additional computational or memory overhead, making it an efficient and practical solution for real-world deployment.
中文摘要:本文揭示了越狱攻击中的普遍现象"注意力滑移"——模型在攻击过程中逐渐降低对危险请求的关注度,并提出直接对抗该现象的"注意力锐化"防御方法,该方法无需额外计算开销即可有效抵御各类越狱攻击。
English Summary: This paper identifies "Attention Slipping" as a universal mechanism in jailbreak attacks where models gradually reduce attention to unsafe requests, and proposes Attention Sharpening defense that directly counters this phenomenon without computational overhead.
Authors:Yizhou Wang, Song Mao, Yang Chen, Yufan Shen, Yinqiao Yan, Pinlong Cai, Ding Wang, Guohang Yan, Zhi Yu, Xuming Hu, Botian Shi
Abstract:
Recent multimodal large language models (MLLMs) increasingly integrate multiple vision encoders to improve performance on various benchmarks, assuming that diverse pretraining objectives yield complementary visual signals. However, we show this assumption often fails in practice. Through systematic encoder masking across representative multi encoder MLLMs, we find that performance typically degrades gracefully and sometimes even improves when selected encoders are masked, revealing pervasive encoder redundancy. To quantify this effect, we introduce two principled metrics: the Conditional Utilization Rate (CUR), which measures an encoders marginal contribution in the presence of others, and the Information Gap (IG), which captures heterogeneity in encoder utility within a model. Using these tools, we observe (i) strong specialization on tasks like OCR and Chart, where a single encoder can dominate with a CUR greater than 90%, (ii) high redundancy on general VQA and knowledge-based tasks, where encoders are largely interchangeable, (iii) instances of detrimental encoders with negative CUR. Notably, masking specific encoders can yield up to 16% higher accuracy on a specific task category and 3.6% overall performance boost compared to the full model.Furthermore, single and dual encoder variants recover over 90% of baseline on most non OCR tasks. Our analysis challenges the more encoders are better heuristic in MLLMs and provides actionable diagnostics for developing more efficient and effective multimodal architectures.
中文: 多模态大语言模型采用多个视觉编码器时普遍存在编码器冗余现象,新增编码器不仅收益递减甚至可能损害性能,本文提出的条件利用率与信息差距指标可有效量化这种冗余。
English: Multimodal Large Language Models using multiple vision encoders often suffer from encoder redundancy, where additional encoders provide diminishing returns or even degrade performance, as measured by the proposed Conditional Utilization Rate and Information Gap metrics.
Authors:Mohammad Ali Alomrani, Yingxue Zhang, Derek Li, Qianyi Sun, Soumyasundar Pal, Zhanguang Zhang, Yaochen Hu, Rohan Deepak Ajwani, Antonios Valkanas, Raika Karimi, Peng Cheng, Yunzhou Wang, Pengyi Liao, Hanrui Huang, Bin Wang, Jianye Hao, Mark Coates
Abstract:
Large language models (LLMs) have rapidly progressed into general-purpose agents capable of solving a broad spectrum of tasks. However, current models remain inefficient at reasoning: they apply fixed inference-time compute regardless of task complexity, often overthinking simple problems while underthinking hard ones. This survey presents a comprehensive review of efficient test-time compute (TTC) strategies, which aim to improve the computational efficiency of LLM reasoning. We introduce a two-tiered taxonomy that distinguishes between L1-controllability, methods that operate under fixed compute budgets, and L2-adaptiveness, methods that dynamically scale inference based on input difficulty or model confidence. We benchmark leading proprietary LLMs across diverse datasets, highlighting critical trade-offs between reasoning performance and token usage. Compared to prior surveys on efficient reasoning, our review emphasizes the practical control, adaptability, and scalability of TTC methods. Finally, we discuss emerging trends such as hybrid thinking models and identify key challenges for future work towards making LLMs more computationally efficient, robust, and responsive to user constraints.
中文: 本综述系统评述了大语言模型的高效测试时计算策略,提出固定预算与自适应方法的双层分类体系,通过性能与计算消耗的基准测试推动高效推理研究发展。
English: This survey comprehensively reviews efficient test-time compute strategies for large language models, introducing a taxonomy of fixed-budget and adaptive methods while benchmarking performance-computation tradeoffs to advance computationally efficient reasoning.
Authors:Bin Rao, Haicheng Liao, Yanchen Guan, Chengyue Wang, Bonan Wang, Jiaxun Zhang, Zhenning Li
Abstract:
Accurately predicting the future trajectories of traffic agents is essential in autonomous driving. However, due to the inherent imbalance in trajectory distributions, tail data in natural datasets often represents more complex and hazardous scenarios. Existing studies typically rely solely on a base model's prediction error, without considering the diversity and uncertainty of long-tail trajectory patterns. We propose an adaptive momentum and decoupled contrastive learning framework (AMD), which integrates unsupervised and supervised contrastive learning strategies. By leveraging an improved momentum contrast learning (MoCo-DT) and decoupled contrastive learning (DCL) module, our framework enhances the model's ability to recognize rare and complex trajectories. Additionally, we design four types of trajectory random augmentation methods and introduce an online iterative clustering strategy, allowing the model to dynamically update pseudo-labels and better adapt to the distributional shifts in long-tail data. We propose three different criteria to define long-tail trajectories and conduct extensive comparative experiments on the nuScenes and ETH$/$UCY datasets. The results show that AMD not only achieves optimal performance in long-tail trajectory prediction but also demonstrates outstanding overall prediction accuracy.
中文摘要:提出的AMD框架通过结合自适应动量和解耦对比学习,有效提升了自动驾驶中对长尾分布轨迹的预测能力,在nuScenes和ETH/UCY数据集上实现了对复杂场景和整体预测精度的最优化表现。
English Summary: The proposed AMD framework enhances autonomous driving trajectory prediction by integrating adaptive momentum and decoupled contrastive learning to better handle long-tail distribution challenges, achieving superior performance in both rare scenarios and overall accuracy.
Authors:Qingya Li, Shengcai Liu, Wenjie Chen, Juan Zou, Ke Tang, Xin Yao
Abstract:
The capacitated arc routing problem with time-dependent service costs (CARPTDSC) is a challenging combinatorial optimization problem that arises from winter gritting applications. CARPTDSC has two main challenges about time consumption. First, it is an NP-hard problem. Second, the time-dependent service costs of tasks require frequent evaluations during the search process, significantly increasing computational effort. These challenges make it difficult for existing algorithms to perform efficient searches, often resulting in limited efficiency. To address these issues, this paper proposes a knowledge-guided memetic algorithm with golden section search and negatively correlated search (KGMA-GN), where two knowledge-guided strategies are introduced to improve search efficiency. First, a knowledge-guided initialization strategy (KGIS) is proposed to generate high-quality initial solutions to speed up convergence. Second, a knowledge-guided small-step-size local search strategy (KGSLSS) is proposed to filter out invalid moves, thereby reducing unnecessary evaluations and saving the computation time. Experimental results on five benchmark test sets, including both small- and larger-scale instances, demonstrate that KGMA-GN achieves higher search efficiency than the state-of-the-art methods. Moreover, the ablation study further confirms that the knowledge-guided local search operators in KGSLSS can significantly reduce runtime compared to traditional operators, especially for the knowledge-guided swap operator, which achieves more than a tenfold improvement in speed.
中文: 本文提出了一种结合黄金分割搜索和负相关搜索的知识引导遗传算法(KGMA-GN),通过知识引导的初始化和局部搜索策略,有效解决了具有时间依赖服务成本的容量约束弧路径问题的高计算成本难题,在各类测试集上显著提升了搜索效率。
English: This paper introduces a knowledge-guided memetic algorithm with golden section search and negatively correlated search (KGMA-GN) to tackle the computationally intensive capacitated arc routing problem with time-dependent service costs, demonstrating superior efficiency over existing methods through strategic initialization and local search optimizations.
Authors:Ãzlem TuÄfe Demir, Muhammed Selman Somuncu, Ahmet M. Elbir, Emil Björnson
Abstract:
Cell-free massive MIMO is a key 6G technology, offering superior spectral and energy efficiency. However, its dense deployment of low-cost access points (APs) makes hardware impairments unavoidable. While narrowband impairments are well-studied, their impact in wideband systems remains unexplored. This paper provides the first comprehensive analysis of hardware impairments, such as nonlinear distortion in low-noise amplifiers, phase noise, in-phase-quadrature imbalance, and low-resolution analog-to-digital converters, on uplink spectral efficiency in cell-free massive MIMO. Using an OFDM waveform and centralized processing, APs share channel state information for joint uplink combining. Leveraging Bussgang decomposition, we derive a distortion-aware combining vector that optimizes spectral efficiency by modeling distortion as independent colored noise.
中文摘要:本文首次全面分析了无蜂窝大规模MIMO系统中多种硬件损伤对上行链路频谱效率的影响,提出了一种将失真建模为有色噪声的失真感知合并方法以优化系统性能。
English Summary: This paper presents the first comprehensive analysis of how various hardware impairments affect uplink spectral efficiency in cell-free massive MIMO systems, proposing a distortion-aware combining method that models distortion as colored noise to optimize performance.
Authors:Guangchen Lan, Sipeng Zhang, Tianle Wang, Yuwei Zhang, Daoan Zhang, Xinpeng Wei, Xiaoman Pan, Hongming Zhang, Dong-Jun Han, Christopher G. Brinton
Abstract:
As the era of large language models (LLMs) on behalf of users unfolds, Preference Optimization (PO) methods have become a central approach to aligning LLMs with human preferences and improving performance. We propose Maximum a Posteriori Preference Optimization (MaPPO), a framework for learning from preferences that explicitly incorporates prior reward knowledge into the optimization objective. While existing methods such as Direct Preference Optimization (DPO) and its variants treat preference learning as a Maximum Likelihood Estimation (MLE) problem, MaPPO extends this paradigm by integrating prior reward estimates into a principled Maximum a Posteriori (MaP) objective. This not only generalizes DPO and its variants, but also enhances alignment by mitigating the oversimplified binary classification of responses. More importantly, MaPPO introduces no additional hyperparameter, and supports preference optimization in both offline and online settings. In addition, MaPPO can be used as a plugin with consistent improvement on DPO variants, including widely used SimPO, IPO, and CPO. Extensive empirical evaluations of different model sizes and model series on three standard benchmarks, including MT-Bench, AlpacaEval 2.0, and Arena-Hard, demonstrate consistent improvements in alignment performance without sacrificing computational efficiency.
中文: 本文提出最大后验偏好优化(MaPPO)框架,将先验奖励知识融入偏好学习,在泛化DPO等现有方法的同时,无需额外超参数即可在多种基准测试中持续提升对齐性能。
English: This paper introduces Maximum a Posteriori Preference Optimization (MaPPO), a framework that incorporates prior reward knowledge into preference learning, generalizing existing methods like DPO while enhancing alignment performance without additional hyperparameters across various benchmarks.
Authors:Sara Papi, Maike Züfle, Marco Gaido, Beatrice Savoldi, Danni Liu, Ioannis Douros, Luisa Bentivogli, Jan Niehues
Abstract:
Recent advances in large language models have catalyzed the development of multimodal LLMs (MLLMs) that integrate text, speech, and vision within unified frameworks. As MLLMs evolve from narrow, monolingual, task-specific systems to general-purpose instruction-following models, a key frontier lies in evaluating their multilingual and multimodal capabilities over both long and short contexts. However, existing benchmarks fall short in evaluating these dimensions jointly: they are often limited to English, mostly focus on one single modality at a time, rely on short-form contexts, or lack human annotations -- hindering comprehensive assessment of model performance across languages, modalities, and task complexity. To address these gaps, we introduce MCIF (Multimodal Crosslingual Instruction Following), the first multilingual human-annotated benchmark based on scientific talks that is designed to evaluate instruction-following in crosslingual, multimodal settings over both short- and long-form inputs. MCIF spans three core modalities -- speech, vision, and text -- and four diverse languages (English, German, Italian, and Chinese), enabling a comprehensive evaluation of MLLMs' abilities to interpret instructions across languages and combine them with multimodal contextual information. MCIF is released under a CC-BY 4.0 license to encourage open research and progress in MLLMs development.
Chinese: 大型语言模型的最新进展催生了融合文本、语音和视觉的多模态LLM(MLLM),但现有基准在多语言、多模态及上下文长度方面评估不足,因此推出了MCIF这一多语言人工标注基准,用于评估MLLM在跨语言多模态环境中的指令遵循能力。
English: Recent advances in large language models have led to multimodal LLMs (MLLMs) that integrate text, speech, and vision, but existing benchmarks lack comprehensive evaluation across languages, modalities, and context lengths, prompting the introduction of MCIF, a multilingual human-annotated benchmark for assessing MLLMs' instruction-following abilities in crosslingual, multimodal settings.
Authors:Zhen Wan, Chao-Han Huck Yang, Yahan Yu, Jinchuan Tian, Sheng Li, Ke Hu, Zhehuai Chen, Shinji Watanabe, Fei Cheng, Chenhui Chu, Sadao Kurohashi
Abstract:
We introduce Speech-based Intelligence Quotient (SIQ) as a new form of human cognition-inspired evaluation pipeline for voice understanding large language models, LLM Voice, designed to assess their voice understanding ability. Moving beyond popular voice understanding metrics such as word error rate (WER), SIQ examines LLM Voice across three cognitive levels motivated by Bloom's Taxonomy: (1) Remembering (i.e., WER for verbatim accuracy); (2) Understanding (i.e., similarity of LLM's interpretations); and (3) Application (i.e., QA accuracy for simulating downstream tasks). We demonstrate that SIQ not only quantifies voice understanding abilities but also provides unified comparisons between cascaded methods (e.g., ASR LLM) and end-to-end models, identifies annotation errors in existing benchmarks, and detects hallucinations in LLM Voice. Our framework represents a first-of-its-kind intelligence examination that bridges cognitive principles with voice-oriented benchmarks, while exposing overlooked challenges in multi-modal training.
中文: 语音智商(SIQ)为语音理解大模型提出了基于认知科学的评估框架,通过布鲁姆分类法的三个层次进行测评,同时实现统一模型比较和错误检测功能。
English: The Speech-based Intelligence Quotient (SIQ) introduces a cognitive evaluation framework for voice-understanding LLMs, assessing them across three levels of Bloom's Taxonomy while enabling unified model comparisons and error detection.
Authors:Daniele Ghiani, Daniele Angioni, Giorgio Piras, Angelo Sotgiu, Luca Minnei, Srishti Gupta, Maura Pintor, Fabio Roli, Battista Biggio
Abstract:
Malware evolves rapidly, forcing machine learning (ML)-based detectors to adapt continuously. With antivirus vendors processing hundreds of thousands of new samples daily, datasets can grow to billions of examples, making full retraining impractical. Continual learning (CL) has emerged as a scalable alternative, enabling incremental updates without full data access while mitigating catastrophic forgetting. In this work, we analyze a critical yet overlooked issue in this context: security regression. Unlike forgetting, which manifests as a general performance drop on previously seen data, security regression captures harmful prediction changes at the sample level, such as a malware sample that was once correctly detected but evades detection after a model update. Although often overlooked, regressions pose serious risks in security-critical applications, as the silent reintroduction of previously detected threats in the system may undermine users' trust in the whole updating process. To address this issue, we formalize and quantify security regression in CL-based malware detectors and propose a regression-aware penalty to mitigate it. Specifically, we adapt Positive Congruent Training (PCT) to the CL setting, preserving prior predictive behavior in a model-agnostic manner. Experiments on the ELSA, Tesseract, and AZ-Class datasets show that our method effectively reduces regression across different CL scenarios while maintaining strong detection performance over time.
中文: 本研究针对持续学习恶意软件检测器中的安全回归问题,即模型更新导致已检测到的恶意软件重新逃避检测,提出了一种回归感知的惩罚方法,在保持检测性能的同时有效缓解了该问题。
English: The study addresses security regression in continual learning-based malware detectors, where model updates cause previously detected malware to evade detection, and proposes a regression-aware penalty method that effectively mitigates this issue while maintaining detection performance.
Authors:Ruizhe Chen, Zhiting Fan, Tianze Luo, Heqing Zou, Zhaopeng Feng, Guiyang Xie, Hansheng Zhang, Zhuochen Wang, Zuozhu Liu, Huaijian Zhang
Abstract:
Video Temporal Grounding (VTG) aims to localize relevant temporal segments in videos given natural language queries. Despite recent progress with large vision-language models (LVLMs) and instruction-tuning, existing approaches often suffer from limited temporal awareness and poor generalization. In this work, we introduce a two-stage training framework that integrates supervised fine-tuning with reinforcement learning (RL) to improve both the accuracy and robustness of VTG models. Our approach first leverages high-quality curated cold start data for SFT initialization, followed by difficulty-controlled RL to further enhance temporal localization and reasoning abilities. Comprehensive experiments on multiple VTG benchmarks demonstrate that our method consistently outperforms existing models, particularly in challenging and open-domain scenarios. We conduct an in-depth analysis of training strategies and dataset curation, highlighting the importance of both high-quality cold start data and difficulty-controlled RL. To facilitate further research and industrial adoption, we release all intermediate datasets, models, and code to the community.
中文: 本研究提出了一种结合监督微调与强化学习的两阶段训练框架,通过增强时序感知和泛化能力,在多个视频时序定位基准上实现了优越性能。
English: This study introduces a two-stage training framework combining supervised fine-tuning and reinforcement learning to enhance video temporal grounding models, achieving superior performance across benchmarks through improved temporal awareness and generalization.
Authors:Ko Watanabe, Stanislav Frolov, Adriano Lucieri, Andreas Dengel
Abstract:
Recent advancements in Deep Learning and its application on the edge hold great potential for the revolution of routine screenings for skin cancers like Melanoma. Along with the anticipated benefits of this technology, potential dangers arise from unforseen and inherent biases. Thus, assessing and improving the fairness of such systems is of utmost importance. A key challenge in fairness assessment is to ensure that the evaluation dataset is sufficiently representative of different Personal Identifiable Information (PII) (sex, age, and race) and other minority groups. Against the backdrop of this challenge, this study leverages the state-of-the-art Generative AI (GenAI) LightningDiT model to assess the fairness of publicly available melanoma classifiers. The results suggest that fairness assessment using highly realistic synthetic data is a promising direction. Yet, our findings indicate that verifying fairness becomes difficult when the melanoma-detection model used for evaluation is trained on data that differ from the dataset underpinning the synthetic images. Nonetheless, we propose that our approach offers a valuable new avenue for employing synthetic data to gauge and enhance fairness in medical-imaging GenAI systems.
中文: 深度学习在皮肤癌筛查中的最新进展引发了关于偏见的担忧,本研究利用生成式人工智能进行公平性评估,结果显示前景广阔,但在评估模型与合成图像数据源不同时面临挑战。
English: Recent advances in deep learning for skin cancer screening raise concerns about bias, prompting this study to use Generative AI for fairness assessment, which shows promise but faces challenges when evaluation models are trained on data differing from the synthetic image sources.
Authors:Yuqing Lan, Chenyang Zhu, Shuaifeng Zhi, Jiazhao Zhang, Zhoufeng Wang, Renjiao Yi, Yijie Wang, Kai Xu
Abstract:
The introduction of the neural implicit representation has notably propelled the advancement of online dense reconstruction techniques. Compared to traditional explicit representations, such as TSDF, it improves the mapping completeness and memory efficiency. However, the lack of reconstruction details and the time-consuming learning of neural representations hinder the widespread application of neural-based methods to large-scale online reconstruction. We introduce RemixFusion, a novel residual-based mixed representation for scene reconstruction and camera pose estimation dedicated to high-quality and large-scale online RGB-D reconstruction. In particular, we propose a residual-based map representation comprised of an explicit coarse TSDF grid and an implicit neural module that produces residuals representing fine-grained details to be added to the coarse grid. Such mixed representation allows for detail-rich reconstruction with bounded time and memory budget, contrasting with the overly-smoothed results by the purely implicit representations, thus paving the way for high-quality camera tracking. Furthermore, we extend the residual-based representation to handle multi-frame joint pose optimization via bundle adjustment (BA). In contrast to the existing methods, which optimize poses directly, we opt to optimize pose changes. Combined with a novel technique for adaptive gradient amplification, our method attains better optimization convergence and global optimality. Furthermore, we adopt a local moving volume to factorize the mixed scene representation with a divide-and-conquer design to facilitate efficient online learning in our residual-based framework. Extensive experiments demonstrate that our method surpasses all state-of-the-art ones, including those based either on explicit or implicit representations, in terms of the accuracy of both mapping and tracking on large-scale scenes.
Chinese: RemixFusion提出了一种基于残差的混合表示法,结合显式TSDF网格与隐式神经模块,实现了高质量、大规模的在线RGB-D重建和相机姿态估计,在精度和效率上均超越了现有最优方法。
English: RemixFusion introduces a residual-based mixed representation combining explicit TSDF grids with implicit neural modules to achieve high-quality, large-scale online RGB-D reconstruction and camera pose estimation, surpassing state-of-the-art methods in accuracy and efficiency.
Authors:Yu Li, Zhuoshi Pan, Honglin Lin, Mengyuan Sun, Conghui He, Lijun Wu
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of LLMs. Existing research has predominantly concentrated on isolated reasoning domains such as mathematical problem-solving, coding tasks, or logical reasoning. However, real world reasoning scenarios inherently demand an integrated application of multiple cognitive skills. Despite this, the interplay among these reasoning skills under reinforcement learning remains poorly understood. To bridge this gap, we present a systematic investigation of multi-domain reasoning within the RLVR framework, explicitly focusing on three primary domains: mathematical reasoning, code generation, and logical puzzle solving. We conduct a comprehensive study comprising four key components: (1) Leveraging the GRPO algorithm and the Qwen-2.5-7B model family, our study thoroughly evaluates the models' in-domain improvements and cross-domain generalization capabilities when trained on single-domain datasets. (2) Additionally, we examine the intricate interactions including mutual enhancements and conflicts that emerge during combined cross-domain training. (3) To further understand the influence of SFT on RL, we also analyze and compare performance differences between base and instruct models under identical RL configurations. (4) Furthermore, we delve into critical RL training details, systematically exploring the impacts of curriculum learning strategies, variations in reward design, and language-specific factors. Through extensive experiments, our results offer significant insights into the dynamics governing domain interactions, revealing key factors influencing both specialized and generalizable reasoning performance. These findings provide valuable guidance for optimizing RL methodologies to foster comprehensive, multi-domain reasoning capabilities in LLMs.
中文: 本研究系统探索了可验证奖励强化学习框架下的多领域推理,揭示了数学、编程和逻辑领域间的关键相互作用与优化策略,为提升大语言模型的专项与泛化推理能力提供了重要指导。
English: This study systematically investigates multi-domain reasoning within the Reinforcement Learning with Verifiable Rewards framework, revealing key interactions and optimization strategies across mathematical, coding, and logical domains to enhance both specialized and generalizable reasoning in LLMs.
Authors:Xi Xiao, Yunbei Zhang, Janet Wang, Lin Zhao, Yuxiang Wei, Hengjia Li, Yanshu Li, Xiao Wang, Swalpa Kumar Roy, Hao Xu, Tianyang Wang
Abstract:
Accurate road damage detection is crucial for timely infrastructure maintenance and public safety, but existing vision-only datasets and models lack the rich contextual understanding that textual information can provide. To address this limitation, we introduce RoadBench, the first multimodal benchmark for comprehensive road damage understanding. This dataset pairs high resolution images of road damages with detailed textual descriptions, providing a richer context for model training. We also present RoadCLIP, a novel vision language model that builds upon CLIP by integrating domain specific enhancements. It includes a disease aware positional encoding that captures spatial patterns of road defects and a mechanism for injecting road-condition priors to refine the model's understanding of road damages. We further employ a GPT driven data generation pipeline to expand the image to text pairs in RoadBench, greatly increasing data diversity without exhaustive manual annotation. Experiments demonstrate that RoadCLIP achieves state of the art performance on road damage recognition tasks, significantly outperforming existing vision-only models by 19.2%. These results highlight the advantages of integrating visual and textual information for enhanced road condition analysis, setting new benchmarks for the field and paving the way for more effective infrastructure monitoring through multimodal learning.
中文摘要:本研究提出了首个多模态道路损伤基准数据集RoadBench和基于CLIP改进的RoadCLIP模型,通过融合空间编码与道路状况先验知识,在道路损伤识别任务中以19.2%的优势显著超越纯视觉模型,开创了多模态基础设施监测新范式。
English Summary: The study introduces RoadBench, a multimodal dataset combining road damage images with textual descriptions, and RoadCLIP, an enhanced vision-language model that achieves state-of-the-art performance by integrating spatial encodings and road-condition priors, outperforming vision-only models by 19.2%.
Authors:Shanghai AI Lab, :, Xiaoyang Chen, Yunhao Chen, Zeren Chen, Zhiyun Chen, Hanyun Cui, Yawen Duan, Jiaxuan Guo, Qi Guo, Xuhao Hu, Hong Huang, Lige Huang, Chunxiao Li, Juncheng Li, Qihao Lin, Dongrui Liu, Xinmin Liu, Zicheng Liu, Chaochao Lu, Xiaoya Lu, Jingjing Qu, Qibing Ren, Jing Shao, Jingwei Shi, Jingwei Sun, Peng Wang, Weibing Wang, Jia Xu, Lewen Yan, Xiao Yu, Yi Yu, Boxuan Zhang, Jie Zhang, Weichen Zhang, Zhijie Zheng, Tianyi Zhou, Bowen Zhou
Abstract:
To understand and identify the unprecedented risks posed by rapidly advancing artificial intelligence (AI) models, this report presents a comprehensive assessment of their frontier risks. Drawing on the E-T-C analysis (deployment environment, threat source, enabling capability) from the Frontier AI Risk Management Framework (v1.0) (SafeWork-F1-Framework), we identify critical risks in seven areas: cyber offense, biological and chemical risks, persuasion and manipulation, uncontrolled autonomous AI R\&D, strategic deception and scheming, self-replication, and collusion. Guided by the "AI-$45^\circ$ Law," we evaluate these risks using "red lines" (intolerable thresholds) and "yellow lines" (early warning indicators) to define risk zones: green (manageable risk for routine deployment and continuous monitoring), yellow (requiring strengthened mitigations and controlled deployment), and red (necessitating suspension of development and/or deployment). Experimental results show that all recent frontier AI models reside in green and yellow zones, without crossing red lines. Specifically, no evaluated models cross the yellow line for cyber offense or uncontrolled AI R\&D risks. For self-replication, and strategic deception and scheming, most models remain in the green zone, except for certain reasoning models in the yellow zone. In persuasion and manipulation, most models are in the yellow zone due to their effective influence on humans. For biological and chemical risks, we are unable to rule out the possibility of most models residing in the yellow zone, although detailed threat modeling and in-depth assessment are required to make further claims. This work reflects our current understanding of AI frontier risks and urges collective action to mitigate these challenges.
中文: 本报告采用E-T-C框架和AI-45°定律评估前沿人工智能风险,发现当前模型主要处于可控的绿色和黄色区域,未突破关键红线,同时呼吁采取集体行动应对说服力和生物风险等潜在威胁。
English: This report assesses frontier AI risks using the E-T-C framework and AI-45° Law, finding current models primarily in manageable green and yellow zones without crossing critical red lines, while urging collective action to address potential threats in areas like persuasion and biological risks.
Authors:Penghui Yang, Chendong Zhao, Bijun Tang, Zhonghan Zhang, Xinrun Wang, Yanchen Deng, Yuhao Lu, Cuntai Guan, Zheng Liu, Bo An
Abstract:
Alloy discovery is central to advancing modern industry but remains hindered by the vastness of compositional design space and the costly validation. Here, we present AutoMAT, a hierarchical and autonomous framework grounded in and validated by experiments, which integrates large language models, automated CALPHAD-based simulations, and AI-driven search to accelerate alloy design. Spanning the entire pipeline from ideation to validation, AutoMAT achieves high efficiency, accuracy, and interpretability without the need for manually curated large datasets. In a case study targeting a lightweight, high-strength alloy, AutoMAT identifies a titanium alloy with 8.1% lower density and comparable yield strength relative to the state-of-the-art reference, achieving the highest specific strength among all comparisons. In a second case targeting high-yield-strength high-entropy alloys, AutoMAT achieves a 28.2% improvement in yield strength over the base alloy. In both cases, AutoMAT reduces the discovery timeline from years to weeks, illustrating its potential as a scalable and versatile platform for next-generation alloy design.
中文:AutoMAT是一种创新的分层框架,融合人工智能技术与自动化模拟,能够快速发现高性能合金,将研发周期从数年缩短至数周,同时实现卓越的材料性能。
English: AutoMAT is an innovative hierarchical framework that integrates AI technologies and automated simulations to rapidly discover high-performance alloys, significantly reducing development time from years to weeks while achieving superior material properties.
Authors:Tianneng Shi, Kaijie Zhu, Zhun Wang, Yuqi Jia, Will Cai, Weida Liang, Haonan Wang, Hend Alzahrani, Joshua Lu, Kenji Kawaguchi, Basel Alomair, Xuandong Zhao, William Yang Wang, Neil Gong, Wenbo Guo, Dawn Song
Abstract:
Despite their potential, recent research has demonstrated that LLM agents are vulnerable to prompt injection attacks, where malicious prompts are injected into the agent's input, causing it to perform an attacker-specified task rather than the intended task provided by the user. In this paper, we present PromptArmor, a simple yet effective defense against prompt injection attacks. Specifically, PromptArmor prompts an off-the-shelf LLM to detect and remove potential injected prompts from the input before the agent processes it. Our results show that PromptArmor can accurately identify and remove injected prompts. For example, using GPT-4o, GPT-4.1, or o4-mini, PromptArmor achieves both a false positive rate and a false negative rate below 1% on the AgentDojo benchmark. Moreover, after removing injected prompts with PromptArmor, the attack success rate drops to below 1%. We also demonstrate PromptArmor's effectiveness against adaptive attacks and explore different strategies for prompting an LLM. We recommend that PromptArmor be adopted as a standard baseline for evaluating new defenses against prompt injection attacks.
Chinese: 最新研究表明,LLM智能体易受提示注入攻击,而PromptArmor通过调用现成LLM检测并清除输入中的恶意提示,有效防御此类攻击,将攻击成功率降至1%以下,且误报率和漏报率极低。
English: Recent research reveals that LLM agents are susceptible to prompt injection attacks, but PromptArmor effectively defends against them by using an off-the-shelf LLM to detect and remove malicious prompts, reducing attack success rates to below 1% with minimal false positives and negatives.
Authors:Xi Xiao, Aristeidis Tsaris, Anika Tabassum, John Lagergren, Larry M. York, Tianyang Wang, Xiao Wang
Abstract:
Hyperspectral imaging (HSI) captures hundreds of narrow, contiguous wavelength bands, making it a powerful tool in biology, agriculture, and environmental monitoring. However, interpreting Vision Transformers (ViTs) in this setting remains largely unexplored due to two key challenges: (1) existing saliency methods struggle to capture meaningful spectral cues, often collapsing attention onto the class token, and (2) full-spectrum ViTs are computationally prohibitive for interpretability, given the high-dimensional nature of HSI data. We present FOCUS, the first framework that enables reliable and efficient spatial-spectral interpretability for frozen ViTs. FOCUS introduces two core components: class-specific spectral prompts that guide attention toward semantically meaningful wavelength groups, and a learnable [SINK] token trained with an attraction loss to absorb noisy or redundant attention. Together, these designs make it possible to generate stable and interpretable 3D saliency maps and spectral importance curves in a single forward pass, without any gradient backpropagation or backbone modification. FOCUS improves band-level IoU by 15 percent, reduces attention collapse by over 40 percent, and produces saliency results that align closely with expert annotations. With less than 1 percent parameter overhead, our method makes high-resolution ViT interpretability practical for real-world hyperspectral applications, bridging a long-standing gap between black-box modeling and trustworthy HSI decision-making.
Chinese: FOCUS是一种创新框架,通过引入类别特定光谱提示和可学习的[SINK]标记,无需梯度反向传播即可为高光谱成像中的视觉变换器生成稳定的显著性图谱,实现了高效可靠的空谱可解释性。
English: FOCUS is a novel framework that enables efficient and reliable spatial-spectral interpretability for Vision Transformers in hyperspectral imaging by using class-specific spectral prompts and a learnable [SINK] token to generate stable saliency maps without gradient backpropagation.
Authors:Ethan Li, Anders Boesen Lindbo Larsen, Chen Zhang, Xiyou Zhou, Jun Qin, Dian Ang Yap, Narendran Raghavan, Xuankai Chang, Margit Bowler, Eray Yildiz, John Peebles, Hannah Gillis Coleman, Matteo Ronchi, Peter Gray, Keen You, Anthony Spalvieri-Kruse, Ruoming Pang, Reed Li, Yuli Yang, Emad Soroush, Zhiyun Lu, Crystal Xiao, Rong Situ, Jordan Huffaker, David Griffiths, Zaid Ahmed, Peng Zhang, Daniel Parilla, Asaf Liberman, Jennifer Mallalieu, Parsa Mazaheri, Qibin Chen, Manjot Bilkhu, Aonan Zhang, Eric Wang, Dave Nelson, Michael FitzMaurice, Thomas Voice, Jeremy Liu, Josh Shaffer, Shiwen Zhao, Prasanth Yadla, Farzin Rasteh, Pengsheng Guo, Arsalan Farooq, Jeremy Snow, Stephen Murphy, Tao Lei, Minsik Cho, George Horrell, Sam Dodge, Lindsay Hislop, Sumeet Singh, Alex Dombrowski, Aiswarya Raghavan, Sasha Sirovica, Mandana Saebi, Faye Lao, Max Lam, TJ Lu, Zhaoyang Xu, Karanjeet Singh, Marc Kirchner, David Mizrahi, Rajat Arora, Haotian Zhang, Henry Mason, Lawrence Zhou, Yi Hua, Ankur Jain, Felix Bai, Joseph Astrauskas, Floris Weers, Josh Gardner, Mira Chiang, Yi Zhang, Pulkit Agrawal, Tony Sun, Quentin Keunebroek, Matthew Hopkins, Bugu Wu, Tao Jia, Chen Chen, Xingyu Zhou, Nanzhu Wang, Peng Liu, Ruixuan Hou, Rene Rauch, Yuan Gao, Afshin Dehghan, Jonathan Janke, Zirui Wang, Cha Chen, Xiaoyi Ren, Feng Nan, Josh Elman, Dong Yin, Yusuf Goren, Jeff Lai, Yiran Fei, Syd Evans, Muyang Yu, Guoli Yin, Yi Qin, Erin Feldman, Isha Garg, Aparna Rajamani, Karla Vega, Walker Cheng, TJ Collins, Hans Han, Raul Rea Menacho, Simon Yeung, Sophy Lee, Phani Mutyala, Ying-Chang Cheng, Zhe Gan, Sprite Chu, Justin Lazarow, Alessandro Pappalardo, Federico Scozzafava, Jing Lu, Erik Daxberger, Laurent Duchesne, Jen Liu, David Güera, Stefano Ligas, Mary Beth Kery, Brent Ramerth, Ciro Sannino, Marcin Eichner, Haoshuo Huang, Rui Qian, Moritz Schwarzer-Becker, David Riazati, Mingfei Gao, Bailin Wang, Jack Cackler, Yang Lu, Ransen Niu, John Dennison, Guillaume Klein, Jeffrey Bigham, Deepak Gopinath, Navid Shiee, Darren Botten, Guillaume Tartavel, Alex Guillen Garcia, Sam Xu, Victoria MönchJuan Haladjian, Zi-Yi Dou, Matthias Paulik, Adolfo Lopez Mendez, Zhen Li, Hong-You Chen, Chao Jia, Dhaval Doshi, Zhengdong Zhang, Raunak Manjani, Aaron Franklin, Zhile Ren, David Chen, Artsiom Peshko, Nandhitha Raghuram, Hans Hao, Jiulong Shan, Kavya Nerella, Ramsey Tantawi, Vivek Kumar, Saiwen Wang, Brycen Wershing, Bhuwan Dhingra, Dhruti Shah, Ob Adaranijo, Xin Zheng, Tait Madsen, Hadas Kotek, Chang Liu, Yin Xia, Hanli Li, Suma Jayaram, Yanchao Sun, Ahmed Fakhry, Vasileios Saveris, Dustin Withers, Yanghao Li, Alp Aygar, Andres Romero Mier Y Teran, Kaiwei Huang, Mark Lee, Xiujun Li, Yuhong Li, Tyler Johnson, Jay Tang, Joseph Yitan Cheng, Futang Peng, Andrew Walkingshaw, Lucas Guibert, Abhishek Sharma, Cheng Shen, Piotr Maj, Yasutaka Tanaka, You-Cyuan Jhang, Vivian Ma, Tommi Vehvilainen, Kelvin Zou, Jeff Nichols, Matthew Lei, David Qiu, Yihao Qian, Gokul Santhanam, Wentao Wu, Yena Han, Dominik Moritz, Haijing Fu, Mingze Xu, Vivek Rathod, Jian Liu, Louis D'hauwe, Qin Ba, Haitian Sun, Haoran Yan, Philipp Dufter, Anh Nguyen, Yihao Feng, Emma Wang, Keyu He, Rahul Nair, Sanskruti Shah, Jiarui Lu, Patrick Sonnenberg, Jeremy Warner, Yuanzhi Li, Bowen Pan, Ziyi Zhong, Joe Zhou, Sam Davarnia, Olli Saarikivi, Irina Belousova, Rachel Burger, Shang-Chen Wu, Di Feng, Bas Straathof, James Chou, Yuanyang Zhang, Marco Zuliani, Eduardo Jimenez, Abhishek Sundararajan, Xianzhi Du, Chang Lan, Nilesh Shahdadpuri, Peter Grasch, Sergiu Sima, Josh Newnham, Varsha Paidi, Jianyu Wang, Kaelen Haag, Alex Braunstein, Daniele Molinari, Richard Wei, Brenda Yang, Nicholas Lusskin, Joanna Arreaza-Taylor, Meng Cao, Nicholas Seidl, Simon Wang, Jiaming Hu, Yiping Ma, Mengyu Li, Kieran Liu, Hang Su, Sachin Ravi, Chong Wang, Xin Wang, Kevin Smith, Haoxuan You, Binazir Karimzadeh, Rui Li, Jinhao Lei, Wei Fang, Alec Doane, Sam Wiseman, Ismael Fernandez, Jane Li, Andrew Hansen, Javier Movellan, Christopher Neubauer, Hanzhi Zhou, Chris Chaney, Nazir Kamaldin, Valentin Wolf, Fernando Bermúdez-Medina, Joris Pelemans, Peter Fu, Howard Xing, Xiang Kong, Wayne Shan, Gabriel Jacoby-Cooper, Dongcai Shen, Tom Gunter, Guillaume Seguin, Fangping Shi, Shiyu Li, Yang Xu, Areeba Kamal, Dan Masi, Saptarshi Guha, Qi Zhu, Jenna Thibodeau, Changyuan Zhang, Rebecca Callahan, Charles Maalouf, Wilson Tsao, Boyue Li, Qingqing Cao, Naomy Sabo, Cheng Leong, Yi Wang, Anupama Mann Anupama, Colorado Reed, Kenneth Jung, Zhifeng Chen, Mohana Prasad Sathya Moorthy, Yifei He, Erik Hornberger, Devi Krishna, Senyu Tong, Michael, Lee, David Haldimann, Yang Zhao, Bowen Zhang, Chang Gao, Chris Bartels, Sushma Rao, Nathalie Tran, Simon Lehnerer, Co Giang, Patrick Dong, Junting Pan, Biyao Wang, Dongxu Li, Mehrdad Farajtabar, Dongseong Hwang, Grace Duanmu, Eshan Verma, Sujeeth Reddy, Qi Shan, Hongbin Gao, Nan Du, Pragnya Sridhar, Forrest Huang, Yingbo Wang, Nikhil Bhendawade, Diane Zhu, Sai Aitharaju, Fred Hohman, Lauren Gardiner, Chung-Cheng Chiu, Yinfei Yang, Alper Kokmen, Frank Chu, Ke Ye, Kaan Elgin, Oron Levy, John Park, Donald Zhang, Eldon Schoop, Nina Wenzel, Michael Booker, Hyunjik Kim, Chinguun Erdenebileg, Nan Dun, Eric Liang Yang, Priyal Chhatrapati, Vishaal Mahtani, Haiming Gang, Kohen Chia, Deepa Seshadri, Donghan Yu, Yan Meng, Kelsey Peterson, Zhen Yang, Yongqiang Wang, Carina Peng, Doug Kang, Anuva Agarwal, Albert Antony, Juan Lao Tebar, Albin Madappally Jose, Regan Poston, Andy De Wang, Gerard Casamayor, Elmira Amirloo, Violet Yao, Wojciech Kryscinski, Kun Duan, Lezhi L
Abstract:
We introduce two multilingual, multimodal foundation language models that power Apple Intelligence features across Apple devices and services: i a 3B-parameter on-device model optimized for Apple silicon through architectural innovations such as KV-cache sharing and 2-bit quantization-aware training; and ii a scalable server model built on a novel Parallel-Track Mixture-of-Experts PT-MoE transformer that combines track parallelism, mixture-of-experts sparse computation, and interleaved global-local attention to deliver high quality with competitive cost on Apple's Private Cloud Compute platform. Both models are trained on large-scale multilingual and multimodal datasets sourced via responsible web crawling, licensed corpora, and high-quality synthetic data, then further refined with supervised fine-tuning and reinforcement learning on a new asynchronous platform. The resulting models support several additional languages while understanding images and executing tool calls. In public benchmarks and human evaluations, both the server model and the on-device model match or surpass comparably sized open baselines.
A new Swift-centric Foundation Models framework exposes guided generation, constrained tool calling, and LoRA adapter fine-tuning, allowing developers to integrate these capabilities with a few lines of code. The latest advancements in Apple Intelligence models are grounded in our Responsible AI approach with safeguards like content filtering and locale-specific evaluation, as well as our commitment to protecting our users' privacy with innovations like Private Cloud Compute.
Chinese: 苹果推出了两个多语言多模态基础模型——一个30亿参数的设备端模型和一个可扩展的服务器模型,在公共基准和人工评估中均优于同类开放基准,同时支持更多语言、图像理解和工具调用,并基于负责任的人工智能方法和用户隐私保护构建。
English: Apple has developed two multilingual and multimodal foundation models—a 3B-parameter on-device model and a scalable server model—that outperform comparable open benchmarks while supporting additional languages, image understanding, and tool calls, all built with a focus on responsible AI and user privacy.
Authors:Ruihan Yang, Qinxi Yu, Yecheng Wu, Rui Yan, Borui Li, An-Chieh Cheng, Xueyan Zou, Yunhao Fang, Xuxin Cheng, Ri-Zhao Qiu, Hongxu Yin, Sifei Liu, Song Han, Yao Lu, Xiaolong Wang
Abstract:
Real robot data collection for imitation learning has led to significant advancements in robotic manipulation. However, the requirement for robot hardware in the process fundamentally constrains the scale of the data. In this paper, we explore training Vision-Language-Action (VLA) models using egocentric human videos. The benefit of using human videos is not only for their scale but more importantly for the richness of scenes and tasks. With a VLA trained on human video that predicts human wrist and hand actions, we can perform Inverse Kinematics and retargeting to convert the human actions to robot actions. We fine-tune the model using a few robot manipulation demonstrations to obtain the robot policy, namely EgoVLA. We propose a simulation benchmark called Ego Humanoid Manipulation Benchmark, where we design diverse bimanual manipulation tasks with demonstrations. We fine-tune and evaluate EgoVLA with Ego Humanoid Manipulation Benchmark and show significant improvements over baselines and ablate the importance of human data. Videos can be found on our website: https://rchalyang.github.io/EgoVLA
Chinese: 本文提出EgoVLA模型,通过人类视频训练视觉-语言-动作模型预测动作,再经逆运动学转换为机器人动作,并利用少量机器人演示进行微调,显著提升了操作性能。
English: This paper introduces EgoVLA, a Vision-Language-Action model trained on human videos to predict actions, which are then converted to robot actions through inverse kinematics and fine-tuned with minimal robot demonstrations for improved manipulation performance.
Authors:Hao Chen, Takuya Kiyokawa, Zhengtao Hu, Weiwei Wan, Kensuke Harada
Abstract:
Grasping unknown objects from a single view has remained a challenging topic in robotics due to the uncertainty of partial observation. Recent advances in large-scale models have led to benchmark solutions such as GraspNet-1Billion. However, such learning-based approaches still face a critical limitation in performance robustness for their sensitivity to sensing noise and environmental changes. To address this bottleneck in achieving highly generalized grasping, we abandon the traditional learning framework and introduce a new perspective: similarity matching, where similar known objects are utilized to guide the grasping of unknown target objects. We newly propose a method that robustly achieves unknown-object grasping from a single viewpoint through three key steps: 1) Leverage the visual features of the observed object to perform similarity matching with an existing database containing various object models, identifying potential candidates with high similarity; 2) Use the candidate models with pre-existing grasping knowledge to plan imitative grasps for the unknown target object; 3) Optimize the grasp quality through a local fine-tuning process. To address the uncertainty caused by partial and noisy observation, we propose a multi-level similarity matching framework that integrates semantic, geometric, and dimensional features for comprehensive evaluation. Especially, we introduce a novel point cloud geometric descriptor, the C-FPFH descriptor, which facilitates accurate similarity assessment between partial point clouds of observed objects and complete point clouds of database models. In addition, we incorporate the use of large language models, introduce the semi-oriented bounding box, and develop a novel point cloud registration approach based on plane detection to enhance matching accuracy under single-view conditions. Videos are available at https://youtu.be/qQDIELMhQmk.
中文: 本文提出一种基于相似性匹配的方法,通过利用已知物体模型进行模仿抓取规划与优化,结合多层级特征匹配与新型点云描述符,有效解决了单视角下未知物体抓取的鲁棒性问题。
English: This paper introduces a similarity matching approach to robustly grasp unknown objects from single-view observations by leveraging known object models for imitative grasp planning and optimization, addressing limitations of learning-based methods through multi-level feature integration and novel descriptors.
Authors:Anna Lunghi, Matteo Castiglioni, Alberto Marchesi
Abstract:
Bilateral trade is a central problem in algorithmic economics, and recent work has explored how to design trading mechanisms using no-regret learning algorithms. However, no-regret learning is impossible when budget balance has to be enforced at each time step. Bernasconi et al. [Ber+24] show how this impossibility can be circumvented by relaxing the budget balance constraint to hold only globally over all time steps. In particular, they design an algorithm achieving regret of the order of $\tilde O(T^{3/4})$ and provide a lower bound of $Ω(T^{5/7})$.
In this work, we interpolate between these two extremes by studying how the optimal regret rate varies with the allowed violation of the global budget balance constraint. Specifically, we design an algorithm that, by violating the constraint by at most $T^β$ for any given $β\in [\frac{3}{4}, \frac{6}{7}]$, attains regret $\tilde O(T^{1 - β/3})$. We complement this result with a matching lower bound, thus fully characterizing the trade-off between regret and budget violation. Our results show that both the $\tilde O(T^{3/4})$ upper bound in the global budget balance case and the $Ω(T^{5/7})$ lower bound under unconstrained budget balance violation obtained by Bernasconi et al. [Ber+24] are tight.
中文: 本研究探讨了双边贸易机制中遗憾与预算平衡违反之间的权衡,提出一种算法通过允许受控的预算违反达到最优遗憾率,并通过匹配下界证实了先前结果的紧性。
English: This study explores the trade-off between regret and budget balance violation in bilateral trade mechanisms, presenting an algorithm that achieves optimal regret rates by allowing controlled budget violations, with matching lower bounds confirming the tightness of prior results.
Authors:Zhuoshi Pan, Qizhi Pei, Yu Li, Qiyao Sun, Zinan Tang, H. Vicky Zhao, Conghui He, Lijun Wu
Abstract:
Recent Large Reasoning Models (LRMs) have achieved remarkable progress on task-specific benchmarks, yet their evaluation methods remain constrained by isolated problem-solving paradigms. Existing benchmarks predominantly assess single-question reasoning through sequential testing, resulting critical limitations: (1) vulnerability to data contamination and less challenging (e.g., DeepSeek-R1 achieves 97.0% on MATH500), forcing costly creation of new questions with large human efforts, (2) failure to evaluate models under multi-context pressure, a key requirement for real-world deployment. To bridge this gap, we present REST (Reasoning Evaluation through Simultaneous Testing), a stress-testing framework that exposes LRMs to multiple problems simultaneously. Beyond basic reasoning, REST evaluates several under-tested capabilities: contextual priority allocation, cross-problem interference resistance, and dynamic cognitive load management. Our evaluation reveals several striking findings: Even state-of-the-art (SOTA) models like DeepSeek-R1 exhibit substantial performance degradation under stress testing. Crucially, REST demonstrates stronger discriminative power than existing benchmarks, revealing pronounced performance differences among models that exhibit similar, near-ceiling performance under single-question evaluations. Some key insights emerge from our analysis: (1) the "overthinking trap" is a critical factor contributing to the performance degradation; (2) the models trained with "long2short" technique preserve more accuracy of their single-problem performance under REST, outperforming standard-trained counterparts. These results establish REST as a cost-efficient, future-proof evaluation paradigm that better reflects real-world reasoning demands while reducing reliance on continuous human annotation. Code and results are available at https://opendatalab.github.io/REST.
中文:REST框架通过同时压力测试揭示大型推理模型性能显著下降,比传统基准更具区分度,同时减少对人类标注的依赖。
English: The REST framework introduces simultaneous stress testing for Large Reasoning Models, revealing significant performance degradation and stronger discriminative power than traditional benchmarks, while reducing reliance on human annotation.
Authors:Anlin Zheng, Xin Wen, Xuanyang Zhang, Chuofan Ma, Tiancai Wang, Gang Yu, Xiangyu Zhang, Xiaojuan Qi
Abstract:
Leveraging the powerful representations of pre-trained vision foundation models -- traditionally used for visual comprehension -- we explore a novel direction: building an image tokenizer directly atop such models, a largely underexplored area. Specifically, we employ a frozen vision foundation model as the encoder of our tokenizer. To enhance its effectiveness, we introduce two key components: (1) a region-adaptive quantization framework that reduces redundancy in the pre-trained features on regular 2D grids, and (2) a semantic reconstruction objective that aligns the tokenizer's outputs with the foundation model's representations to preserve semantic fidelity. Based on these designs, our proposed image tokenizer, VFMTok, achieves substantial improvements in image reconstruction and generation quality, while also enhancing token efficiency. It further boosts autoregressive (AR) generation -- achieving a gFID of 2.07 on ImageNet benchmarks, while accelerating model convergence by three times, and enabling high-fidelity class-conditional synthesis without the need for classifier-free guidance (CFG). The code will be released publicly to benefit the community.
中文: 本研究提出VFMTok图像分词器,基于预训练的冻结视觉基础模型,通过区域自适应量化和语义重建目标,显著提升了图像重建与生成质量及分词效率,在ImageNet上实现2.07的gFID指标,并将模型收敛速度加快三倍。
English: This study introduces VFMTok, an image tokenizer built on a frozen pre-trained vision foundation model, incorporating region-adaptive quantization and a semantic reconstruction objective to significantly enhance image reconstruction, generation quality, and token efficiency, achieving a gFID of 2.07 on ImageNet and accelerating model convergence by three times.
Authors:Yibo Lyu, Rui Shao, Gongwei Chen, Yijie Zhu, Weili Guan, Liqiang Nie
Abstract:
As multimedia content expands, the demand for unified multimodal retrieval (UMR) in real-world applications increases. Recent work leverages multimodal large language models (MLLMs) to tackle this task. However, their large parameter size results in high training costs and low inference efficiency. To address this, we propose PUMA: a Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval with Modality-Adaptive Learning. Our approach improves UMR from both structural and learning perspectives. (1) Structurally, we propose Layer-Pruned Self-Distillation, which prunes MLLMs by keeping only shallow layers while distilling features from dropped deep layers as teacher signals. This reduces parameters and preserves representation capability. (2) On the learning side, we introduce Modality-Adaptive Contrastive Learning Loss (MAC-Loss), which separates in-batch negatives into harder intra-modality and easier inter-modality groups based on the target modality, assigning different temperature strategies to enhance learning efficiency. Experiments show our method significantly reduces resource usage while maintaining strong performance.
中文: 针对多模态大语言模型在统一检索中的低效问题,我们提出PUMA框架,通过层级剪枝自蒸馏减少参数量,并采用模态自适应对比学习提升训练效果。
English: To address the inefficiency of multimodal large language models in unified retrieval, we propose PUMA, which employs layer-pruned self-distillation to reduce parameters and modality-adaptive contrastive learning to enhance training effectiveness.
Authors:Xi Xiao, Yunbei Zhang, Xingjian Li, Tianyang Wang, Xiao Wang, Yuxiang Wei, Jihun Hamm, Min Xu
Abstract:
Visual Prompt Tuning (VPT) has emerged as a parameter-efficient fine-tuning paradigm for vision transformers, with conventional approaches utilizing dataset-level prompts that remain the same across all input instances. We observe that this strategy results in sub-optimal performance due to high variance in downstream datasets. To address this challenge, we propose Visual Instance-aware Prompt Tuning (ViaPT), which generates instance-aware prompts based on each individual input and fuses them with dataset-level prompts, leveraging Principal Component Analysis (PCA) to retain important prompting information. Moreover, we reveal that VPT-Deep and VPT-Shallow represent two corner cases based on a conceptual understanding, in which they fail to effectively capture instance-specific information, while random dimension reduction on prompts only yields performance between the two extremes. Instead, ViaPT overcomes these limitations by balancing dataset-level and instance-level knowledge, while reducing the amount of learnable parameters compared to VPT-Deep. Extensive experiments across 34 diverse datasets demonstrate that our method consistently outperforms state-of-the-art baselines, establishing a new paradigm for analyzing and optimizing visual prompts for vision transformers.
中文: 视觉实例感知提示调优(ViaPT)通过主成分分析生成结合数据集级提示的实例感知提示,在34个数据集上超越现有方法并减少参数量。
English: Visual Instance-aware Prompt Tuning (ViaPT) introduces instance-specific prompts combined with dataset-level prompts using PCA, outperforming existing methods across 34 datasets while reducing parameters.
Authors:Siting Wang, Minnan Pei, Luoyang Sun, Cheng Deng, Kun Shao, Zheng Tian, Haifeng Zhang, Jun Wang
Abstract:
Humans can directly imagine and manipulate visual images in their minds, a capability known as spatial visualization. While multi-modal Large Language Models (MLLMs) support imagination-based reasoning, spatial visualization remains insufficiently evaluated, typically embedded within broader mathematical and logical assessments. Existing evaluations often rely on IQ tests or math competitions that may overlap with training data, compromising assessment reliability. To this end, we introduce SpatialViz-Bench, a comprehensive multi-modal benchmark for spatial visualization with 12 tasks across 4 sub-abilities, comprising 1,180 automatically generated problems. Our evaluation of 33 state-of-the-art MLLMs not only reveals wide performance variations and demonstrates the benchmark's strong discriminative power, but also uncovers counter-intuitive findings: models show difficulty perception misaligned with human intuition, exhibit dramatic 2Dto-3D performance cliffs, default to formulaic derivation over visualization, and paradoxically suffer performance degradation from Chain-of-Thought prompting in open-source models. Through statistical and qualitative analysis of error types, SpatialViz-Bench demonstrates that state-of-the-art MLLMs continue to exhibit deficiencies in spatial visualization tasks, thereby addressing a significant lacuna in the field. The benchmark data and evaluation code are publicly available.
Chinese: 本文提出了SpatialViz-Bench这一评估多模态大模型空间可视化能力的综合基准,发现即使是最先进模型仍存在明显缺陷和反直觉行为,填补了该领域的重要空白。
English: This paper introduces SpatialViz-Bench, a comprehensive benchmark for evaluating spatial visualization in multi-modal LLMs, revealing significant performance gaps and counter-intuitive model behaviors despite current advancements.
Authors:Kaiqu Liang, Haimin Hu, Xuandong Zhao, Dawn Song, Thomas L. Griffiths, Jaime Fernández Fisac
Abstract:
Bullshit, as conceptualized by philosopher Harry Frankfurt, refers to statements made without regard to their truth value. While previous work has explored large language model (LLM) hallucination and sycophancy, we propose machine bullshit as an overarching conceptual framework that can allow researchers to characterize the broader phenomenon of emergent loss of truthfulness in LLMs and shed light on its underlying mechanisms. We introduce the Bullshit Index, a novel metric quantifying LLMs' indifference to truth, and propose a complementary taxonomy analyzing four qualitative forms of bullshit: empty rhetoric, paltering, weasel words, and unverified claims. We conduct empirical evaluations on the Marketplace dataset, the Political Neutrality dataset, and our new BullshitEval benchmark (2,400 scenarios spanning 100 AI assistants) explicitly designed to evaluate machine bullshit. Our results demonstrate that model fine-tuning with reinforcement learning from human feedback (RLHF) significantly exacerbates bullshit and inference-time chain-of-thought (CoT) prompting notably amplify specific bullshit forms, particularly empty rhetoric and paltering. We also observe prevalent machine bullshit in political contexts, with weasel words as the dominant strategy. Our findings highlight systematic challenges in AI alignment and provide new insights toward more truthful LLM behavior.
中文摘要:本研究提出“机器胡诌”作为分析大语言模型忽视真理的总体框架,通过实证评估证明人类反馈强化学习微调和思维链提示会加剧此类行为,尤其在政治语境中最为显著。
English Summary: This study introduces "machine bullshit" as a conceptual framework to analyze LLMs' indifference to truth, proposing a Bullshit Index and taxonomy while demonstrating through empirical evaluations that RLHF fine-tuning and CoT prompting exacerbate such behaviors, particularly in political contexts.
Authors:Weihao Tan, Changjiu Jiang, Yu Duan, Mingcong Lei, Jiageng Li, Yitian Hong, Xinrun Wang, Bo An
Abstract:
Autonomous agents navigating human society must master both production activities and social interactions, yet existing benchmarks rarely evaluate these skills simultaneously. To bridge this gap, we introduce StarDojo, a novel benchmark based on Stardew Valley, designed to assess AI agents in open-ended production-living simulations. In StarDojo, agents are tasked to perform essential livelihood activities such as farming and crafting, while simultaneously engaging in social interactions to establish relationships within a vibrant community. StarDojo features 1,000 meticulously curated tasks across five key domains: farming, crafting, exploration, combat, and social interactions. Additionally, we provide a compact subset of 100 representative tasks for efficient model evaluation. The benchmark offers a unified, user-friendly interface that eliminates the need for keyboard and mouse control, supports all major operating systems, and enables the parallel execution of multiple environment instances, making it particularly well-suited for evaluating the most capable foundation agents, powered by multimodal large language models (MLLMs). Extensive evaluations of state-of-the-art MLLMs agents demonstrate substantial limitations, with the best-performing model, GPT-4.1, achieving only a 12.7% success rate, primarily due to challenges in visual understanding, multimodal reasoning and low-level manipulation. As a user-friendly environment and benchmark, StarDojo aims to facilitate further research towards robust, open-ended agents in complex production-living environments.
中文: StarDojo是基于《星露谷物语》的新基准,旨在评估AI代理在开放式生产生活模拟中的生产活动与社交互动能力,结果显示当前最优模型GPT-4.1成功率仅为12.7%,突显其在视觉理解和多模态推理方面的重大局限。
English: StarDojo is a new benchmark based on Stardew Valley that evaluates AI agents' abilities in both production activities and social interactions within open-ended simulations, revealing significant limitations in current models like GPT-4.1, which achieved only a 12.7% success rate.
Authors:Bruce Coburn, Jiangpeng He, Megan E. Rollo, Satvinder S. Dhaliwal, Deborah A. Kerr, Fengqing Zhu
Abstract:
Large Multimodal Models (LMMs) are increasingly applied to meal images for nutrition analysis. However, existing work primarily evaluates proprietary models, such as GPT-4. This leaves the broad range of LLMs underexplored. Additionally, the influence of integrating contextual metadata and its interaction with various reasoning modifiers remains largely uncharted. This work investigates how interpreting contextual metadata derived from GPS coordinates (converted to location/venue type), timestamps (transformed into meal/day type), and the food items present can enhance LMM performance in estimating key nutritional values. These values include calories, macronutrients (protein, carbohydrates, fat), and portion sizes. We also introduce ACETADA, a new food-image dataset slated for public release. This open dataset provides nutrition information verified by the dietitian and serves as the foundation for our analysis. Our evaluation across eight LMMs (four open-weight and four closed-weight) first establishes the benefit of contextual metadata integration over straightforward prompting with images alone. We then demonstrate how this incorporation of contextual information enhances the efficacy of reasoning modifiers, such as Chain-of-Thought, Multimodal Chain-of-Thought, Scale Hint, Few-Shot, and Expert Persona. Empirical results show that integrating metadata intelligently, when applied through straightforward prompting strategies, can significantly reduce the Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) in predicted nutritional values. This work highlights the potential of context-aware LMMs for improved nutrition analysis.
中文: 本研究通过新发布的ACETADA数据集验证,整合位置与用餐时间等上下文元数据及推理调节器,能显著提升大型多模态模型在营养分析中的预测准确性。
English: This study demonstrates that integrating contextual metadata such as location and meal timing with reasoning modifiers significantly enhances Large Multimodal Models' accuracy in nutritional analysis, as validated by the newly introduced ACETADA dataset.
Authors:Bruce Coburn, Jiangpeng He, Megan E. Rollo, Satvinder S. Dhaliwal, Deborah A. Kerr, Fengqing Zhu
Abstract:
Large Multimodal Models (LMMs) are increasingly applied to meal images for nutrition analysis. However, existing work primarily evaluates proprietary models, such as GPT-4. This leaves the broad range of LLMs underexplored. Additionally, the influence of integrating contextual metadata and its interaction with various reasoning modifiers remains largely uncharted. This work investigates how interpreting contextual metadata derived from GPS coordinates (converted to location/venue type), timestamps (transformed into meal/day type), and the food items present can enhance LMM performance in estimating key nutritional values. These values include calories, macronutrients (protein, carbohydrates, fat), and portion sizes. We also introduce \textbf{ACETADA}, a new food-image dataset slated for public release. This open dataset provides nutrition information verified by the dietitian and serves as the foundation for our analysis. Our evaluation across eight LMMs (four open-weight and four closed-weight) first establishes the benefit of contextual metadata integration over straightforward prompting with images alone. We then demonstrate how this incorporation of contextual information enhances the efficacy of reasoning modifiers, such as Chain-of-Thought, Multimodal Chain-of-Thought, Scale Hint, Few-Shot, and Expert Persona. Empirical results show that integrating metadata intelligently, when applied through straightforward prompting strategies, can significantly reduce the Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) in predicted nutritional values. This work highlights the potential of context-aware LMMs for improved nutrition analysis.
中文: 本研究通过新发布的ACETADA数据集验证,整合位置与用餐时间等上下文元数据及推理调节器,能显著提升大型多模态模型在营养分析中的预测准确性。
English: This study demonstrates that integrating contextual metadata such as location and meal timing with reasoning modifiers significantly enhances Large Multimodal Models' accuracy in nutritional analysis, as validated by the newly introduced ACETADA dataset.
Authors:Yaoqi Huang, Julie Stephany Berrio, Mao Shan, Stewart Worrall
Abstract:
Advances in vision-based sensors and computer vision algorithms have significantly improved the analysis and understanding of traffic scenarios. To facilitate the use of these improvements for road safety, this survey systematically categorizes the critical elements that demand attention in traffic scenarios and comprehensively analyzes available vision-driven tasks and datasets. Compared to existing surveys that focus on isolated domains, our taxonomy categorizes attention-worthy traffic entities into two main groups that are anomalies and normal but critical entities, integrating ten categories and twenty subclasses. It establishes connections between inherently related fields and provides a unified analytical framework. Our survey highlights the analysis of 35 vision-driven tasks and comprehensive examinations and visualizations of 73 available datasets based on the proposed taxonomy. The cross-domain investigation covers the pros and cons of each benchmark with the aim of providing information on standards unification and resource optimization. Our article concludes with a systematic discussion of the existing weaknesses, underlining the potential effects and promising solutions from various perspectives. The integrated taxonomy, comprehensive analysis, and recapitulatory tables serve as valuable contributions to this rapidly evolving field by providing researchers with a holistic overview, guiding strategic resource selection, and highlighting critical research gaps.
中文摘要:本综述通过将需关注的交通实体分为异常与关键常规两类,提出统一分析框架,系统评估了35项视觉任务和73个数据集,为领域研究提供整体视角并指明资源优化方向。
English Summary: This survey introduces a unified taxonomy for traffic scenario analysis by categorizing attention-worthy entities into anomalies and critical normal elements, while comprehensively evaluating 35 vision tasks and 73 datasets to bridge research gaps and guide resource optimization.
Authors:Ziyang Wang, Jaehong Yoon, Shoubin Yu, Md Mohaiminul Islam, Gedas Bertasius, Mohit Bansal
Abstract:
Despite advances in reinforcement learning (RL)-based video reasoning with large language models (LLMs), data collection and finetuning remain significant challenges. These methods often rely on large-scale supervised fine-tuning (SFT) with extensive video data and long Chain-of-Thought (CoT) annotations, making them costly and hard to scale. To address this, we present Video-RTS, a new approach to improve video reasoning capability with drastically improved data efficiency by combining data-efficient RL with a video-adaptive test-time scaling (TTS) strategy. Based on observations about the data scaling of RL samples, we skip the resource-intensive SFT step and employ efficient pure-RL training with output-based rewards, requiring no additional annotations or extensive fine-tuning. Furthermore, to utilize computational resources more efficiently, we introduce a sparse-to-dense video TTS strategy that improves inference by iteratively adding frames based on output consistency. We validate our approach on multiple video reasoning benchmarks, showing that Video-RTS surpasses existing video reasoning models by an average of 2.4% in accuracy using only 3.6% training samples. For example, Video-RTS achieves a 4.2% improvement on Video-Holmes, a recent and challenging video reasoning benchmark, and a 2.6% improvement on MMVU. Notably, our pure RL training and adaptive video TTS offer complementary strengths, enabling Video-RTS's strong reasoning performance.
中文: Video-RTS提出了一种结合数据高效强化学习与自适应测试时缩放的新方法,无需监督微调即可用极少量训练样本实现更优的视频推理性能。
English: Video-RTS introduces a data-efficient reinforcement learning method combined with adaptive test-time scaling, achieving superior video reasoning accuracy with minimal training samples while eliminating resource-intensive supervised fine-tuning.
Authors:Alexander Xiong, Xuandong Zhao, Aneesh Pappu, Dawn Song
Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks, yet they also exhibit memorization of their training data. This phenomenon raises critical questions about model behavior, privacy risks, and the boundary between learning and memorization. Addressing these concerns, this paper synthesizes recent studies and investigates the landscape of memorization, the factors influencing it, and methods for its detection and mitigation. We explore key drivers, including training data duplication, training dynamics, and fine-tuning procedures that influence data memorization. In addition, we examine methodologies such as prefix-based extraction, membership inference, and adversarial prompting, assessing their effectiveness in detecting and measuring memorized content. Beyond technical analysis, we also explore the broader implications of memorization, including the legal and ethical implications. Finally, we discuss mitigation strategies, including data cleaning, differential privacy, and post-training unlearning, while highlighting open challenges in balancing the minimization of harmful memorization with utility. This paper provides a comprehensive overview of the current state of research on LLM memorization across technical, privacy, and performance dimensions, identifying critical directions for future work.
中文: 本文全面研究了大语言模型的记忆现象,分析了其成因、检测方法和缓解策略,同时探讨了相关的隐私风险与伦理影响。
English: This paper comprehensively examines the memorization phenomenon in Large Language Models, analyzing its causes, detection methods, and mitigation strategies while addressing associated privacy risks and ethical implications.
Authors:Boyuan Wang, Xinpan Meng, Xiaofeng Wang, Zheng Zhu, Angen Ye, Yang Wang, Zhiqin Yang, Chaojun Ni, Guan Huang, Xingang Wang
Abstract:
The rapid advancement of Embodied AI has led to an increasing demand for large-scale, high-quality real-world data. However, collecting such embodied data remains costly and inefficient. As a result, simulation environments have become a crucial surrogate for training robot policies. Yet, the significant Real2Sim2Real gap remains a critical bottleneck, particularly in terms of physical dynamics and visual appearance. To address this challenge, we propose EmbodieDreamer, a novel framework that reduces the Real2Sim2Real gap from both the physics and appearance perspectives. Specifically, we propose PhysAligner, a differentiable physics module designed to reduce the Real2Sim physical gap. It jointly optimizes robot-specific parameters such as control gains and friction coefficients to better align simulated dynamics with real-world observations. In addition, we introduce VisAligner, which incorporates a conditional video diffusion model to bridge the Sim2Real appearance gap by translating low-fidelity simulated renderings into photorealistic videos conditioned on simulation states, enabling high-fidelity visual transfer. Extensive experiments validate the effectiveness of EmbodieDreamer. The proposed PhysAligner reduces physical parameter estimation error by 3.74% compared to simulated annealing methods while improving optimization speed by 89.91\%. Moreover, training robot policies in the generated photorealistic environment leads to a 29.17% improvement in the average task success rate across real-world tasks after reinforcement learning. Code, model and data will be publicly available.
具身智能因现实数据采集昂贵而受限,EmbodieDreamer框架通过PhysAligner优化物理参数和VisAligner提升视觉真实感,有效缩小虚实差距,显著提高机器人策略训练的成功率与效率。
Embodied AI's reliance on costly real-world data is addressed by EmbodieDreamer, which narrows the Real2Sim2Real gap through PhysAligner for physics optimization and VisAligner for visual realism, enhancing robot policy training efficiency and success rates.
Authors:Chenfei Xiong, Jingwei Ni, Yu Fan, Vilém Zouhar, Donya Rooein, Lorena Calvo-Bartolomé, Alexander Hoyle, Zhijing Jin, Mrinmaya Sachan, Markus Leippold, Dirk Hovy, Mennatallah El-Assady, Elliott Ash
Abstract:
We introduce Co-DETECT (Collaborative Discovery of Edge cases in TExt ClassificaTion), a novel mixed-initiative annotation framework that integrates human expertise with automatic annotation guided by large language models (LLMs). Co-DETECT starts with an initial, sketch-level codebook and dataset provided by a domain expert, then leverages the LLM to annotate the data and identify edge cases that are not well described by the initial codebook. Specifically, Co-DETECT flags challenging examples, induces high-level, generalizable descriptions of edge cases, and assists user in incorporating edge case handling rules to improve the codebook. This iterative process enables more effective handling of nuanced phenomena through compact, generalizable annotation rules. Extensive user study, qualitative and quantitative analyses prove the effectiveness of Co-DETECT.
中文:Co-DETECT是一种协同标注框架,通过结合专家知识与大语言模型的自动标注能力,以迭代方式优化文本分类中的编码规则,有效处理边缘案例。
English: Co-DETECT is a collaborative annotation framework that combines human expertise with LLM-driven automation to iteratively refine codebooks and handle edge cases in text classification through an efficient, rule-based process.
Authors:Yecheng Wu, Junyu Chen, Zhuoyang Zhang, Enze Xie, Jincheng Yu, Junsong Chen, Jinyi Hu, Yao Lu, Song Han, Han Cai
Abstract:
We introduce DC-AR, a novel masked autoregressive (AR) text-to-image generation framework that delivers superior image generation quality with exceptional computational efficiency. Due to the tokenizers' limitations, prior masked AR models have lagged behind diffusion models in terms of quality or efficiency. We overcome this limitation by introducing DC-HT - a deep compression hybrid tokenizer for AR models that achieves a 32x spatial compression ratio while maintaining high reconstruction fidelity and cross-resolution generalization ability. Building upon DC-HT, we extend MaskGIT and create a new hybrid masked autoregressive image generation framework that first produces the structural elements through discrete tokens and then applies refinements via residual tokens. DC-AR achieves state-of-the-art results with a gFID of 5.49 on MJHQ-30K and an overall score of 0.69 on GenEval, while offering 1.5-7.9x higher throughput and 2.0-3.5x lower latency compared to prior leading diffusion and autoregressive models.
Chinese: DC-AR是一种新颖的掩码自回归框架,通过引入深度压缩混合分词器,在提升图像生成质量的同时显著提高了计算效率,相比现有模型实现了更高的吞吐量和更低的延迟,取得了最先进的成果。
English: DC-AR is a novel masked autoregressive framework that enhances image generation quality and computational efficiency by introducing a deep compression hybrid tokenizer, achieving state-of-the-art results with significantly higher throughput and lower latency than previous models.
Authors:Haotong Qin, Xianglong Liu, Xudong Ma, Lei Ke, Yulun Zhang, Jie Luo, Michele Magno
Abstract:
Deep neural networks for real-time video matting suffer significant computational limitations on edge devices, hindering their adoption in widespread applications such as online conferences and short-form video production. Binarization emerges as one of the most common compression approaches with compact 1-bit parameters and efficient bitwise operations. However, accuracy and efficiency limitations exist in the binarized video matting network due to its degenerated encoder and redundant decoder. Following a theoretical analysis based on the information bottleneck principle, the limitations are mainly caused by the degradation of prediction-relevant information in the intermediate features and the redundant computation in prediction-irrelevant areas. We present BiVM, an accurate and resource-efficient Binarized neural network for Video Matting. First, we present a series of binarized computation structures with elastic shortcuts and evolvable topologies, enabling the constructed encoder backbone to extract high-quality representation from input videos for accurate prediction. Second, we sparse the intermediate feature of the binarized decoder by masking homogeneous parts, allowing the decoder to focus on representation with diverse details while alleviating the computation burden for efficient inference. Furthermore, we construct a localized binarization-aware mimicking framework with the information-guided strategy, prompting matting-related representation in full-precision counterparts to be accurately and fully utilized. Comprehensive experiments show that the proposed BiVM surpasses alternative binarized video matting networks, including state-of-the-art (SOTA) binarization methods, by a substantial margin. Moreover, our BiVM achieves significant savings of 14.3x and 21.6x in computation and storage costs, respectively. We also evaluate BiVM on ARM CPU hardware.
中文:BiVM是一种高效的二值化视频抠图神经网络,通过弹性捷径和可进化拓扑结构提升编码器精度,同时稀疏化解码器特征以减少计算负担,实现了计算和存储成本的大幅节省。
English: BiVM is an efficient binarized neural network for video matting that enhances accuracy through elastic shortcuts and evolvable topologies in its encoder while reducing computational load by sparsifying decoder features, achieving substantial savings in computation and storage.
Authors:Gongwei Chen, Xurui Zhou, Rui Shao, Yibo Lyu, Kaiwen Zhou, Shuai Wang, Wentao Li, Yinchuan Li, Zhongang Qi, Liqiang Nie
Abstract:
The research focus of GUI agents is shifting from text-dependent to pure-vision-based approaches, which, though promising, prioritize comprehensive pre-training data collection while neglecting contextual modeling challenges. We probe the characteristics of element and history contextual modeling in GUI agent and summarize: 1) the high-density and loose-relation of element context highlight the existence of many unrelated elements and their negative influence; 2) the high redundancy of history context reveals the inefficient history modeling in current GUI agents. In this work, we propose a context-aware simplification framework for building an efficient and effective GUI Agent, termed SimpAgent. To mitigate potential interference from numerous unrelated elements, we introduce a masking-based element pruning method that circumvents the intractable relation modeling through an efficient masking mechanism. To reduce the redundancy in historical information, we devise a consistency-guided history compression module, which enhances implicit LLM-based compression through innovative explicit guidance, achieving an optimal balance between performance and efficiency. With the above components, SimpAgent reduces 27% FLOPs and achieves superior GUI navigation performances. Comprehensive navigation experiments across diverse web and mobile environments demonstrate the effectiveness and potential of our agent.
中文: 本研究提出SimpAgent框架,通过屏蔽无关界面元素和压缩冗余历史记录来优化GUI代理的上下文建模,在减少27%计算量的同时显著提升了跨平台导航性能。
English: The study introduces SimpAgent, a context-aware framework that enhances GUI agent efficiency by pruning unrelated elements and compressing redundant history, reducing computational load by 27% while improving navigation performance.
Authors:Antonio Emanuele CinÃ, Maura Pintor, Luca Demetrio, Ambra Demontis, Battista Biggio, Fabio Roli
Abstract:
Despite significant progress in designing powerful adversarial evasion attacks for robustness verification, the evaluation of these methods often remains inconsistent and unreliable. Many assessments rely on mismatched models, unverified implementations, and uneven computational budgets, which can lead to biased results and a false sense of security. Consequently, robustness claims built on such flawed testing protocols may be misleading and give a false sense of security. As a concrete step toward improving evaluation reliability, we present AttackBench, a benchmark framework developed to assess the effectiveness of gradient-based attacks under standardized and reproducible conditions. AttackBench serves as an evaluation tool that ranks existing attack implementations based on a novel optimality metric, which enables researchers and practitioners to identify the most reliable and effective attack for use in subsequent robustness evaluations. The framework enforces consistent testing conditions and enables continuous updates, making it a reliable foundation for robustness verification.
Chinese: 当前对抗性规避攻击的评估常存在不一致和不可靠的问题,导致结果偏差和误导性的鲁棒性声明,为此开发了AttackBench这一标准化基准框架,用于对攻击实现进行排名,以支持可靠的鲁棒性验证。
English: Current evaluations of adversarial evasion attacks are often inconsistent and unreliable, leading to biased results and misleading robustness claims, prompting the development of AttackBench, a standardized benchmark framework that ranks attack implementations for reliable robustness verification.
Authors:Shuo Yang, Yuwei Niu, Yuyang Liu, Yang Ye, Bin Lin, Li Yuan
Abstract:
Multimodal Large Language Models (MLLMs) have achieved remarkable progress in multimodal reasoning. However, they often excessively rely on textual information during the later stages of inference, neglecting the crucial integration of visual input. Current methods typically address this by explicitly injecting visual information to guide the reasoning process. In this work, through an analysis of MLLM attention patterns, we made an intriguing observation: with appropriate guidance, MLLMs can spontaneously re-focus their attention on visual inputs during the later stages of reasoning, even without explicit visual information injection. This spontaneous shift in focus suggests that MLLMs are intrinsically capable of performing visual fusion reasoning. Building on this insight, we introduce Look-Back, an implicit approach designed to guide MLLMs to ``look back" at visual information in a self-directed manner during reasoning. Look-Back empowers the model to autonomously determine when, where, and how to re-focus on visual inputs, eliminating the need for explicit model-structure constraints or additional input. We demonstrate that Look-Back significantly enhances the model's reasoning and perception capabilities, as evidenced by extensive empirical evaluations on multiple multimodal benchmarks.
中文: 本文提出Look-Back方法,通过隐性引导使多模态大语言模型在推理过程中自主重聚焦视觉信息,无需改变模型结构即可显著提升多模态推理与感知能力。
English: This paper introduces Look-Back, an implicit method that guides Multimodal Large Language Models to autonomously refocus on visual inputs during reasoning, significantly enhancing their multimodal capabilities without structural modifications.
Authors:Jing Bi, Chenliang Xu
Abstract:
Learning to perform activities through demonstration requires extracting meaningful information about the environment from observations. In this research, we investigate the challenge of planning high-level goal-oriented actions in a simulation setting from an egocentric perspective. We present a novel task, interactive action planning, and propose an approach that combines topological affordance memory with transformer architecture. The process of memorizing the environment's structure through extracting affordances facilitates selecting appropriate actions based on the context. Moreover, the memory model allows us to detect action deviations while accomplishing specific objectives. To assess the method's versatility, we evaluate it in a realistic interactive simulation environment. Our experimental results demonstrate that the proposed approach learns meaningful representations, resulting in improved performance and robust when action deviations occur.
中文摘要:本研究提出了一种结合拓扑可供性记忆与Transformer架构的交互式行动规划方法,通过记忆环境结构来增强行动选择能力,并在模拟环境中有效检测行动偏差。
English Summary: This research introduces an interactive action planning approach that combines topological affordance memory with transformers to improve goal-oriented action selection and detect deviations in simulated environments.
Authors:Zhuoyang Zhang, Luke J. Huang, Chengyue Wu, Shang Yang, Kelly Peng, Yao Lu, Song Han
Abstract:
We present Locality-aware Parallel Decoding (LPD) to accelerate autoregressive image generation. Traditional autoregressive image generation relies on next-patch prediction, a memory-bound process that leads to high latency. Existing works have tried to parallelize next-patch prediction by shifting to multi-patch prediction to accelerate the process, but only achieved limited parallelization. To achieve high parallelization while maintaining generation quality, we introduce two key techniques: (1) Flexible Parallelized Autoregressive Modeling, a novel architecture that enables arbitrary generation ordering and degrees of parallelization. It uses learnable position query tokens to guide generation at target positions while ensuring mutual visibility among concurrently generated tokens for consistent parallel decoding. (2) Locality-aware Generation Ordering, a novel schedule that forms groups to minimize intra-group dependencies and maximize contextual support, enhancing generation quality. With these designs, we reduce the generation steps from 256 to 20 (256$\times$256 res.) and 1024 to 48 (512$\times$512 res.) without compromising quality on the ImageNet class-conditional generation, and achieving at least 3.4$\times$ lower latency than previous parallelized autoregressive models.
Chinese: 局部感知并行解码(LPD)通过引入灵活并行化建模和局部感知生成顺序,将256×256分辨率图像的生成步骤从256步减少到20步,在保持质量的同时实现了至少3.4倍的延迟降低。
English: The Locality-aware Parallel Decoding (LPD) method accelerates autoregressive image generation by introducing flexible parallelized modeling and locality-aware ordering, reducing steps from 256 to 20 for 256x256 resolution and achieving over 3.4x lower latency without quality loss.
Authors:Jingjing Qu, Kejia Hu, Jun Zhu, Wenhao Li, Teng Wang, Zhiyun Chen, Yulei Ye, Chaochao Lu, Aimin Zhou, Xiangfeng Wang, James Evans
Abstract:
The integration of Large Language Models (LLMs) into social science experiments represents a transformative approach to understanding human-AI interactions and their societal impacts. We introduce Epitome, the world's first open experimental platform dedicated to the deep integration of artificial intelligence and social science. Rooted in theoretical foundations from management, communication studies, sociology, psychology, and ethics, Epitome focuses on the interactive impacts of AI on individuals, organizations, and society during its real-world deployment. It constructs a theoretical support system through cross-disciplinary experiments. The platform offers a one-stop comprehensive experimental solution spanning "foundation models-complex application development-user feedback" through seven core modules, while embedding the classical "control-comparison-comparative causal logic" of social science experiments into multilevel human-computer interaction environments, including dialogues, group chats, and multi-agent virtual scenarios. With its canvas-style, user-friendly interface, Epitome enables researchers to easily design and run complex experimental scenarios, facilitating systematic investigations into the social impacts of AI and exploration of integrated solutions.To demonstrate its capabilities, we replicated three seminal social science experiments involving LLMs, showcasing Epitome's potential to streamline complex experimental designs and produce robust results, suitable for publishing in the top selective journals. Our findings highlight the platform's utility in enhancing the efficiency and quality of human-AI interactions, providing valuable insights into the societal implications of AI technologies. Epitome thus offers a powerful tool for advancing interdisciplinary research at the intersection of AI and social science, with potential applications in policy-making, ...
中文摘要:Epitome是全球首个深度融合人工智能与社会科学的开放实验平台,通过跨学科实验和直观界面助力研究人员探索人机交互的社会影响,推动前沿学术研究。
English Summary: Epitome is the first open platform integrating AI with social science to study human-AI interactions through multidisciplinary experiments, offering user-friendly tools for designing complex scenarios and generating robust research outcomes.
Authors:Djamahl Etchegaray, Yuxia Fu, Zi Huang, Yadan Luo
Abstract:
Interpretable communication is essential for safe and trustworthy autonomous driving, yet current vision-language models (VLMs) often operate under idealized assumptions and struggle to capture user intent in real-world scenarios. Existing driving-oriented VQA datasets are limited to full-scene descriptions or waypoint prediction, preventing the assessment of whether VLMs can respond to localized user-driven queries. We introduce Box-QAymo, a box-referring dataset and benchmark designed to both evaluate and finetune VLMs on spatial and temporal reasoning over user-specified objects. Users express intent by drawing bounding boxes, offering a fast and intuitive interface for focused queries in complex scenes. Specifically, we propose a hierarchical evaluation protocol that begins with binary sanity-check questions to assess basic model capacities, and progresses to (1) attribute prediction for box-referred objects, (2) motion understanding of target instances, and (3) spatiotemporal motion reasoning over inter-object dynamics across frames. To support this, we crowd-sourced fine-grained object classes and visual attributes that reflect the complexity drivers encounter, and extract object trajectories to construct temporally grounded QA pairs. Rigorous quality control through negative sampling, temporal consistency checks, and difficulty-aware balancing guarantee dataset robustness and diversity. Our comprehensive evaluation reveals significant limitations in current VLMs when queried about perception questions, highlighting the gap in achieving real-world performance. This work provides a foundation for developing more robust and interpretable autonomous driving systems that can communicate effectively with users under real-world conditions. Project page and dataset are available at https://djamahl99.github.io/qaymo-pages/.
中文: 本文提出Box-QAymo数据集和基准,通过用户指定的边界框评估并增强视觉语言模型的时空推理能力,以解决当前自动驾驶可解释通信中的实际缺陷。
English: This paper introduces Box-QAymo, a dataset and benchmark that evaluates and enhances vision-language models' spatial-temporal reasoning through user-specified bounding boxes, addressing current limitations in interpretable communication for autonomous driving.
Authors:Ehsan Latif, Zirak Khan, Xiaoming Zhai
Abstract:
Scientific sketches (e.g., models) offer a powerful lens into students' conceptual understanding, yet AI-powered automated assessment of such free-form, visually diverse artifacts remains a critical challenge. Existing solutions often treat sketch evaluation as either an image classification task or monolithic vision-language models, which lack interpretability, pedagogical alignment, and adaptability across cognitive levels. To address these limitations, we present SketchMind, a cognitively grounded, multi-agent framework for evaluating and improving student-drawn scientific sketches. SketchMind comprises modular agents responsible for rubric parsing, sketch perception, cognitive alignment, and iterative feedback with sketch modification, enabling personalized and transparent evaluation. We evaluate SketchMind on a curated dataset of 3,575 student-generated sketches across six science assessment items with different highest order of Bloom's level that require students to draw models to explain phenomena. Compared to baseline GPT-4o performance without SRG (average accuracy: 55.6%), and with SRG integration achieves 77.1% average accuracy (+21.4% average absolute gain). We also demonstrate that multi-agent orchestration with SRG enhances SketchMind performance, for example, GPT-4.1 gains an average 8.9% increase in sketch prediction accuracy, outperforming single-agent pipelines across all items. Human evaluators rated the feedback and co-created sketches generated by \textsc{SketchMind} with GPT-4.1, which achieved an average of 4.1 out of 5, significantly higher than those of baseline models (e.g., 2.3 for GPT-4o). Experts noted the system's potential to meaningfully support conceptual growth through guided revision. Our code and (pending approval) dataset will be released to support reproducibility and future research in AI-driven education.
Chinese: SketchMind提出了一种基于认知的多智能体框架,通过模块化评估和迭代反馈,显著提升了AI对学生科学草图评估的准确性、可解释性和个性化指导能力。
English: SketchMind introduces a cognitively grounded, multi-agent framework that significantly improves AI-based evaluation of student-drawn scientific sketches by enhancing accuracy, interpretability, and personalized feedback compared to existing methods.
Authors:Zekai Sun, Xiuxian Guan, Zheng Lin, Yuhao Qing, Haoze Song, Zihan Fang, Zhe Chen, Fangming Liu, Heming Cui, Wei Ni, Jun Luo
Abstract:
Deploying Machine Learning (ML) applications on resource-constrained mobile devices remains challenging due to limited computational resources and poor platform compatibility. While Mobile Edge Computing (MEC) offers offloading-based inference paradigm using GPU servers, existing approaches are divided into non-transparent and transparent methods, with the latter necessitating modifications to the source code. Non-transparent offloading achieves high performance but requires intrusive code modification, limiting compatibility with diverse applications. Transparent offloading, in contrast, offers wide compatibility but introduces significant transmission delays due to per-operator remote procedure calls (RPCs). To overcome this limitation, we propose RRTO, the first high-performance transparent offloading system tailored for MEC inference. RRTO introduces a record/replay mechanism that leverages the static operator sequence in ML models to eliminate repetitive RPCs. To reliably identify this sequence, RRTO integrates a novel Operator Sequence Search algorithm that detects repeated patterns, filters initialization noise, and accelerates matching via a two-level strategy. Evaluation demonstrates that RRTO achieves substantial reductions of up to 98% in both per-inference latency and energy consumption compared to state-of-the-art transparent methods and yields results comparable to non-transparent approaches, all without necessitating any source code modification.
中文: RRTO是一种面向移动边缘计算的高性能透明卸载系统,通过记录/回放机制和算子序列搜索消除重复远程调用,在不修改代码的情况下实现了接近非透明方法的性能。
English: RRTO is a high-performance transparent offloading system for mobile edge computing that eliminates repetitive remote procedure calls through a record/replay mechanism and operator sequence search, achieving near-non-transparent performance without code modifications.
Authors:Wei Guan, Jun Lan, Jian Cao, Hao Tan, Huijia Zhu, Weiqiang Wang
Abstract:
Industrial anomaly detection (IAD) plays a crucial role in maintaining the safety and reliability of manufacturing systems. While multimodal large language models (MLLMs) show strong vision-language reasoning abilities, their effectiveness in IAD remains limited without domain-specific adaptation. In this work, we propose EMIT, a unified framework that enhances MLLMs for IAD via difficulty-aware group relative policy optimization (GRPO). EMIT constructs a multi-task IAD dataset and utilizes GPT-generated object text descriptions to compensate for missing defective images. For few-shot anomaly detection, it integrates a soft prompt and heatmap-guided contrastive embeddings derived from patch-level comparisons. To better handle difficult data samples, i.e., cases where the MLLM struggles to generate correct answers, we propose a difficulty-aware GRPO that extends the original GRPO by incorporating a response resampling strategy to ensure the inclusion of correct answers in the sampled responses, as well as an advantage reweighting mechanism to strengthen learning from such difficult data samples. Extensive experiments on the MMAD benchmark demonstrate that EMIT significantly enhances the IAD performance of MLLMs, achieving an average improvement of 7.77\% over the base model (InternVL3-8B) across seven tasks.
中文摘要:本文提出EMIT框架,通过难度感知分组相对策略优化增强多模态大语言模型在工业异常检测中的能力,在基准测试中实现了显著性能提升。
English Summary: This paper introduces EMIT, a unified framework that enhances multimodal large language models for industrial anomaly detection through difficulty-aware group relative policy optimization, achieving significant performance improvements on benchmark tasks.
Authors:Yihan Sun, Yuqi Cheng, Yunkang Cao, Yuxin Zhang, Weiming Shen
Abstract:
3D anomaly detection is critical in industrial quality inspection. While existing methods achieve notable progress, their performance degrades in high-precision 3D anomaly detection due to insufficient global information. To address this, we propose Multi-View Reconstruction (MVR), a method that losslessly converts high-resolution point clouds into multi-view images and employs a reconstruction-based anomaly detection framework to enhance global information learning. Extensive experiments demonstrate the effectiveness of MVR, achieving 89.6\% object-wise AU-ROC and 95.7\% point-wise AU-ROC on the Real3D-AD benchmark.
中文: 本文提出多视角重建(MVR)方法,通过将高分辨率点云无损转换为多视角图像来增强全局信息学习,在Real3D-AD基准测试中实现了优异的3D异常检测性能。
English: This paper introduces Multi-View Reconstruction (MVR), a method that transforms high-resolution point clouds into multi-view images to improve global information learning for enhanced 3D anomaly detection, achieving state-of-the-art results on the Real3D-AD benchmark.
Authors:Yongzheng Liu, Yiming Wang, Po Xu, Yingjie Xu, Yuntian Chen, Dongxiao Zhang
Abstract:
Due to the extensive availability of operation data, data-driven methods show strong capabilities in predicting building energy loads. Buildings with similar features often share energy patterns, reflected by spatial dependencies in their operational data, which conventional prediction methods struggle to capture. To overcome this, we propose a multi-building prediction approach using spatio-temporal graph neural networks, comprising graph representation, graph learning, and interpretation. First, a graph is built based on building characteristics and environmental factors. Next, a multi-level graph convolutional architecture with attention is developed for energy prediction. Lastly, a method interpreting the optimized graph structure is introduced. Experiments on the Building Data Genome Project 2 dataset confirm superior performance over baselines such as XGBoost, SVR, FCNN, GRU, and Naive, highlighting the method's robustness, generalization, and interpretability in capturing meaningful building similarities and spatial relationships.
中文: 本研究提出了一种基于时空图神经网络的多建筑能耗预测方法,通过构建图结构和注意力机制,在Building Data Genome Project 2数据集上验证了其优于传统模型的性能,能有效捕捉建筑间的空间依赖性和相似性。
English: The study introduces a multi-building energy prediction method using spatio-temporal graph neural networks, which outperforms traditional models by effectively capturing spatial dependencies and building similarities, as validated on the Building Data Genome Project 2 dataset.
Authors:Seoyoung Doh, Hyeon Jeon, Sungbok Shin, Ghulam Jilani Quadri, Nam Wook Kim, Jinwook Seo
Abstract:
Selecting the dimensionality reduction technique that faithfully represents the structure is essential for reliable visual communication and analytics. In reality, however, practitioners favor projections for other attractions, such as aesthetics and visual saliency, over the projection's structural faithfulness, a bias we define as visual interestingness. In this research, we conduct a user study that (1) verifies the existence of such bias and (2) explains why the bias exists. Our study suggests that visual interestingness biases practitioners' preferences when selecting projections for analysis, and this bias intensifies with color-encoded labels and shorter exposure time. Based on our findings, we discuss strategies to mitigate bias in perceiving and interpreting DR projections.
中文: 本研究证实了实践者存在视觉趣味性偏见,即倾向于选择美观而非结构保真的降维投影,这种偏见在彩色标签和短暂暴露下加剧,并提出了缓解策略。
English: This study confirms that practitioners exhibit a visual interestingness bias, favoring aesthetically appealing projections over structurally faithful ones, which intensifies with color labels and brief exposure, and proposes mitigation strategies.
Authors:Yuzhang Xie, Xu Han, Ran Xu, Xiao Hu, Jiaying Lu, Carl Yang
Abstract:
Knowledge graphs (KGs) are important products of the semantic web, which are widely used in various application domains. Healthcare is one of such domains where KGs are intensively used, due to the high requirement for knowledge accuracy and interconnected nature of healthcare data. However, KGs storing general factual information often lack the ability to account for important contexts of the knowledge such as the status of specific patients, which are crucial in precision healthcare. Meanwhile, electronic health records (EHRs) provide rich personal data, including various diagnoses and medications, which provide natural contexts for general KGs. In this paper, we propose HypKG, a framework that integrates patient information from EHRs into KGs to generate contextualized knowledge representations for accurate healthcare predictions. Using advanced entity-linking techniques, we connect relevant knowledge from general KGs with patient information from EHRs, and then utilize a hypergraph model to "contextualize" the knowledge with the patient information. Finally, we employ hypergraph transformers guided by downstream prediction tasks to jointly learn proper contextualized representations for both KGs and patients, fully leveraging existing knowledge in KGs and patient contexts in EHRs. In experiments using a large biomedical KG and two real-world EHR datasets, HypKG demonstrates significant improvements in healthcare prediction tasks across multiple evaluation metrics. Additionally, by integrating external contexts, HypKG can learn to adjust the representations of entities and relations in KG, potentially improving the quality and real-world utility of knowledge.
Chinese: 本文提出HypKG框架,通过超图变换器将电子健康记录与知识图谱相融合,构建情境化知识表示,显著提升了医疗预测任务的性能表现。
English: This paper introduces HypKG, a framework that integrates electronic health records with knowledge graphs using hypergraph transformers to create contextualized knowledge representations, significantly enhancing healthcare prediction accuracy.
Authors:Vangelis Kostoulas, Arthur Guijt, Ellen M. Kerkhof, Bradley R. Pieters, Peter A. N. Bosman, Tanja Alderliesten
Abstract:
Brachytherapy involves bringing a radioactive source near tumor tissue using implanted needles. Image-guided brachytherapy planning requires amongst others, the reconstruction of the needles. Manually annotating these needles on patient images can be a challenging and time-consuming task for medical professionals. For automatic needle reconstruction, a two-stage pipeline is commonly adopted, comprising a segmentation stage followed by a post-processing stage. While deep learning models are effective for segmentation, their results often contain errors. No currently existing post-processing technique is robust to all possible segmentation errors. We therefore propose adaptations to existing post-processing techniques mainly aimed at dealing with segmentation errors and thereby improving the reconstruction accuracy. Experiments on a prostate cancer dataset, based on MRI scans annotated by medical professionals, demonstrate that our proposed adaptations can help to effectively manage segmentation errors, with the best adapted post-processing technique achieving median needle-tip and needle-bottom point localization errors of $1.07$ (IQR $\pm 1.04$) mm and $0.43$ (IQR $\pm 0.46$) mm, respectively, and median shaft error of $0.75$ (IQR $\pm 0.69$) mm with 0 false positive and 0 false negative needles on a test set of 261 needles.
中文: 本研究改进了近距离放疗针重建中的后处理技术,以更有效处理深度学习模型的分割误差,在前列腺癌MRI数据集上显著提升了定位精度。
English: This study proposes enhancements to existing post-processing techniques in brachytherapy needle reconstruction to better handle segmentation errors from deep learning models, achieving significantly improved localization accuracy on a prostate cancer MRI dataset.
Authors:Ioannis Sarridis, Christos Koutlis, Symeon Papadopoulos, Christos Diou
Abstract:
Bias in computer vision models remains a significant challenge, often resulting in unfair, unreliable, and non-generalizable AI systems. Although research into bias mitigation has intensified, progress continues to be hindered by fragmented implementations and inconsistent evaluation practices. Disparate datasets and metrics used across studies complicate reproducibility, making it difficult to fairly assess and compare the effectiveness of various approaches. To overcome these limitations, we introduce the Visual Bias Mitigator (VB-Mitigator), an open-source framework designed to streamline the development, evaluation, and comparative analysis of visual bias mitigation techniques. VB-Mitigator offers a unified research environment encompassing 12 established mitigation methods, 7 diverse benchmark datasets. A key strength of VB-Mitigator is its extensibility, allowing for seamless integration of additional methods, datasets, metrics, and models. VB-Mitigator aims to accelerate research toward fairness-aware computer vision models by serving as a foundational codebase for the research community to develop and assess their approaches. To this end, we also recommend best evaluation practices and provide a comprehensive performance comparison among state-of-the-art methodologies.
中文: VB-Mitigator是一个开源框架,通过整合多种视觉偏见缓解方法与数据集来解决评估标准碎片化问题,旨在推动计算机视觉领域建立更公平的AI系统。
English: The Visual Bias Mitigator (VB-Mitigator) is an open-source framework that unifies diverse bias mitigation methods and datasets to address fragmented evaluation practices in computer vision, accelerating the development of fairer AI systems.
Authors:Boheng Li, Renjie Gu, Junjie Wang, Leyi Qi, Yiming Li, Run Wang, Zhan Qin, Tianwei Zhang
Abstract:
Text-to-image (T2I) diffusion models have achieved impressive image generation quality and are increasingly fine-tuned for personalized applications. However, these models often inherit unsafe behaviors from toxic pretraining data, raising growing safety concerns. While recent safety-driven unlearning methods have made promising progress in suppressing model toxicity, they are identified to be fragile to downstream fine-tuning, where we reveal that state-of-the-art methods largely fail to retain their effectiveness even when fine-tuned on entirely benign datasets. To mitigate this problem, in this paper, we propose ResAlign, a safety-driven unlearning framework with enhanced resilience against downstream fine-tuning. By modeling downstream fine-tuning as an implicit optimization problem with a Moreau Envelope-based reformulation, ResAlign enables efficient gradient estimation to minimize the recovery of harmful behaviors. Additionally, a meta-learning strategy is proposed to simulate a diverse distribution of fine-tuning scenarios to improve generalization. Extensive experiments across a wide range of datasets, fine-tuning methods, and configurations demonstrate that ResAlign consistently outperforms prior unlearning approaches in retaining safety after downstream fine-tuning while preserving benign generation capability well.
Chinese: 本文提出了ResAlign,一种具有韧性的安全驱动遗忘框架,能有效抑制文本到图像扩散模型中的有害行为,并在下游良性数据集微调后仍保持安全性,优于现有方法。
English: The paper introduces ResAlign, a resilient safety-driven unlearning framework that effectively mitigates toxic behaviors in text-to-image diffusion models and maintains safety even after downstream fine-tuning on benign datasets, outperforming prior methods.
Authors:Haoyue Zhang, Jie Zhang, Song Guo
Abstract:
Although vision transformers (ViT) have shown remarkable success in various vision tasks, their computationally expensive self-attention hinder their deployment on resource-constrained devices. Token reduction, which discards less important tokens during forward propagation, has been proposed to enhance the efficiency of transformer models. However, existing methods handle unimportant tokens irreversibly, preventing their reuse in subsequent blocks. Considering that transformers focus on different information among blocks, tokens reduced in early blocks might be useful later. Furthermore, to adapt transformer models for resource-constrained devices, it is crucial to strike a balance between model performance and computational overhead. To address these challenges, in this paper, we introduce a novel Token Freezing and Reusing (ToFe) framework, where we identify important tokens at each stage and temporarily freeze the unimportant ones, allowing their lagged reusing at a later stage. Specifically, we design a prediction module for token identification and an approximate module for recovery of the frozen tokens. By jointly optimizing with the backbone through computation budget-aware end-to-end training, ToFe can adaptively process the necessary tokens at each block, thereby reducing computational cost while maintaining performance. Extensive experiments demonstrate that ToFe reduces the computational cost of LV-ViT model by 50% with less than 2% drop in Top-1 accuracy, achieving a better trade-off between performance and complexity compared to state-of-the-art methods.
中文: ToFe框架通过自适应地冻结和重用视觉Transformer中的令牌,将计算成本降低50%且性能损失极小,实现了效率与精度的更优平衡。
English: The ToFe framework adaptively freezes and reuses tokens in vision transformers to reduce computational costs by 50% with minimal performance loss, achieving a superior balance between efficiency and accuracy.
Authors:Wenjie Huang, Qi Yang, Shuting Xia, He Huang, Zhu Li, Yiling Xu
Abstract:
Existing AI-based point cloud compression methods struggle with dependence on specific training data distributions, which limits their real-world deployment. Implicit Neural Representation (INR) methods solve the above problem by encoding overfitted network parameters to the bitstream, resulting in more distribution-agnostic results. However, due to the limitation of encoding time and decoder size, current INR based methods only consider lossy geometry compression. In this paper, we propose the first INR based lossless point cloud geometry compression method called Lossless Implicit Neural Representations for Point Cloud Geometry Compression (LINR-PCGC). To accelerate encoding speed, we design a group of point clouds level coding framework with an effective network initialization strategy, which can reduce around 60% encoding time. A lightweight coding network based on multiscale SparseConv, consisting of scale context extraction, child node prediction, and model compression modules, is proposed to realize fast inference and compact decoder size. Experimental results show that our method consistently outperforms traditional and AI-based methods: for example, with the convergence time in the MVUB dataset, our method reduces the bitstream by approximately 21.21% compared to G-PCC TMC13v23 and 21.95% compared to SparsePCGC. Our project can be seen on https://huangwenjie2023.github.io/LINR-PCGC/.
Chinese: 本文提出了LINR-PCGC,这是首个基于隐式神经表示的无损点云几何压缩方法,通过优化编码框架和轻量级网络设计,在显著减少编码时间的同时实现了比现有方法更优的压缩性能。
English: This paper introduces LINR-PCGC, the first lossless point cloud geometry compression method using Implicit Neural Representations, which significantly reduces encoding time and achieves superior compression rates compared to existing methods.
Authors:Fangqiu Yi, Jingyu Xu, Jiawei Shao, Chi Zhang, Xuelong Li
Abstract:
Perceptual studies demonstrate that conditional diffusion models excel at reconstructing video content aligned with human visual perception. Building on this insight, we propose a video compression framework that leverages conditional diffusion models for perceptually optimized reconstruction. Specifically, we reframe video compression as a conditional generation task, where a generative model synthesizes video from sparse, yet informative signals. Our approach introduces three key modules: (1) Multi-granular conditioning that captures both static scene structure and dynamic spatio-temporal cues; (2) Compact representations designed for efficient transmission without sacrificing semantic richness; (3) Multi-condition training with modality dropout and role-aware embeddings, which prevent over-reliance on any single modality and enhance robustness. Extensive experiments show that our method significantly outperforms both traditional and neural codecs on perceptual quality metrics such as Fréchet Video Distance (FVD) and LPIPS, especially under high compression ratios.
Chinese: 本研究提出了一种基于条件扩散模型的视频压缩框架,通过感知优化重构,在高压缩比下显著超越了传统和神经编解码器在感知质量指标上的表现。
English: This study introduces a video compression framework using conditional diffusion models to achieve perceptually optimized reconstruction, which significantly outperforms traditional and neural codecs in perceptual quality metrics under high compression ratios.
Authors:Tanjin Taher Toma, Tejas Sudharshan Mathai, Bikash Santra, Pritam Mukherjee, Jianfei Liu, Wesley Jong, Darwish Alabyad, Vivek Batheja, Abhishek Jha, Mayank Patel, Darko Pucar, Jayadira del Rivero, Karel Pacak, Ronald M. Summers
Abstract:
Accurate segmentation of pheochromocytoma (PCC) in abdominal CT scans is essential for tumor burden estimation, prognosis, and treatment planning. It may also help infer genetic clusters, reducing reliance on expensive testing. This study systematically evaluates anatomical priors to identify configurations that improve deep learning-based PCC segmentation. We employed the nnU-Net framework to evaluate eleven annotation strategies for accurate 3D segmentation of pheochromocytoma, introducing a set of novel multi-class schemes based on organ-specific anatomical priors. These priors were derived from adjacent organs commonly surrounding adrenal tumors (e.g., liver, spleen, kidney, aorta, adrenal gland, and pancreas), and were compared against a broad body-region prior used in previous work. The framework was trained and tested on 105 contrast-enhanced CT scans from 91 patients at the NIH Clinical Center. Performance was measured using Dice Similarity Coefficient (DSC), Normalized Surface Distance (NSD), and instance-wise F1 score. Among all strategies, the Tumor + Kidney + Aorta (TKA) annotation achieved the highest segmentation accuracy, significantly outperforming the previously used Tumor + Body (TB) annotation across DSC (p = 0.0097), NSD (p = 0.0110), and F1 score (25.84% improvement at an IoU threshold of 0.5), measured on a 70-30 train-test split. The TKA model also showed superior tumor burden quantification (R^2 = 0.968) and strong segmentation across all genetic subtypes. In five-fold cross-validation, TKA consistently outperformed TB across IoU thresholds (0.1 to 0.5), reinforcing its robustness and generalizability. These findings highlight the value of incorporating relevant anatomical context into deep learning models to achieve precise PCC segmentation, offering a valuable tool to support clinical assessment and longitudinal disease monitoring in PCC patients.
Chinese: 本研究证明,在腹部CT扫描中采用肿瘤+肾脏+主动脉(TKA)标注策略作为解剖学先验知识,能显著提升基于深度学习的嗜铬细胞瘤分割效果,其性能优于既往方法,为临床评估提供了更精准的工具。
English: This study demonstrates that incorporating anatomical priors, particularly the Tumor + Kidney + Aorta (TKA) annotation strategy, significantly enhances deep learning-based pheochromocytoma segmentation in abdominal CT scans, outperforming previous methods and improving clinical assessment tools.
Authors:Zhexuan Xu, Jie Wang, Siyuan Xu, Zijie Geng, Mingxuan Yuan, Feng Wu
Abstract:
Floorplanning determines the shapes and locations of modules on a chip canvas and plays a critical role in optimizing the chip's Power, Performance, and Area (PPA) metrics. However, existing floorplanning approaches often fail to integrate with subsequent physical design stages, leading to suboptimal in-module component placement and excessive inter-module feedthrough. To tackle this challenge, we propose Flora, a three-stage feedthrough and placement aware rectilinear floorplanner. In the first stage, Flora employs wiremask and position mask techniques to achieve coarse-grained optimization of HPWL and feedthrough. In the second stage, under the constraint of a fixed outline, Flora achieves a zero-whitespace layout by locally resizing module shapes, thereby performing fine-grained optimization of feedthrough and improving component placement. In the third stage, Flora utilizes a fast tree search-based method to efficiently place components-including macros and standard cells-within each module, subsequently adjusting module boundaries based on the placement results to enable cross-stage optimization. Experimental results show that Flora outperforms recent state-of-the-art floorplanning approaches, achieving an average reduction of 6% in HPWL, 5.16% in FTpin, 29.15% in FTmod, and a 14% improvement in component placement performance.
中文: Flora是一种三阶段矩形平面规划器,通过集成馈通和布局感知优化芯片设计,在布线长度、馈通和组件布局方面相比现有方法取得了显著提升。
English: Flora is a three-stage rectilinear floorplanner that optimizes chip design by integrating feedthrough and placement awareness, achieving significant improvements in wirelength, feedthrough, and component placement over existing methods.
Authors:Yang Liu, Li Zhang, Fang Liu, Zhuohang Wang, Donglin Wei, Zhishuo Yang, Kechi Zhang, Jia Li, Lin Shi
Abstract:
Repository-level code generation aims to generate code within the context of a specified repository. Existing approaches typically employ retrieval-augmented generation (RAG) techniques to provide LLMs with relevant contextual information extracted from the repository. However, these approaches often struggle with effectively identifying truly relevant contexts that capture the rich semantics of the repository, and their contextual perspectives remains narrow. Moreover, most approaches fail to account for the structural relationships in the retrieved code during prompt construction, hindering the LLM's ability to accurately interpret the context. To address these issues, we propose RepoScope, which leverages call chain-aware multi-view context for repository-level code generation. RepoScope constructs a Repository Structural Semantic Graph (RSSG) and retrieves a comprehensive four-view context, integrating both structural and similarity-based contexts. We propose a novel call chain prediction method that utilizes the repository's structural semantics to improve the identification of callees in the target function. Additionally, we present a structure-preserving serialization algorithm for prompt construction, ensuring the coherence of the context for the LLM. Notably, RepoScope relies solely on static analysis, eliminating the need for additional training or multiple LLM queries, thus ensuring both efficiency and generalizability. Evaluation on widely-used repository-level code generation benchmarks (CoderEval and DevEval) demonstrates that RepoScope outperforms state-of-the-art methods, achieving up to a 36.35% relative improvement in pass@1 scores. Further experiments emphasize RepoScope's potential to improve code generation across different tasks and its ability to integrate effectively with existing approaches.
中文: RepoScope提出了一种基于调用链感知多视角上下文的仓库级代码生成方法,通过构建仓库结构语义图和结构保持序列化算法,有效提升了上下文相关性和结构理解能力,在无需额外训练的情况下显著优于现有方法。
English: RepoScope introduces a repository-level code generation method that utilizes call chain-aware multi-view context and a structure-preserving serialization algorithm to enhance context relevance and structural understanding, achieving significant performance improvements over existing approaches without requiring additional training.
Authors:Go Nishikawa, Wataru Nakata, Yuki Saito, Kanami Imamura, Hiroshi Saruwatari, Tomohiko Nakamura
Abstract:
We introduce our submission to the AudioMOS Challenge (AMC) 2025 Track 3: mean opinion score (MOS) prediction for speech with multiple sampling frequencies (SFs). Our submitted model integrates an SF-independent (SFI) convolutional layer into a self-supervised learning (SSL) model to achieve SFI speech feature extraction for MOS prediction. We present some strategies to improve the MOS prediction performance of our model: distilling knowledge from a pretrained non-SFI-SSL model and pretraining with a large-scale MOS dataset. Our submission to the AMC 2025 Track 3 ranked the first in one evaluation metric and the fourth in the final ranking. We also report the results of our ablation study to investigate essential factors of our model.
Chinese: 我们为2025年音频MOS挑战赛设计的模型采用独立于采样频率的卷积层结合自监督学习进行MOS预测,通过知识蒸馏和大规模预训练提升性能,在比赛中获得一项指标第一和综合排名第四的成绩。
English: Our model for the AudioMOS Challenge 2025 uses a sampling frequency-independent convolutional layer with self-supervised learning for MOS prediction, enhanced by knowledge distillation and large-scale pretraining, achieving top rankings in the competition.
Authors:Niccolò Di Marco, Anita Bonetti, Edoardo Di Martino, Edoardo Loru, Jacopo Nudo, Mario Edoardo Pandolfo, Giulio Pecile, Emanuele Sangiorgio, Irene Scalco, Simon Zollo, Matteo Cinelli, Fabiana Zollo, Walter Quattrociocchi
Abstract:
The rise of digital platforms has enabled the large scale observation of individual and collective behavior through high resolution interaction data. This development has opened new analytical pathways for investigating how information circulates, how opinions evolve, and how coordination emerges in online environments. Yet despite a growing body of research, the field remains fragmented and marked by methodological heterogeneity, limited model validation, and weak integration across domains. This survey offers a systematic synthesis of empirical findings and formal models. We examine platform-level regularities, assess the methodological architectures that generate them, and evaluate the extent to which current modeling frameworks account for observed dynamics. The goal is to consolidate a shared empirical baseline and clarify the structural constraints that shape inference in this domain, laying the groundwork for more robust, comparable, and actionable analyses of online social systems.
中文: 数字平台实现了对在线行为的大规模观察,但由于方法异质性和验证不足,研究仍显零散;本综述系统整合实证发现与模型,为在线社交系统的更稳健分析奠定共同基础。
English: Digital platforms enable large-scale observation of online behavior, but research remains fragmented due to methodological heterogeneity and weak validation; this survey systematically synthesizes empirical findings and models to establish a shared baseline for more robust analyses of social systems.
Authors:Yuqi Cheng, Yunkang Cao, Haiming Yao, Wei Luo, Cheng Jiang, Hui Zhang, Weiming Shen
Abstract:
Industrial defect detection is vital for upholding product quality across contemporary manufacturing systems. As the expectations for precision, automation, and scalability intensify, conventional inspection approaches are increasingly found wanting in addressing real-world demands. Notable progress in computer vision and deep learning has substantially bolstered defect detection capabilities across both 2D and 3D modalities. A significant development has been the pivot from closed-set to open-set defect detection frameworks, which diminishes the necessity for extensive defect annotations and facilitates the recognition of novel anomalies. Despite such strides, a cohesive and contemporary understanding of industrial defect detection remains elusive. Consequently, this survey delivers an in-depth analysis of both closed-set and open-set defect detection strategies within 2D and 3D modalities, charting their evolution in recent years and underscoring the rising prominence of open-set techniques. We distill critical challenges inherent in practical detection environments and illuminate emerging trends, thereby providing a current and comprehensive vista of this swiftly progressing field.
中文: 本综述深入分析了工业缺陷检测从封闭集到开放集方法的演变,涵盖了2D和3D模式下的发展现状,并探讨了该领域当前面临的挑战与新兴趋势。
English: This survey comprehensively examines the evolution of industrial defect detection, highlighting the shift from traditional closed-set methods to advanced open-set frameworks in both 2D and 3D modalities, while addressing current challenges and emerging trends in the field.
Authors:Yiting Yang, Hao Luo, Yuan Sun, Qingsen Yan, Haokui Zhang, Wei Dong, Guoqing Wang, Peng Wang, Yang Yang, Hengtao Shen
Abstract:
A prevalent approach in Parameter-Efficient Fine-Tuning (PEFT) of pre-trained Vision Transformers (ViT) involves freezing the majority of the backbone parameters and solely learning low-rank adaptation weight matrices to accommodate downstream tasks. These low-rank matrices are commonly derived through the multiplication structure of down-projection and up-projection matrices, exemplified by methods such as LoRA and Adapter. In this work, we observe an approximate orthogonality among any two row or column vectors within any weight matrix of the backbone parameters; however, this property is absent in the vectors of the down/up-projection matrices. Approximate orthogonality implies a reduction in the upper bound of the model's generalization error, signifying that the model possesses enhanced generalization capability. If the fine-tuned down/up-projection matrices were to exhibit this same property as the pre-trained backbone matrices, could the generalization capability of fine-tuned ViTs be further augmented? To address this question, we propose an Approximately Orthogonal Fine-Tuning (AOFT) strategy for representing the low-rank weight matrices. This strategy employs a single learnable vector to generate a set of approximately orthogonal vectors, which form the down/up-projection matrices, thereby aligning the properties of these matrices with those of the backbone. Extensive experimental results demonstrate that our method achieves competitive performance across a range of downstream image classification tasks, confirming the efficacy of the enhanced generalization capability embedded in the down/up-projection matrices.
中文: 本文提出近似正交微调方法,通过生成正交投影矩阵匹配主干网络特性,增强泛化能力并在图像分类任务中取得优异性能。
English: This paper proposes an Approximately Orthogonal Fine-Tuning (AOFT) method that generates orthogonal down/up-projection matrices to match the backbone's properties, enhancing generalization and achieving competitive performance in image classification tasks.
Authors:Jingyao Wang, Yiming Chen, Lingyu Si, Changwen Zheng
Abstract:
Scene understanding is one of the core tasks in computer vision, aiming to extract semantic information from images to identify objects, scene categories, and their interrelationships. Although advancements in Vision-Language Models (VLMs) have driven progress in this field, existing VLMs still face challenges in adaptation to unseen complex wide-area scenes. To address the challenges, this paper proposes a Hierarchical Coresets Selection (HCS) mechanism to advance the adaptation of VLMs in complex wide-area scene understanding. It progressively refines the selected regions based on the proposed theoretically guaranteed importance function, which considers utility, representativeness, robustness, and synergy. Without requiring additional fine-tuning, HCS enables VLMs to achieve rapid understandings of unseen scenes at any scale using minimal interpretable regions while mitigating insufficient feature density. HCS is a plug-and-play method that is compatible with any VLM. Experiments demonstrate that HCS achieves superior performance and universality in various tasks.
中文: 本文提出分层核心集选择机制,通过无需微调逐步选取最小可解释区域,提升视觉语言模型对复杂广域场景的适应性,在多项任务中实现优越性能。
English: This paper introduces a Hierarchical Coresets Selection (HCS) mechanism to enhance Vision-Language Models' adaptability for complex wide-area scene understanding by progressively selecting minimal interpretable regions without fine-tuning, achieving superior performance across tasks.
Authors:Jiaxiu Jiang, Wenbo Li, Jingjing Ren, Yuping Qiu, Yong Guo, Xiaogang Xu, Han Wu, Wangmeng Zuo
Abstract:
Despite recent advances in diffusion transformers (DiTs) for text-to-video generation, scaling to long-duration content remains challenging due to the quadratic complexity of self-attention. While prior efforts -- such as sparse attention and temporally autoregressive models -- offer partial relief, they often compromise temporal coherence or scalability. We introduce LoViC, a DiT-based framework trained on million-scale open-domain videos, designed to produce long, coherent videos through a segment-wise generation process. At the core of our approach is FlexFormer, an expressive autoencoder that jointly compresses video and text into unified latent representations. It supports variable-length inputs with linearly adjustable compression rates, enabled by a single query token design based on the Q-Former architecture. Additionally, by encoding temporal context through position-aware mechanisms, our model seamlessly supports prediction, retradiction, interpolation, and multi-shot generation within a unified paradigm. Extensive experiments across diverse tasks validate the effectiveness and versatility of our approach.
Chinese: LoViC是一种基于扩散变换器的框架,通过FlexFormer将视频和文本压缩为统一潜在表示,有效解决了自注意力二次复杂度带来的挑战,生成长且连贯的视频内容。
English: LoViC is a diffusion transformer framework that generates long, coherent videos by using FlexFormer to compress video and text into unified latent representations, overcoming the limitations of quadratic complexity in self-attention.
Authors:Hyeon Jeon, Jeongin Park, Soohyun Lee, Dae Hyun Kim, Sungbok Shin, Jinwook Seo
Abstract:
Selecting the appropriate dimensionality reduction (DR) technique and determining its optimal hyperparameter settings that maximize the accuracy of the output projections typically involves extensive trial and error, often resulting in unnecessary computational overhead. To address this challenge, we propose a dataset-adaptive approach to DR optimization guided by structural complexity metrics. These metrics quantify the intrinsic complexity of a dataset, predicting whether higher-dimensional spaces are necessary to represent it accurately. Since complex datasets are often inaccurately represented in two-dimensional projections, leveraging these metrics enables us to predict the maximum achievable accuracy of DR techniques for a given dataset, eliminating redundant trials in optimizing DR. We introduce the design and theoretical foundations of these structural complexity metrics. We quantitatively verify that our metrics effectively approximate the ground truth complexity of datasets and confirm their suitability for guiding dataset-adaptive DR workflow. Finally, we empirically show that our dataset-adaptive workflow significantly enhances the efficiency of DR optimization without compromising accuracy.
中文: 本研究提出了一种基于数据集结构复杂度度量的自适应降维优化方法,通过预测可达到的最高精度来避免冗余试验,在不影响性能的前提下显著提升了优化效率。
English: This study introduces a dataset-adaptive dimensionality reduction optimization method that uses structural complexity metrics to predict the maximum achievable accuracy, eliminating redundant trials and significantly improving efficiency without compromising performance.
Authors:Chi-en Amy Tai, Alexander Wong
Abstract:
Peptide de novo sequencing is a method used to reconstruct amino acid sequences from tandem mass spectrometry data without relying on existing protein sequence databases. Traditional deep learning approaches, such as Casanovo, mainly utilize autoregressive decoders and predict amino acids sequentially. Subsequently, they encounter cascading errors and fail to leverage high-confidence regions effectively. To address these issues, this paper investigates using diffusion decoders adapted for the discrete data domain. These decoders provide a different approach, allowing sequence generation to start from any peptide segment, thereby enhancing prediction accuracy. We experiment with three different diffusion decoder designs, knapsack beam search, and various loss functions. We find knapsack beam search did not improve performance metrics and simply replacing the transformer decoder with a diffusion decoder lowered performance. Although peptide precision and recall were still 0, the best diffusion decoder design with the DINOISER loss function obtained a statistically significant improvement in amino acid recall by 0.373 compared to the baseline autoregressive decoder-based Casanovo model. These findings highlight the potential of diffusion decoders to not only enhance model sensitivity but also drive significant advancements in peptide de novo sequencing.
中文: 本研究探索了扩散解码器在肽从头测序中的应用,发现尽管整体肽精度未变,但采用DINOISER损失函数的最佳扩散模型相比基线Casanovo模型将氨基酸召回率显著提升了0.373,展现了该方法在提高序列预测精度方面的潜力。
English: This study explores the use of diffusion decoders for peptide de novo sequencing, finding that while overall peptide precision remained unchanged, the best diffusion model with the DINOISER loss significantly improved amino acid recall by 0.373 compared to the baseline Casanovo model, demonstrating their potential for advancing sequence prediction accuracy.
Authors:Fei Zhao, Chonggang Lu, Yue Wang, Zheyong Xie, Ziyan Liu, Haofu Qian, JianZhao Huang, Fangcheng Shi, Zijie Meng, Hongcheng Guo, Mingqian He, Xinze Lyu, Yiming Lu, Ziyang Xiang, Zheyu Ye, Chengqiang Lu, Zhe Xu, Yi Wu, Yao Hu, Yan Gao, Jun Fan, Xiaolong Jiang, Weiting Liu, Boyang Wang, Shaosheng Cao
Abstract:
As a primary medium for modern information dissemination, social networking services (SNS) have experienced rapid growth, which has proposed significant challenges for platform content management and interaction quality improvement. Recently, the development of large language models (LLMs) has offered potential solutions but existing studies focus on isolated tasks, which not only encounter diminishing benefit from the data scaling within individual scenarios but also fail to flexibly adapt to diverse real-world context. To address these challenges, we introduce RedOne, a domain-specific LLM designed to break the performance bottleneck of single-task baselines and establish a comprehensive foundation for the SNS. RedOne was developed through a three-stage training strategy consisting of continue pretraining, supervised fine-tuning, and preference optimization, using a large-scale real-world dataset. Through extensive experiments, RedOne maintains strong general capabilities, and achieves an average improvement up to 14.02% across 8 major SNS tasks and 7.56% in SNS bilingual evaluation benchmark, compared with base models. Furthermore, through online testing, RedOne reduced the exposure rate in harmful content detection by 11.23% and improved the click page rate in post-view search by 14.95% compared with single-tasks finetuned baseline models. These results establish RedOne as a robust domain-specific LLM for SNS, demonstrating excellent generalization across various tasks and promising applicability in real-world scenarios.
中文摘要:RedOne是专为社交网络服务设计的领域大语言模型,通过三阶段训练策略在多项任务和实际应用中实现了显著性能提升。
English Summary: RedOne is a domain-specific large language model designed for social networking services that achieves significant performance improvements across multiple tasks and real-world applications through a three-stage training strategy.
Authors:Despina Konstantinidou, Dimitrios Karageorgiou, Christos Koutlis, Olga Papadopoulou, Emmanouil Schinas, Symeon Papadopoulos
Abstract:
The rapid advancement of generative technologies presents both unprecedented creative opportunities and significant challenges, particularly in maintaining social trust and ensuring the integrity of digital information. Following these concerns, the challenge of AI-Generated Image Detection (AID) becomes increasingly critical. As these technologies become more sophisticated, the quality of AI-generated images has reached a level that can easily deceive even the most discerning observers. Our systematic evaluation highlights a critical weakness in current AI-Generated Image Detection models: while they perform exceptionally well on controlled benchmark datasets, they struggle significantly with real-world variations. To assess this, we introduce ITW-SM, a new dataset of real and AI-generated images collected from major social media platforms. In this paper, we identify four key factors that influence AID performance in real-world scenarios: backbone architecture, training data composition, pre-processing strategies and data augmentation combinations. By systematically analyzing these components, we shed light on their impact on detection efficacy. Our modifications result in an average AUC improvement of 26.87% across various AID models under real-world conditions.
中文摘要:生成技术的快速发展对数字信息真实性构成挑战,当前AI生成图像检测模型在现实场景中表现不佳,通过系统优化关键因素实现了26.87%的平均AUC提升。
English Summary: Generative technology's rise poses challenges in digital trust, with AI-generated image detection models struggling in real-world scenarios despite strong benchmark performance, leading to a 26.87% AUC improvement through systematic adjustments.
Authors:Sangwoo Park, Jinheon Baek, Soyeong Jeong, Sung Ju Hwang
Abstract:
Scientific paper retrieval, particularly framed as document-to-document retrieval, aims to identify relevant papers in response to a long-form query paper, rather than a short query string. Previous approaches to this task have focused on abstracts, embedding them into dense vectors as surrogates for full documents and calculating similarity across them, although abstracts provide only sparse and high-level summaries. To address this, we propose PRISM, a novel document-to-document retrieval method that introduces multiple, fine-grained representations for both the query and candidate papers. In particular, each query paper is decomposed into multiple aspect-specific views and individually embedded, which are then matched against candidate papers similarity segmented to consider their multifaceted dimensions. Moreover, we present SciFullBench, a novel benchmark in which the complete and segmented context of full papers for both queries and candidates is available. Then, experimental results show that PRISM improves performance by an average of 4.3% over existing retrieval baselines.
中文: PRISM提出了一种新颖的文档到文档检索方法,通过为查询和候选论文创建多粒度表征,在新型SciFullBench基准测试中相比现有基线平均提升4.3%的性能。
English: PRISM introduces a novel document-to-document retrieval method using multiple fine-grained representations for query and candidate papers, improving performance by 4.3% over existing baselines as demonstrated on the new SciFullBench benchmark.
Authors:Xinyue Liu, Xiao Peng, Shuyue Yan, Yuntian Chen, Dongxiao Zhang, Zhixiao Niu, Hui-Min Wang, Xiaogang He
Abstract:
Observed records of climate extremes provide an incomplete picture of risk, missing "unseen" extremes that exceed historical bounds. In parallel, neglecting spatial dependence undervalues the risk of synchronized hazards that amplify impacts. To address these challenges, we develop DeepX-GAN (Dependence-Enhanced Embedding for Physical eXtremes - Generative Adversarial Network), a knowledge-informed deep generative model designed to better capture the spatial structure of rare extremes. The zero-shot generalizability of DeepX-GAN enables simulation of unseen extremes that fall outside historical experience yet remain statistically plausible. We define two types of unseen extremes: "checkmate" extremes that directly hit targets, and "stalemate" extremes that narrowly miss. These unrealized scenarios expose latent risks in fragile systems and may reinforce a false sense of resilience if overlooked. Near misses, in particular, can prompt either proactive adaptation or dangerous complacency, depending on how they are interpreted. Applying DeepX-GAN to the Middle East and North Africa (MENA), we find that these unseen extremes disproportionately affect regions with high vulnerability and low socioeconomic readiness, but differ in urgency and interpretation. Future warming could expand and redistribute these unseen extremes, with emerging exposure hotspots in Indo-Pakistan and Central Africa. This distributional shift highlights critical blind spots in conventional hazard planning and underscores the need to develop spatially adaptive policies that anticipate emergent risk hotspots rather than simply extrapolating from historical patterns.
中文: DeepX-GAN是一种知识引导的生成模型,能模拟超出历史记录的不可见极端气候事件,揭示中东和北非等脆弱地区的潜在风险,并强调需制定适应性政策以应对新兴风险热点。
English: DeepX-GAN is a knowledge-informed generative model that simulates unseen climate extremes beyond historical records, revealing latent risks in vulnerable regions like MENA and highlighting the need for adaptive policies to address emergent hotspots.
Authors:Chi-en Amy Tai, Pengyu Nie, Lukasz Golab, Alexander Wong
Abstract:
Studies show that large language models (LLMs) produce buggy code translations. One promising avenue to improve translation accuracy is through intermediate representations, which provide structured guidance for the translation process. We investigate whether LLM-based code translation can benefit from intermediate representations, specifically in the form of natural language (NL) summaries and abstract syntax trees (ASTs). Since prompt engineering greatly affects LLM performance, we consider several ways to integrate these representations, from one-shot to chain-of-thought (CoT) prompting. Using Open GPT4 8X7B and specialized StarCoder and CodeGen models on popular code translation benchmarks (CodeNet and AVATAR), we find that CoT with an intermediate NL summary performs best, with an increase of 13.8% and 6.7%, respectively, in successful translations for the best-performing model (Open GPT4 8X7B) compared to the zero-shot prompt.
中文:通过思维链提示生成的中间自然语言摘要能显著提升大语言模型的代码翻译准确率,相比零样本方法,最佳模型的翻译成功率最高可提升13.8%。
English: Intermediate natural language summaries using chain-of-thought prompting significantly improve LLM-based code translation accuracy, boosting success rates by up to 13.8% for the top-performing model compared to zero-shot approaches.
Authors:Davide Salvi, Viola Negroni, Sara Mandelli, Paolo Bestagini, Stefano Tubaro
Abstract:
Recent advances in generative AI have made the creation of speech deepfakes widely accessible, posing serious challenges to digital trust. To counter this, various speech deepfake detection strategies have been proposed, including Person-of-Interest (POI) approaches, which focus on identifying impersonations of specific individuals by modeling and analyzing their unique vocal traits. Despite their excellent performance, the existing methods offer limited granularity and lack interpretability. In this work, we propose a POI-based speech deepfake detection method that operates at the phoneme level. Our approach decomposes reference audio into phonemes to construct a detailed speaker profile. In inference, phonemes from a test sample are individually compared against this profile, enabling fine-grained detection of synthetic artifacts. The proposed method achieves comparable accuracy to traditional approaches while offering superior robustness and interpretability, key aspects in multimedia forensics. By focusing on phoneme analysis, this work explores a novel direction for explainable, speaker-centric deepfake detection.
中文:近期生成式AI的进步使得语音深度伪造技术广泛可用,对数字信任构成严重威胁,而现有检测方法缺乏精细度和可解释性;本研究提出了一种音素级的人物定向检测方法,通过将测试样本的音素与详细说话人特征进行比对,在保持相当准确性的同时实现了更优的鲁棒性和可解释性。
English: Recent generative AI advances have made speech deepfakes widely accessible, posing serious threats to digital trust, while current detection methods lack granularity and interpretability; this work proposes a phoneme-level Person-of-Interest detection approach that achieves comparable accuracy with superior robustness and interpretability by analyzing individual phonemes against detailed speaker profiles.
Authors:Subhajit Maity, Ayan Kumar Bhunia, Subhadeep Koley, Pinaki Nath Chowdhury, Aneeshan Sain, Yi-Zhe Song
Abstract:
Keypoint detection, integral to modern machine perception, faces challenges in few-shot learning, particularly when source data from the same distribution as the query is unavailable. This gap is addressed by leveraging sketches, a popular form of human expression, providing a source-free alternative. However, challenges arise in mastering cross-modal embeddings and handling user-specific sketch styles. Our proposed framework overcomes these hurdles with a prototypical setup, combined with a grid-based locator and prototypical domain adaptation. We also demonstrate success in few-shot convergence across novel keypoints and classes through extensive experiments.
中文摘要:该框架通过将草图作为无源替代方案,采用原型设置结合网格定位和领域自适应,解决了少样本关键点检测中的跨模态嵌入和用户特定风格等挑战。
English Summary: The proposed framework addresses few-shot keypoint detection challenges by using sketches as a source-free alternative, employing a prototypical setup with grid-based localization and domain adaptation to handle cross-modal embeddings and user-specific styles.
Authors:Yuqi Cheng, Yihan Sun, Hui Zhang, Weiming Shen, Yunkang Cao
Abstract:
In industrial point cloud analysis, detecting subtle anomalies demands high-resolution spatial data, yet prevailing benchmarks emphasize low-resolution inputs. To address this disparity, we propose a scalable pipeline for generating realistic and subtle 3D anomalies. Employing this pipeline, we developed MiniShift, the inaugural high-resolution 3D anomaly detection dataset, encompassing 2,577 point clouds, each with 500,000 points and anomalies occupying less than 1\% of the total. We further introduce Simple3D, an efficient framework integrating Multi-scale Neighborhood Descriptors (MSND) and Local Feature Spatial Aggregation (LFSA) to capture intricate geometric details with minimal computational overhead, achieving real-time inference exceeding 20 fps. Extensive evaluations on MiniShift and established benchmarks demonstrate that Simple3D surpasses state-of-the-art methods in both accuracy and speed, highlighting the pivotal role of high-resolution data and effective feature aggregation in advancing practical 3D anomaly detection.
Chinese: 针对工业点云异常检测中低分辨率数据的不足,本研究提出了包含细微异常的高分辨率数据集MiniShift和高效框架Simple3D,通过多尺度特征处理实现了卓越的检测精度与实时性能。
English: To address the limitations of low-resolution data in industrial point cloud anomaly detection, this study introduces MiniShift, a high-resolution dataset with subtle anomalies, and Simple3D, an efficient framework that achieves superior accuracy and real-time performance through multi-scale feature processing.
Authors:Haoyue Bai, Haoyu Wang, Nanxu Gong, Xinyuan Wang, Wangyang Ying, Haifeng Chen, Yanjie Fu
Abstract:
High responsiveness and economic efficiency are critical objectives in supply chain transportation, both of which are influenced by strategic decisions on shipping mode. An integrated framework combining an efficient simulator with an intelligent decision-making algorithm can provide an observable, low-risk environment for transportation strategy design. An ideal simulation-decision framework must (1) generalize effectively across various settings, (2) reflect fine-grained transportation dynamics, (3) integrate historical experience with predictive insights, and (4) maintain tight integration between simulation feedback and policy refinement. We propose Sim-to-Dec framework to satisfy these requirements. Specifically, Sim-to-Dec consists of a generative simulation module, which leverages autoregressive modeling to simulate continuous state changes, reducing dependence on handcrafted domain-specific rules and enhancing robustness against data fluctuations; and a history-future dual-aware decision model, refined iteratively through end-to-end optimization with simulator interactions. Extensive experiments conducted on three real-world datasets demonstrate that Sim-to-Dec significantly improves timely delivery rates and profit.
中文:Sim-to-Dec框架通过生成式模拟和双感知决策模型,在可观测环境中优化运输策略,显著提高供应链的及时交付率和利润。
English: The Sim-to-Dec framework integrates generative simulation and dual-aware decision-making to enhance supply chain responsiveness and profitability by generalizing across settings and refining strategies through iterative optimization.
Authors:Fengran Mo, Yifan Gao, Chuan Meng, Xin Liu, Zhuofeng Wu, Kelong Mao, Zhengyang Wang, Pei Chen, Zheng Li, Xian Li, Bing Yin, Meng Jiang
Abstract:
The rapid advancement of conversational search systems revolutionizes how information is accessed by enabling the multi-turn interaction between the user and the system. Existing conversational search systems are usually built with two different models. This separation restricts the system from leveraging the intrinsic knowledge of the models simultaneously, which cannot ensure the effectiveness of retrieval benefiting the generation. The existing studies for developing unified models cannot fully address the aspects of understanding conversational context, managing retrieval independently, and generating responses. In this paper, we explore how to unify dense retrieval and response generation for large language models in conversation. We conduct joint fine-tuning with different objectives and design two mechanisms to reduce the inconsistency risks while mitigating data discrepancy. The evaluations on five conversational search datasets demonstrate that our unified model can mutually improve both tasks and outperform the existing baselines.
中文: 本研究提出了一种统一模型,通过联合微调对话搜索中的密集检索和响应生成,在多个数据集上提升了两个任务的性能,并超越了现有基准。
English: This study introduces a unified model that jointly fine-tunes dense retrieval and response generation in conversational search, improving both tasks and outperforming existing baselines across multiple datasets.
Authors:Qilong Xing, Zikai Song, Bingxin Gong, Lian Yang, Junqing Yu, Wei Yang
Abstract:
Accurate prognosis of non-small cell lung cancer (NSCLC) patients undergoing immunotherapy is essential for personalized treatment planning, enabling informed patient decisions, and improving both treatment outcomes and quality of life. However, the lack of large, relevant datasets and effective multi-modal feature fusion strategies pose significant challenges in this domain. To address these challenges, we present a large-scale dataset and introduce a novel framework for multi-modal feature fusion aimed at enhancing the accuracy of survival prediction. The dataset comprises 3D CT images and corresponding clinical records from NSCLC patients treated with immune checkpoint inhibitors (ICI), along with progression-free survival (PFS) and overall survival (OS) data. We further propose a cross-modality masked learning approach for medical feature fusion, consisting of two distinct branches, each tailored to its respective modality: a Slice-Depth Transformer for extracting 3D features from CT images and a graph-based Transformer for learning node features and relationships among clinical variables in tabular data. The fusion process is guided by a masked modality learning strategy, wherein the model utilizes the intact modality to reconstruct missing components. This mechanism improves the integration of modality-specific features, fostering more effective inter-modality relationships and feature interactions. Our approach demonstrates superior performance in multi-modal integration for NSCLC survival prediction, surpassing existing methods and setting a new benchmark for prognostic models in this context.
中文: 本研究提出了一种新颖的跨模态掩码学习框架和大规模数据集,通过整合三维CT影像与临床记录,利用切片深度Transformer和图结构Transformer分别处理不同模态数据,显著提升了非小细胞肺癌免疫治疗患者的生存预测准确性,其性能超越现有方法。
English: This study introduces a novel cross-modality masked learning framework and a large-scale dataset integrating 3D CT scans with clinical records to significantly improve survival prediction accuracy for NSCLC patients undergoing immunotherapy, outperforming existing methods through enhanced multi-modal feature fusion.
Authors:Qilong Xing, Zikai Song, Youjia Zhang, Na Feng, Junqing Yu, Wei Yang
Abstract:
Despite significant advancements in adapting Large Language Models (LLMs) for radiology report generation (RRG), clinical adoption remains challenging due to difficulties in accurately mapping pathological and anatomical features to their corresponding text descriptions. Additionally, semantic agnostic feature extraction further hampers the generation of accurate diagnostic reports. To address these challenges, we introduce Medical Concept Aligned Radiology Report Generation (MCA-RG), a knowledge-driven framework that explicitly aligns visual features with distinct medical concepts to enhance the report generation process. MCA-RG utilizes two curated concept banks: a pathology bank containing lesion-related knowledge, and an anatomy bank with anatomical descriptions. The visual features are aligned with these medical concepts and undergo tailored enhancement. We further propose an anatomy-based contrastive learning procedure to improve the generalization of anatomical features, coupled with a matching loss for pathological features to prioritize clinically relevant regions. Additionally, a feature gating mechanism is employed to filter out low-quality concept features. Finally, the visual features are corresponding to individual medical concepts, and are leveraged to guide the report generation process. Experiments on two public benchmarks (MIMIC-CXR and CheXpert Plus) demonstrate that MCA-RG achieves superior performance, highlighting its effectiveness in radiology report generation.
中文摘要:MCA-RG框架通过病理库和解剖库将视觉特征与医学概念对齐,采用对比学习和特征门控机制提升放射学报告生成质量,在公开基准测试中表现优异。
English Summary: MCA-RG is a knowledge-driven framework that aligns visual features with medical concepts using pathology and anatomy banks, enhancing radiology report generation through contrastive learning and feature gating, achieving superior performance on public benchmarks.
Authors:Liqiang Jing, Viet Lai, Seunghyun Yoon, Trung Bui, Xinya Du
Abstract:
Video Multimodal Large Language Models (VideoMLLMs) have achieved remarkable progress in both Video-to-Text and Text-to-Video tasks. However, they often suffer fro hallucinations, generating content that contradicts the visual input. Existing evaluation methods are limited to one task (e.g., V2T) and also fail to assess hallucinations in open-ended, free-form responses. To address this gap, we propose FIFA, a unified FaIthFulness evAluation framework that extracts comprehensive descriptive facts, models their semantic dependencies via a Spatio-Temporal Semantic Dependency Graph, and verifies them using VideoQA models. We further introduce Post-Correction, a tool-based correction framework that revises hallucinated content. Extensive experiments demonstrate that FIFA aligns more closely with human judgment than existing evaluation methods, and that Post-Correction effectively improves factual consistency in both text and video generation.
中文:FIFA是一个统一的评估框架,通过建模语义依赖关系并利用VideoQA验证事实,有效评估和修正VideoMLLMs中的幻觉问题,实验表明其更贴近人类判断并能显著提升生成内容的真实性。
English: FIFA is a unified framework for evaluating and correcting hallucinations in VideoMLLMs by modeling semantic dependencies and verifying facts with VideoQA, showing superior alignment with human judgment and improved factual consistency.
Authors:Zicong Tang, Shi Luohe, Zuchao Li, Baoyuan Qi, Guoming Liu, Lefei Zhang, Ping Wang
Abstract:
Large Language Models (LLMs) have achieved impressive accomplishments in recent years. However, the increasing memory consumption of KV cache has possessed a significant challenge to the inference system. Eviction methods have revealed the inherent redundancy within the KV cache, demonstrating its potential for reduction, particularly in deeper layers. However, KV cache reduction for shallower layers has been found to be insufficient. Based on our observation that, the KV cache exhibits a high degree of similarity. Based on this observation, we proposed a novel KV cache reduction method, SpindleKV, which balances both shallow and deep layers. For deep layers, we employ an attention weight based eviction method, while for shallow layers, we apply a codebook based replacement approach which is learnt by similarity and merging policy. Moreover, SpindleKV addressed the Grouped-Query Attention (GQA) dilemma faced by other attention based eviction methods. Experiments on two common benchmarks with three different LLMs shown that SpindleKV obtained better KV cache reduction effect compared to baseline methods, while preserving similar or even better model performance.
中文: SpindleKV是一种新颖的KV缓存缩减方法,通过分别对浅层和深层采用基于码本的替换和基于注意力权重的淘汰策略,在保持模型性能的同时有效解决了GQA困境。
English: SpindleKV is a novel method that effectively reduces KV cache in both shallow and deep layers of LLMs by employing codebook-based replacement and attention-weight eviction, respectively, while maintaining model performance and addressing the GQA challenge.
Authors:Zhaoze Sun, Qiyan Deng, Chengliang Chai, Kaisen Jin, Xinyu Guo, Han Han, Ye Yuan, Guoren Wang, Lei Cao
Abstract:
Most recently, researchers have started building large language models (LLMs) powered data systems that allow users to analyze unstructured text documents like working with a database because LLMs are very effective in extracting attributes from documents. In such systems, LLM-based extraction operations constitute the performance bottleneck of query execution due to the high monetary cost and slow LLM inference. Existing systems typically borrow the query optimization principles popular in relational databases to produce query execution plans, which unfortunately are ineffective in minimizing LLM cost. To fill this gap, we propose QUEST, which features a bunch of novel optimization strategies for unstructured document analysis. First, we introduce an index-based strategy to minimize the cost of each extraction operation. With this index, QUEST quickly retrieves the text segments relevant to the target attributes and only feeds them to LLMs. Furthermore, we design an evidence-augmented retrieval strategy to reduce the possibility of missing relevant segments. Moreover, we develop an instance-optimized query execution strategy: because the attribute extraction cost could vary significantly document by document, QUEST produces different plans for different documents. For each document, QUEST produces a plan to minimize the frequency of attribute extraction. The innovations include LLM cost-aware operator ordering strategies and an optimized join execution approach that transforms joins into filters. Extensive experiments on 3 real-world datasets demonstrate the superiority of QUEST, achieving 30%-6x cost savings while improving the F1 score by 10% -27% compared with state-of-the-art baselines.
中文摘要:研究者提出了QUEST系统,通过采用索引策略、证据增强检索和实例优化查询执行等创新方法,有效降低了大型语言模型在非结构化文本分析中的使用成本和频率,同时显著提升了处理精度。
English Summary: Researchers have introduced QUEST, a novel system that optimizes query execution for unstructured text analysis by employing cost-aware strategies like indexing, evidence-augmented retrieval, and instance-specific plans to significantly reduce LLM usage and costs while improving accuracy.
Authors:Yudan Song, Yuecen Wei, Yuhang Lu, Qingyun Sun, Minglai Shao, Li-e Wang, Chunming Hu, Xianxian Li, Xingcheng Fu
Abstract:
Graph representation learning has become a mainstream method for fraud detection due to its strong expressive power, which focuses on enhancing node representations through improved neighborhood knowledge capture. However, the focus on local interactions leads to imbalanced transmission of global topological information and increased risk of node-specific information being overwhelmed during aggregation due to the imbalance between fraud and benign nodes. In this paper, we first summarize the impact of topology and class imbalance on downstream tasks in GNN-based fraud detection, as the problem of imbalanced supervisory messages is caused by fraudsters' topological behavior obfuscation and identity feature concealment. Based on statistical validation, we propose a novel dual-view graph representation learning method to mitigate Message imbalance in Fraud Detection (MimbFD). Specifically, we design a topological message reachability module for high-quality node representation learning to penetrate fraudsters' camouflage and alleviate insufficient propagation. Then, we introduce a local confounding debiasing module to adjust node representations, enhancing the stable association between node representations and labels to balance the influence of different classes. Finally, we conducted experiments on three public fraud datasets, and the results demonstrate that MimbFD exhibits outstanding performance in fraud detection.
中文: 图表示学习在欺诈检测中面临信息传播和类别不平衡的挑战,本文提出的MimbFD方法通过双视角学习增强节点表征和分类稳定性,有效提升了检测性能。
English: Graph representation learning for fraud detection faces challenges from imbalanced message propagation and class distribution, which the proposed MimbFD method addresses through dual-view learning to enhance node representation and classification stability.
Authors:Yixiang Chen, Peiyan Li, Yan Huang, Jiabing Yang, Kehan Chen, Liang Wang
Abstract:
Current language-guided robotic manipulation systems often require low-level action-labeled datasets for imitation learning. While object-centric flow prediction methods mitigate this issue, they remain limited to scenarios involving rigid objects with clear displacement and minimal occlusion. In this work, we present Embodiment-Centric Flow (EC-Flow), a framework that directly learns manipulation from action-unlabeled videos by predicting embodiment-centric flow. Our key insight is that incorporating the embodiment's inherent kinematics significantly enhances generalization to versatile manipulation scenarios, including deformable object handling, occlusions, and non-object-displacement tasks. To connect the EC-Flow with language instructions and object interactions, we further introduce a goal-alignment module by jointly optimizing movement consistency and goal-image prediction. Moreover, translating EC-Flow to executable robot actions only requires a standard robot URDF (Unified Robot Description Format) file to specify kinematic constraints across joints, which makes it easy to use in practice. We validate EC-Flow on both simulation (Meta-World) and real-world tasks, demonstrating its state-of-the-art performance in occluded object handling (62% improvement), deformable object manipulation (45% improvement), and non-object-displacement tasks (80% improvement) than prior state-of-the-art object-centric flow methods. For more information, see our project website at https://ec-flow1.github.io .
中文: 本文提出EC-Flow框架,通过预测具身中心流从无标注视频中学习机器人操作,在处理遮挡、可变形物体和非位移任务方面相比现有方法实现了显著性能提升。
English: This paper introduces EC-Flow, a framework that learns robotic manipulation from unlabeled videos by predicting embodiment-centric flow, achieving significant improvements in handling occlusions, deformable objects, and non-displacement tasks compared to prior methods.
Authors:Haoxi Li, Sikai Bai, Jie Zhang, Song Guo
Abstract:
Large reasoning models (LRMs) have demonstrated impressive capabilities in domains like mathematics and program synthesis. Despite their strong performance, LRMs often exhibit overthinking -- excessive and redundant reasoning steps that introduce inefficiencies during inference. This phenomenon raises an important question for LRM self-evaluation: How can a model autonomously assess the correctness of its own reasoning trajectory without external labels? To address this, we propose Chain-of-Reasoning Embedding (CoRE), a series of hidden states in latent space to enable label-free self-evaluation on intermediate reasoning steps of LRMs, so as to enhance metacognition abilities for improved reasoning efficiency. By analyzing the geometric properties of the CoRE trajectories, we reveal that redundant reasoning usually presents cyclical fluctuations, which correspond to repetitive and unconscious reflection/exploration. Leveraging this insight, we further introduce a training-free, label-free self-evaluation framework, CoRE-Eval, to detect such patterns and dynamically determine whether to terminate reasoning early. Extensive experiments on mathematical reasoning benchmarks (GSM8K, MATH-500, and AIME) and across model sizes from 7B to 32B demonstrate that CoRE-Eval reduces chain-of-thought length by 13.7% to 33.2% while improving answer accuracy by around 10%, achieving 70.0% accuracy on the challenging AIME benchmark with the 32B model.
大型推理模型常因过度思考产生冗余步骤,因此提出的CoRE方法通过分析潜在空间中的隐藏状态轨迹实现无标签自评估,检测循环模式以提前终止推理,在多个基准测试中提升了效率和准确性。
Large reasoning models often overthink with redundant steps, so the proposed CoRE method enables label-free self-evaluation by analyzing hidden state trajectories to detect cyclical patterns and terminate reasoning early, improving efficiency and accuracy across benchmarks.
Authors:Haiwen Li, Delong Liu, Zhaohui Hou, Zhicheng Zhao, Fei Su
Abstract:
As a challenging vision-language (VL) task, Composed Image Retrieval (CIR) aims to retrieve target images using multimodal (image+text) queries. Although many existing CIR methods have attained promising performance, their reliance on costly, manually labeled triplets hinders scalability and zero-shot capability. To address this issue, we propose a scalable pipeline for automatic triplet generation, along with a fully synthetic dataset named Composed Image Retrieval on High-quality Synthetic Triplets (CIRHS). Our pipeline leverages a large language model (LLM) to generate diverse prompts, controlling a text-to-image generative model to produce image pairs with identical elements in each pair, which are then filtered and reorganized to form the CIRHS dataset. In addition, we introduce Hybrid Contextual Alignment (CoAlign), a novel CIR framework, which can accomplish global alignment and local reasoning within a broader context, enabling the model to learn more robust and informative representations. By utilizing the synthetic CIRHS dataset, CoAlign achieves outstanding zero-shot performance on three commonly used benchmarks, demonstrating for the first time the feasibility of training CIR models on a fully synthetic dataset. Furthermore, under supervised training, our method outperforms all the state-of-the-art supervised CIR approaches, validating the effectiveness of our proposed retrieval framework. The code and the CIRHS dataset will be released soon.
中文: 本文提出了一个可扩展的合成三元组生成流程构建CIRHS数据集,并设计了混合上下文对齐(CoAlign)框架,通过全合成数据训练实现了组合图像检索的卓越零样本和监督性能,首次验证了合成数据训练CIR模型的可行性。
English: This paper introduces a scalable pipeline for generating synthetic triplets to create the CIRHS dataset and proposes the Hybrid Contextual Alignment (CoAlign) framework, which achieves state-of-the-art zero-shot and supervised performance in composed image retrieval by learning robust representations from fully synthetic data.
Authors:Wenkang Zhang, Yan Zhao, Qiang Wang, Li Song, Zhengxue Cheng
Abstract:
Free-viewpoint video (FVV) enables immersive 3D experiences, but efficient compression of dynamic 3D representations remains a major challenge. Recent advances in 3D Gaussian Splatting (3DGS) and its dynamic extensions have enabled high-fidelity scene modeling. However, existing methods often couple scene reconstruction with optimization-dependent coding, which limits generalizability. This paper presents Feedforward Compression of Dynamic Gaussian Splatting (D-FCGS), a novel feedforward framework for compressing temporally correlated Gaussian point cloud sequences. Our approach introduces a Group-of-Frames (GoF) structure with I-P frame coding, where inter-frame motions are extracted via sparse control points. The resulting motion tensors are compressed in a feedforward manner using a dual prior-aware entropy model that combines hyperprior and spatial-temporal priors for accurate rate estimation. For reconstruction, we perform control-point-guided motion compensation and employ a refinement network to enhance view-consistent fidelity. Trained on multi-view video-derived Gaussian frames, D-FCGS generalizes across scenes without per-scene optimization. Experiments show that it matches the rate-distortion performance of optimization-based methods, achieving over 40 times compression in under 2 seconds while preserving visual quality across viewpoints. This work advances feedforward compression for dynamic 3DGS, paving the way for scalable FVV transmission and storage in immersive applications.
中文: 本文提出D-FCGS框架,通过前馈压缩动态高斯点云序列,无需逐场景优化即可保持多视角视觉质量,实现了超40倍压缩率。
English: This paper introduces D-FCGS, a feedforward compression framework for dynamic Gaussian splatting that achieves high compression ratios while maintaining visual quality across viewpoints without requiring per-scene optimization.
Authors:Zekai Sun, Xiuxian Guan, Zheng Lin, Zihan Fang, Xiangming Cai, Zhe Chen, Fangming Liu, Heming Cui, Jie Xiong, Wei Ni, Chau Yuen
Abstract:
Deploying deep neural networks (DNNs) on resource-constrained mobile devices presents significant challenges, particularly in achieving real-time performance while simultaneously coping with limited computational resources and battery life. While Mobile Edge Computing (MEC) offers collaborative inference with GPU servers as a promising solution, existing approaches primarily rely on layer-wise model partitioning and undergo significant transmission bottlenecks caused by the sequential execution of DNN operations. To address this challenge, we present Intra-DP, a high-performance collaborative inference system optimized for DNN inference on MEC. Intra DP employs a novel parallel computing technique based on local operators (i.e., operators whose minimum unit input is not the entire input tensor, such as the convolution kernel). By decomposing their computations (operations) into several independent sub-operations and overlapping the computation and transmission of different sub-operations through parallel execution, Intra-DP mitigates transmission bottlenecks in MEC, achieving fast and energy-efficient inference. The evaluation demonstrates that Intra-DP reduces per-inference latency by up to 50% and energy consumption by up to 75% compared to state-of-the-art baselines, without sacrificing accuracy.
中文: Intra-DP系统通过并行化局部算子计算来克服移动边缘计算中的传输瓶颈,在不影响准确性的前提下,将推理延迟降低高达50%,能耗减少高达75%。
English: Intra-DP is a collaborative inference system that overcomes transmission bottlenecks in Mobile Edge Computing by parallelizing local operator computations, reducing latency by up to 50% and energy consumption by 75% without compromising accuracy.
Authors:Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
Abstract:
Reward hacking in Reinforcement Learning (RL) systems poses a critical threat to the deployment of autonomous agents, where agents exploit flaws in reward functions to achieve high scores without fulfilling intended objectives. Despite growing awareness of this problem, systematic detection and mitigation approaches remain limited. This paper presents a large-scale empirical study of reward hacking across diverse RL environments and algorithms. We analyze 15,247 training episodes across 15 RL environments (Atari, MuJoCo, custom domains) and 5 algorithms (PPO, SAC, DQN, A3C, Rainbow), implementing automated detection algorithms for six categories of reward hacking: specification gaming, reward tampering, proxy optimization, objective misalignment, exploitation patterns, and wireheading. Our detection framework achieves 78.4% precision and 81.7% recall across environments, with computational overhead under 5%. Through controlled experiments varying reward function properties, we demonstrate that reward density and alignment with true objectives significantly impact hacking frequency ($p < 0.001$, Cohen's $d = 1.24$). We validate our approach through three simulated application studies representing recommendation systems, competitive gaming, and robotic control scenarios. Our mitigation techniques reduce hacking frequency by up to 54.6% in controlled scenarios, though we find these trade-offs are more challenging in practice due to concept drift, false positive costs, and adversarial adaptation. All detection algorithms, datasets, and experimental protocols are publicly available to support reproducible research in RL safety.
中文摘要:本文对强化学习中的奖励攻击问题进行了大规模实证研究,开发了自动化检测方法并在多种环境中实现了高精度检测,同时提出的缓解技术使攻击频率最高降低54.6%。
English Summary: This paper conducts a large-scale empirical study on reward hacking in reinforcement learning, developing automated detection methods that achieve high precision and recall across multiple environments while demonstrating mitigation techniques that reduce hacking frequency by up to 54.6%.
Authors:Yong Zhang, Feng Liang, Guanghu Yuan, Min Yang, Chengming Li, Xiping Hu
Abstract:
Federated learning (FL) enables collaborative training of a global model in the centralized server with data from multiple parties while preserving privacy. However, data heterogeneity can significantly degrade the performance of the global model when each party uses datasets from different sources to train a local model, thereby affecting personalized local models. Among various cases of data heterogeneity, feature drift, feature space difference among parties, is prevalent in real-life data but remains largely unexplored. Feature drift can distract feature extraction learning in clients and thus lead to poor feature extraction and classification performance. To tackle the problem of feature drift in FL, we propose FedPall, an FL framework that utilizes prototype-based adversarial learning to unify feature spaces and collaborative learning to reinforce class information within the features. Moreover, FedPall leverages mixed features generated from global prototypes and local features to enhance the global classifier with classification-relevant information from a global perspective. Evaluation results on three representative feature-drifted datasets demonstrate FedPall's consistently superior performance in classification with feature-drifted data in the FL scenario.
中文摘要:联邦学习面临特征漂移导致的性能下降问题,FedPall框架通过基于原型的对抗学习和协作特征增强,有效提升了分类表现。
English Summary: Federated learning faces performance degradation from feature drift, which FedPall addresses through prototype-based adversarial learning and collaborative feature enhancement to improve classification.
Authors:Dong Zhang, Lin Li, Ming Li, Xiaohui Tao, Meng Sun, Jimmy Xiangji Huang
Abstract:
Existing solutions for bundle recommendation(BR) have achieved remarkable effectiveness for predicting the user's preference for prebuilt bundles. However, bundle-item(B-I) affiliation will vary dynamically in real scenarios. For example, a bundle themed as 'casual outfit', may add 'hat' or remove 'watch' due to factors such as seasonal variations, changes in user pes or inventory adjustments. Our empirical study demonstrates that the performance of mainstream BR models will fluctuate or even decline regarding item-level variability. This paper makes the first attempt to referencaddress the above problem and proposes a novel Residual Diffusion for Bundle Recommendation(RDiffBR) as a model-agnostic generative framework which can assist a BR model in adapting this scenario. During the initial training of the BR model, RDiffBR employs a residual diffusion model to process the item-level bundle embeddings which are generated by BR model to represent bundle theme via a forward-reverse process. In the inference stage, RDiffBR reverses item-level bundle embeddings obtained by the well-trained bundle model under B-I variability scenarios to generate the effective item-level bundle embeddings. In particular, the residual connection in our residual approximator significantly enhances item-level bundle embeddings generation ability of BR models. Experiments on six BR models and four public datasets from different domains show that RDiffBR improves the performance of Recall and NDCG of backbone BR models by up to 23%, while only increased training time about 4%.Codes and datasets are available at https://anonymous.4open.science/r/RDiffBR.
中文: 本文提出RDiffBR,一种模型无关的生成框架,通过残差扩散技术动态适应物品级变化来增强捆绑推荐,在仅增加约4%训练时间的情况下将主流模型的召回率和NDCG提升最高达23%。
English: This paper introduces RDiffBR, a model-agnostic generative framework that enhances bundle recommendation by adapting to dynamic item-level changes through residual diffusion, improving recall and NDCG by up to 23% with minimal training overhead.
Authors:Jiaqi Xue, Yifei Zhao, Mengxin Zheng, Fan Yao, Yan Solihin, Qian Lou
Abstract:
Recent advances in Transformer models, e.g., large language models (LLMs), have brought tremendous breakthroughs in various artificial intelligence (AI) tasks, leading to their wide applications in many security-critical domains. Due to their unprecedented scale and prohibitively high development cost, these models have become highly valuable intellectual property for AI stakeholders and are increasingly deployed via machine learning as a service (MLaaS). However, MLaaS often runs on untrusted cloud infrastructure, exposing data and models to potential breaches. Mainstream protection mechanisms leverage trusted execution environments (TEEs) where confidentiality and integrity for secretive data are shielded using hardware-based encryption and integrity checking. Unfortunately, running model inference entirely within TEEs is subject to non-trivial slowdown, which is further exacerbated in LLMs due to the substantial computation and memory footprint involved. Recent studies reveal that the hybrid TEE-based scheme offloading partial model inference operations to the untrusted accelerators (e.g., GPU) is a promising solution. However, prior offloading schemes fail to ensure dual protection of data and model in Transformer inference, as they cannot securely offload critical operations, i.e., Attention and SoftMax, forcing these computations to remain confined within TEEs. To address these challenges, we propose TwinShield, a framework enabling secure Transformer inference in heterogeneous TEE and accelerator systems with dual protection for both model and data. TwinShield offloads ~87% of computation to GPUs and delivers 4.0x - 6.1x speedups over previous approaches across various Transformer models.
中文: 近期Transformer模型的进步使其广泛应用于安全关键领域,但通过MLaaS在不可信基础设施上的部署存在风险,TwinShield通过安全地将计算卸载到GPU,实现了双重保护并显著提升了速度。
English: Recent advances in Transformer models have led to their widespread use in security-critical domains, but their deployment via MLaaS on untrusted infrastructure poses risks, which TwinShield addresses by securely offloading computations to GPUs for dual protection and significant speed improvements.
Authors:Junyu Wang, Tianrui Wang, Meng Ge, Longbiao Wang, Jianwu Dang
Abstract:
In recent advancements in audio self-supervised representation learning, the standard Transformer architecture has emerged as the predominant approach, yet its attention mechanism often allocates a portion of attention weights to irrelevant information, potentially impairing the model's discriminative ability. To address this, we introduce a differential attention mechanism, which effectively mitigates ineffective attention allocation through the integration of dual-softmax operations and appropriately tuned differential coefficients. Experimental results demonstrate that our ASDA model achieves state-of-the-art (SOTA) performance across multiple benchmarks, including audio classification (49.0% mAP on AS-2M, 41.5% mAP on AS20K), keyword spotting (98.3% accuracy on SPC-2), and environmental sound classification (96.1% accuracy on ESC-50). These results highlight ASDA's effectiveness in audio tasks, paving the way for broader applications.
中文: ASDA模型采用差分注意力机制,有效解决了标准Transformer中注意力分散的问题,在音频分类、关键词识别和环境声音分类任务中均取得了领先性能。
English: The ASDA model introduces a differential attention mechanism that overcomes irrelevant information allocation in standard Transformers, achieving state-of-the-art results in audio classification, keyword spotting, and environmental sound recognition.
Authors:Yongjiang Liu, Haoxi Li, Xiaosong Ma, Jie Zhang, Song Guo
Abstract:
Recent Large Reasoning Models (LRMs) excel at complex reasoning tasks but often suffer from overthinking, generating overly long and redundant reasoning trajectories. To explore its essence, our empirical analysis reveals that LRMs are primarily limited to recognizing task properties (i.e., difficulty levels) like humans before solving the problem, leading to a one-size-fits-all reasoning process. Inspired by this, a pressing and natural question emerges: Can we explicitly bootstrap such ability to alleviate overthinking in LRMs? In this paper, we propose Think-How-to-Think (TH2T), a novel two-stage fine-tuning strategy that progressively inspires LRMs' difficulty cognition and redundancy cognition of LRMs. Specifically, we first inject difficulty hypnosis into output prefixes to guide the model toward adaptive reasoning depth, trained on a hybrid dataset mixing short and long reasoning paths. Then, we incorporate redundancy hypnosis, which supervises the intermediate reasoning steps to identify and eliminate unnecessary reasoning patterns. Experiments on 7B/14B/32B models demonstrate that TH2T significantly reduces inference costs by over 70% on easy tasks and 40% on hard tasks while maintaining performance stability. The resulting outputs exhibit clear signs of difficulty-aware capabilities and reduced redundancy (e.g., reflection and looping).
中文摘要:提出的Think-How-to-Think(TH2T)方法通过两阶段微调,使大型推理模型具备任务难度认知与冗余推理识别能力,在保持性能稳定的同时,将简单任务推理成本降低超70%,显著缓解过度思考问题。
English Summary: The proposed Think-How-to-Think (TH2T) method addresses overthinking in Large Reasoning Models by fine-tuning them to recognize task difficulty and eliminate redundant reasoning, achieving up to 70% cost reduction on easy tasks while maintaining performance.
Authors:Yongjiang Liu, Haoxi Li, Xiaosong Ma, Jie Zhang, Song Guo
Abstract:
Recent Large Reasoning Models (LRMs) excel at complex reasoning tasks but often suffer from overthinking, generating overly long and redundant reasoning trajectories. To explore its essence, our empirical analysis reveals that LRMs are primarily limited to recognizing task properties (i.e., difficulty levels) like humans before solving the problem, leading to a one-size-fits-all reasoning process. Inspired by this, a pressing and natural question emerges: Can we explicitly bootstrap such ability to alleviate overthinking in LRMs? In this paper, we propose Think-How-to-Think (TH2T), a novel two-stage fine-tuning strategy that progressively inspires LRMs' difficulty cognition and redundancy cognition of LRMs. Specifically, we first inject difficulty hypnosis into output prefixes to guide the model toward adaptive reasoning depth, trained on a hybrid dataset mixing short and long reasoning paths. Then, we incorporate redundancy hypnosis, which supervises the intermediate reasoning steps to identify and eliminate unnecessary reasoning patterns. Experiments on 7B/14B/32B models demonstrate that TH2T significantly reduces inference costs by over 70% on easy tasks and 40% on hard tasks while maintaining performance stability. The resulting outputs exhibit clear signs of difficulty-aware capabilities and reduced redundancy (e.g., reflection and looping).
中文摘要:提出的Think-How-to-Think(TH2T)方法通过两阶段微调,使大型推理模型具备任务难度认知与冗余推理识别能力,在保持性能稳定的同时,将简单任务推理成本降低超70%,显著缓解过度思考问题。
English Summary: The proposed Think-How-to-Think (TH2T) method addresses overthinking in Large Reasoning Models by fine-tuning them to recognize task difficulty and eliminate redundant reasoning, achieving up to 70% cost reduction on easy tasks while maintaining performance.
Authors:Zeyu Lei, Hongyuan Yu, Jinlin Wu, Zhen Chen
Abstract:
Precise surgical interventions are vital to patient safety, and advanced enhancement algorithms have been developed to assist surgeons in decision-making. Despite significant progress, these algorithms are typically designed for single tasks in specific scenarios, limiting their effectiveness in complex real-world situations. To address this limitation, we propose SurgVisAgent, an end-to-end intelligent surgical vision agent built on multimodal large language models (MLLMs). SurgVisAgent dynamically identifies distortion categories and severity levels in endoscopic images, enabling it to perform a variety of enhancement tasks such as low-light enhancement, overexposure correction, motion blur elimination, and smoke removal. Specifically, to achieve superior surgical scenario understanding, we design a prior model that provides domain-specific knowledge. Additionally, through in-context few-shot learning and chain-of-thought (CoT) reasoning, SurgVisAgent delivers customized image enhancements tailored to a wide range of distortion types and severity levels, thereby addressing the diverse requirements of surgeons. Furthermore, we construct a comprehensive benchmark simulating real-world surgical distortions, on which extensive experiments demonstrate that SurgVisAgent surpasses traditional single-task models, highlighting its potential as a unified solution for surgical assistance.
中文: SurgVisAgent是一种基于多模态大语言模型的端到端智能手术视觉代理,能动态识别内窥镜图像中的多种失真类型和程度,并通过领域知识和小样本学习实现定制化增强,超越传统单任务模型。
English: SurgVisAgent is an end-to-end surgical vision agent based on multimodal large language models that dynamically identifies and corrects various image distortions in endoscopic procedures, outperforming single-task models through domain-specific knowledge and adaptive reasoning.
Authors:Jiyeon Bae, Hyeon Jeon, Jinwook Seo
Abstract:
Evaluating the accuracy of dimensionality reduction (DR) projections in preserving the structure of high-dimensional data is crucial for reliable visual analytics. Diverse evaluation metrics targeting different structural characteristics have thus been developed. However, evaluations of DR projections can become biased if highly correlated metrics--those measuring similar structural characteristics--are inadvertently selected, favoring DR techniques that emphasize those characteristics. To address this issue, we propose a novel workflow that reduces bias in the selection of evaluation metrics by clustering metrics based on their empirical correlations rather than on their intended design characteristics alone. Our workflow works by computing metric similarity using pairwise correlations, clustering metrics to minimize overlap, and selecting a representative metric from each cluster. Quantitative experiments demonstrate that our approach improves the stability of DR evaluation, which indicates that our workflow contributes to mitigating evaluation bias.
中文: 本研究提出了一种新流程,通过聚类相关度量指标并从每个簇中选取代表指标来减少降维评估中的偏差,从而提升评估的稳定性。
English: This study introduces a novel workflow to reduce bias in dimensionality reduction evaluation by clustering correlated metrics and selecting representatives from each cluster, thereby improving evaluation stability.
Authors:Sanjeda Akter, Ibne Farabi Shihab, Anuj Sharma
Abstract:
Crash detection from video feeds is a critical problem in intelligent transportation systems. Recent developments in large language models (LLMs) and vision-language models (VLMs) have transformed how we process, reason about, and summarize multimodal information. This paper surveys recent methods leveraging LLMs for crash detection from video data. We present a structured taxonomy of fusion strategies, summarize key datasets, analyze model architectures, compare performance benchmarks, and discuss ongoing challenges and opportunities. Our review provides a foundation for future research in this fast-growing intersection of video understanding and foundation models.
中文: 本文综述了利用大语言模型从视频数据中检测碰撞的最新方法,分析了融合策略、数据集、架构和性能基准,并指出了该领域面临的挑战与机遇。
English: This paper surveys recent methods using large language models for crash detection in video data, analyzing fusion strategies, datasets, architectures, and benchmarks while highlighting challenges and opportunities in the field.
Authors:Yaowei Li, Xiaoyu Li, Zhaoyang Zhang, Yuxuan Bian, Gan Liu, Xinyuan Li, Jiale Xu, Wenbo Hu, Yating Liu, Lingen Li, Jing Cai, Yuexian Zou, Yancheng He, Ying Shan
Abstract:
Image customization, a crucial technique for industrial media production, aims to generate content that is consistent with reference images. However, current approaches conventionally separate image customization into position-aware and position-free customization paradigms and lack a universal framework for diverse customization, limiting their applications across various scenarios. To overcome these limitations, we propose IC-Custom, a unified framework that seamlessly integrates position-aware and position-free image customization through in-context learning. IC-Custom concatenates reference images with target images to a polyptych, leveraging DiT's multi-modal attention mechanism for fine-grained token-level interactions. We introduce the In-context Multi-Modal Attention (ICMA) mechanism with learnable task-oriented register tokens and boundary-aware positional embeddings to enable the model to correctly handle different task types and distinguish various inputs in polyptych configurations. To bridge the data gap, we carefully curated a high-quality dataset of 12k identity-consistent samples with 8k from real-world sources and 4k from high-quality synthetic data, avoiding the overly glossy and over-saturated synthetic appearance. IC-Custom supports various industrial applications, including try-on, accessory placement, furniture arrangement, and creative IP customization. Extensive evaluations on our proposed ProductBench and the publicly available DreamBench demonstrate that IC-Custom significantly outperforms community workflows, closed-source models, and state-of-the-art open-source approaches. IC-Custom achieves approximately 73% higher human preference across identity consistency, harmonicity, and text alignment metrics, while training only 0.4% of the original model parameters. Project page: https://liyaowei-stu.github.io/project/IC_Custom
中文: IC-Custom通过上下文学习提出了一个统一框架,无缝整合了位置感知和无位置图像定制,在仅训练少量参数的情况下显著优于现有方法,并支持多种工业应用场景。
English: IC-Custom introduces a unified framework that integrates position-aware and position-free image customization through in-context learning, achieving superior performance with minimal parameter training while supporting diverse industrial applications.
Authors:Yaowei Li, Xiaoyu Li, Zhaoyang Zhang, Yuxuan Bian, Gan Liu, Xinyuan Li, Jiale Xu, Wenbo Hu, Yating Liu, Lingen Li, Jing Cai, Yuexian Zou, Yancheng He, Ying Shan
Abstract:
Image customization, a crucial technique for industrial media production, aims to generate content that is consistent with reference images. However, current approaches conventionally separate image customization into position-aware and position-free customization paradigms and lack a universal framework for diverse customization, limiting their applications across various scenarios. To overcome these limitations, we propose IC-Custom, a unified framework that seamlessly integrates position-aware and position-free image customization through in-context learning. IC-Custom concatenates reference images with target images to a polyptych, leveraging DiT's multi-modal attention mechanism for fine-grained token-level interactions. We propose the In-context Multi-Modal Attention (ICMA) mechanism, which employs learnable task-oriented register tokens and boundary-aware positional embeddings to enable the model to effectively handle diverse tasks and distinguish between inputs in polyptych configurations. To address the data gap, we curated a 12K identity-consistent dataset with 8K real-world and 4K high-quality synthetic samples, avoiding the overly glossy, oversaturated look typical of synthetic data. IC-Custom supports various industrial applications, including try-on, image insertion, and creative IP customization. Extensive evaluations on our proposed ProductBench and the publicly available DreamBench demonstrate that IC-Custom significantly outperforms community workflows, closed-source models, and state-of-the-art open-source approaches. IC-Custom achieves about 73\% higher human preference across identity consistency, harmony, and text alignment metrics, while training only 0.4\% of the original model parameters. Project page: https://liyaowei-stu.github.io/project/IC_Custom
中文: IC-Custom通过上下文学习提出了一个统一框架,无缝整合了位置感知和无位置图像定制,在仅训练少量参数的情况下显著优于现有方法,并支持多种工业应用场景。
English: IC-Custom introduces a unified framework that integrates position-aware and position-free image customization through in-context learning, achieving superior performance with minimal parameter training while supporting diverse industrial applications.
Authors:Yuxin Mao, Zhen Qin, Jinxing Zhou, Hui Deng, Xuyang Shen, Bin Fan, Jing Zhang, Yiran Zhong, Yuchao Dai
Abstract:
Autoregressive (AR) models have garnered significant attention in image generation for their ability to effectively capture both local and global structures within visual data. However, prevalent AR models predominantly rely on the transformer architectures, which are beset by quadratic computational complexity concerning input sequence length and substantial memory overhead due to the necessity of maintaining key-value caches. Although linear attention mechanisms have successfully reduced this burden in language models, our initial experiments reveal that they significantly degrade image generation quality because of their inability to capture critical long-range dependencies in visual data. We propose Linear Attention with Spatial-Aware Decay (LASAD), a novel attention mechanism that explicitly preserves genuine 2D spatial relationships within the flattened image sequences by computing position-dependent decay factors based on true 2D spatial location rather than 1D sequence positions. Based on this mechanism, we present LASADGen, an autoregressive image generator that enables selective attention to relevant spatial contexts with linear complexity. Experiments on ImageNet show LASADGen achieves state-of-the-art image generation performance and computational efficiency, bridging the gap between linear attention's efficiency and spatial understanding needed for high-quality generation.
中文摘要:提出的空间感知衰减线性注意力(LASAD)机制解决了线性注意力在图像生成中难以捕捉空间依赖关系的缺陷,使LASADGen能够以线性计算复杂度实现最先进的生成性能。
English Summary: The proposed Linear Attention with Spatial-Aware Decay (LASAD) mechanism overcomes linear attention's limitations in capturing spatial dependencies for image generation, enabling LASADGen to achieve state-of-the-art performance with linear computational complexity.
Authors:Liliang Ye, Yunyao Zhang, Yafeng Wu, Yi-Ping Phoebe Chen, Junqing Yu, Wei Yang, Zikai Song
Abstract:
Social media platforms serve as central hubs for content dissemination, opinion expression, and public engagement across diverse modalities. Accurately predicting the popularity of social media videos enables valuable applications in content recommendation, trend detection, and audience engagement. In this paper, we present Multimodal Video Predictor (MVP), our winning solution to the Video Track of the SMP Challenge 2025. MVP constructs expressive post representations by integrating deep video features extracted from pretrained models with user metadata and contextual information. The framework applies systematic preprocessing techniques, including log-transformations and outlier removal, to improve model robustness. A gradient-boosted regression model is trained to capture complex patterns across modalities. Our approach ranked first in the official evaluation of the Video Track, demonstrating its effectiveness and reliability for multimodal video popularity prediction on social platforms. The source code is available at https://anonymous.4open.science/r/SMPDVideo.
中文: 多模态视频预测器(MVP)通过融合视频特征、用户元数据和上下文信息,结合预处理和梯度提升回归模型,有效预测社交媒体视频的流行度,荣获SMP挑战赛2025视频赛道冠军。
English: The Multimodal Video Predictor (MVP) integrates video features, user metadata, and contextual data with preprocessing and gradient-boosted regression to effectively predict social media video popularity, winning the SMP Challenge 2025 Video Track.
Authors:Liliang Ye, Yunyao Zhang, Yafeng Wu, Yi-Ping Phoebe Chen, Junqing Yu, Wei Yang, Zikai Song
Abstract:
Social media popularity prediction plays a crucial role in content optimization, marketing strategies, and user engagement enhancement across digital platforms. However, predicting post popularity remains challenging due to the complex interplay between visual, textual, temporal, and user behavioral factors. This paper presents HyperFusion, a hierarchical multimodal ensemble learning framework for social media popularity prediction. Our approach employs a three-tier fusion architecture that progressively integrates features across abstraction levels: visual representations from CLIP encoders, textual embeddings from transformer models, and temporal-spatial metadata with user characteristics. The framework implements a hierarchical ensemble strategy combining CatBoost, TabNet, and custom multi-layer perceptrons. To address limited labeled data, we propose a two-stage training methodology with pseudo-labeling and iterative refinement. We introduce novel cross-modal similarity measures and hierarchical clustering features that capture inter-modal dependencies. Experimental results demonstrate that HyperFusion achieves competitive performance on the SMP challenge dataset. Our team achieved third place in the SMP Challenge 2025 (Image Track). The source code is available at https://anonymous.4open.science/r/SMPDImage.
Chinese: 本文提出HyperFusion分层多模态框架,通过集成学习融合视觉、文本、时空及用户特征来有效预测社交媒体热度,在SMP 2025挑战赛中荣获第三名。
English: This paper introduces HyperFusion, a hierarchical multimodal framework that integrates visual, textual, temporal, and user features through ensemble learning to effectively predict social media popularity, achieving third place in the SMP Challenge 2025.
Authors:Linghui Zhu, Yiming Li, Haiqin Weng, Yan Liu, Tianwei Zhang, Shu-Tao Xia, Zhi Wang
Abstract:
Large vision models achieve remarkable performance in various downstream tasks, primarily by personalizing pre-trained models through fine-tuning with private and valuable local data, which makes the personalized model a valuable intellectual property for its owner. Similar to the era of traditional DNNs, model stealing attacks also pose significant risks to these personalized models. However, in this paper, we reveal that most existing defense methods (developed for traditional DNNs), typically designed for models trained from scratch, either introduce additional security risks, are prone to misjudgment, or are even ineffective for fine-tuned models. To alleviate these problems, this paper proposes a harmless model ownership verification method for personalized models by decoupling similar common features. In general, our method consists of three main stages. In the first stage, we create shadow models that retain common features of the victim model while disrupting dataset-specific features. We represent the dataset-specific features of the victim model by the output differences between the shadow and victim models. After that, a meta-classifier is trained to identify stolen models by determining whether suspicious models contain the dataset-specific features of the victim. In the third stage, we conduct model ownership verification by hypothesis test to mitigate randomness and enhance robustness. Extensive experiments on benchmark datasets verify the effectiveness of the proposed method in detecting different types of model stealing simultaneously.
中文: 本文提出了一种无害的个性化模型所有权验证方法,通过分离通用特征与数据集特定特征,利用影子模型和元分类器有效检测多种模型窃取攻击,同时降低安全风险和误判。
English: This paper introduces a harmless model ownership verification method for personalized vision models by decoupling common and dataset-specific features, using shadow models and a meta-classifier to effectively detect various model stealing attacks while mitigating risks and misjudgments.
Authors:Jacopo Nudo, Mario Edoardo Pandolfo, Edoardo Loru, Mattia Samory, Matteo Cinelli, Walter Quattrociocchi
Abstract:
We investigate how Large Language Models (LLMs) behave when simulating political discourse on social media. Leveraging 21 million interactions on X during the 2024 U.S. presidential election, we construct LLM agents based on 1,186 real users, prompting them to reply to politically salient tweets under controlled conditions. Agents are initialized either with minimal ideological cues (Zero Shot) or recent tweet history (Few Shot), allowing one-to-one comparisons with human replies. We evaluate three model families (Gemini, Mistral, and DeepSeek) across linguistic style, ideological consistency, and toxicity. We find that richer contextualization improves internal consistency but also amplifies polarization, stylized signals, and harmful language. We observe an emergent distortion that we call "generation exaggeration": a systematic amplification of salient traits beyond empirical baselines. Our analysis shows that LLMs do not emulate users, they reconstruct them. Their outputs, indeed, reflect internal optimization dynamics more than observed behavior, introducing structural biases that compromise their reliability as social proxies. This challenges their use in content moderation, deliberative simulations, and policy modeling.
中文: 研究发现,大语言模型在模拟政治讨论时会通过"生成夸张"效应加剧极化与有害内容,其输出更多反映算法内部优化而非真实用户行为,这削弱了其作为社会模拟工具的可靠性。
English: This study reveals that LLMs simulating political discourse amplify polarization and harmful content through "generation exaggeration," reflecting internal algorithmic biases rather than genuine user behavior, which undermines their reliability for social applications.
Authors:Ming Li, Chenguang Wang, Yijun Liang, Xiyao Wang, Yuhang Zhou, Xiyang Wu, Yuqing Zhang, Ruiyi Zhang, Tianyi Zhou
Abstract:
Recent agentic Multi-Modal Large Language Models (MLLMs) such as GPT-o3 have achieved near-ceiling scores on various existing benchmarks, motivating a demand for more challenging test tasks. These MLLMs have been reported to excel in a few expert-level tasks for humans, e.g., GeoGuesser, reflecting their potential as a detective who can notice minuscule cues in an image and weave them into coherent, situational explanations, leading to a reliable answer. But can they match the performance of excellent human detectives? To answer this question, we investigate some hard scenarios where GPT-o3 can still handle, and find a common scenario where o3's performance drops to nearly zero, which we name CaughtCheating. It is inspired by the social media requests that ask others to detect suspicious clues from photos shared by the poster's partner. We conduct extensive experiments and analysis to understand why existing MLLMs lack sufficient capability to solve this kind of task. CaughtCheating provides a class of challenging visual perception and reasoning tasks with great value and practical usage. Success in these tasks paves the way for MLLMs to acquire human-level detective perception and reasoning capabilities.
中文摘要:尽管GPT-o3等新型多模态大语言模型在多项测试中表现出色,但在"抓包"任务中表现骤降,暴露了其在视觉感知与推理能力上与人类侦探的差距。
English Summary: Recent advanced MLLMs like GPT-o3 excel in many benchmarks but struggle significantly with the CaughtCheating task, revealing limitations in visual perception and reasoning compared to human detectives.
Authors:Wenbing Li, Zikai Song, Hang Zhou, Yunyao Zhang, Junqing Yu, Wei Yang
Abstract:
Recent efforts to combine low-rank adaptation (LoRA) with mixture-of-experts (MoE) for adapting large language models (LLMs) to multiple tasks still exhibit prevailing limitations: they either swap entire attention/feed-forward layers for switch experts or bolt on parallel expert branches, diluting parameter efficiency and task fidelity. We propose the LoRA-Mixer, a modular and lightweight MoE framework that integrates LoRA experts. Our core innovation lies in replacing the projection matrices of the attention module's input/output linear layers with dynamically routed, task-specific LoRA experts. This design ensures seamless compatibility with diverse foundation models, including transformers and state space models (SSMs), by leveraging their inherent linear projection structures. The framework supports two operational paradigms: (1) joint optimization of LoRA experts and routing mechanisms via a novel hard-soft routing strategy, or (2) direct deployment of pre-trained, frozen LoRA modules sourced from external repositories. To enable robust router training with limited data while ensuring stable routing decisions and maximizing expert reuse, we introduce an adaptive Specialization Balance Loss (SBL) that jointly optimizes expert balance and task-specific alignment. Extensive experiments on seven benchmark datasets, including MedQA, CoLA, SST-2, GSM8K, ARC-E, ARC-C, and HumanEval, demonstrate the effectiveness of LoRA-Mixer. On datasets such as GSM8K, HumanEval, and MedQA, LoRA-Mixer achieves significant improvements of 7.61%, 4.88%, and 3.08% over the base models, respectively. Compared with state-of-the-art methods, LoRA-Mixer achieves additional improvements of 1.09%, 1.45%, and 1.68%, respectively, using only 48% of the parameters, demonstrating its efficiency and strong performance.
Chinese: LoRA-Mixer框架通过动态路由将LoRA专家集成到注意力模块的线性层中,在多个基准测试中显著提升了参数效率和任务性能,优于基础模型和现有先进方法。
English: The LoRA-Mixer framework integrates LoRA experts into the attention module's linear layers via dynamic routing, enhancing parameter efficiency and task performance across multiple benchmarks with significant improvements over base models and state-of-the-art methods.
Authors:Chengwei Xia, Fan Ma, Ruijie Quan, Kun Zhan, Yi Yang
Abstract:
This paper addresses the challenge of generating adversarial image using a diffusion model to deceive multimodal large language models (MLLMs) into generating the targeted responses, while avoiding significant distortion of the clean image. To address the above challenges, we propose an adversarial-guided diffusion (AGD) approach for adversarial attack MLLMs. We introduce adversarial-guided noise to ensure attack efficacy. A key observation in our design is that, unlike most traditional adversarial attacks which embed high-frequency perturbations directly into the clean image, AGD injects target semantics into the noise component of the reverse diffusion. Since the added noise in a diffusion model spans the entire frequency spectrum, the adversarial signal embedded within it also inherits this full-spectrum property. Importantly, during reverse diffusion, the adversarial image is formed as a linear combination of the clean image and the noise. Thus, when applying defenses such as a simple low-pass filtering, which act independently on each component, the adversarial image within the noise component is less likely to be suppressed, as it is not confined to the high-frequency band. This makes AGD inherently robust to variety defenses. Extensive experiments demonstrate that our AGD outperforms state-of-the-art methods in attack performance as well as in model robustness to some defenses.
中文摘要:本文提出一种对抗引导扩散方法,通过将对抗信号嵌入扩散模型的全频谱噪声中,既能有效欺骗多模态大语言模型生成目标响应,又能保持图像质量并抵抗低通滤波等防御措施。
English Summary: This paper introduces an adversarial-guided diffusion (AGD) method that embeds adversarial signals into the full-spectrum noise of diffusion models to effectively deceive multimodal large language models while maintaining image quality and resisting defenses like low-pass filtering.
Authors:Luyu Chen, Quanyu Dai, Zeyu Zhang, Xueyang Feng, Mingyu Zhang, Pengcheng Tang, Xu Chen, Yue Zhu, Zhenhua Dong
Abstract:
Conversational recommender systems (CRS) enhance user experience through multi-turn interactions, yet evaluating CRS remains challenging. User simulators can provide comprehensive evaluations through interactions with CRS, but building realistic and diverse simulators is difficult. While recent work leverages large language models (LLMs) to simulate user interactions, they still fall short in emulating individual real users across diverse scenarios and lack explicit rating mechanisms for quantitative evaluation. To address these gaps, we propose RecUserSim, an LLM agent-based user simulator with enhanced simulation realism and diversity while providing explicit scores. RecUserSim features several key modules: a profile module for defining realistic and diverse user personas, a memory module for tracking interaction history and discovering unknown preferences, and a core action module inspired by Bounded Rationality theory that enables nuanced decision-making while generating more fine-grained actions and personalized responses. To further enhance output control, a refinement module is designed to fine-tune final responses. Experiments demonstrate that RecUserSim generates diverse, controllable outputs and produces realistic, high-quality dialogues, even with smaller base LLMs. The ratings generated by RecUserSim show high consistency across different base LLMs, highlighting its effectiveness for CRS evaluation.
中文摘要:RecUserSim是一种基于大语言模型的先进用户模拟器,通过整合用户画像、记忆模块和有限理性决策机制,显著提升了对话推荐系统评估的真实性与多样性,同时能提供稳定可靠的显式评分。
English Summary: RecUserSim is an advanced LLM-based user simulator that enhances realism and diversity in conversational recommender system evaluations by incorporating user profiles, memory, and bounded rationality-driven decision-making, while providing consistent explicit ratings.
Authors:Clinton Ansun Mo, Kun Hu, Chengjiang Long, Dong Yuan, Wan-Chi Siu, Zhiyong Wang
Abstract:
Motion skeletons drive 3D character animation by transforming bone hierarchies, but differences in proportions or structure make motion data hard to transfer across skeletons, posing challenges for data-driven motion synthesis. Temporal Point Clouds (TPCs) offer an unstructured, cross-compatible motion representation. Though reversible with skeletons, TPCs mainly serve for compatibility, not for direct motion task learning. Doing so would require data synthesis capabilities for the TPC format, which presents unexplored challenges regarding its unique temporal consistency and point identifiability. Therefore, we propose PUMPS, the primordial autoencoder architecture for TPC data. PUMPS independently reduces frame-wise point clouds into sampleable feature vectors, from which a decoder extracts distinct temporal points using latent Gaussian noise vectors as sampling identifiers. We introduce linear assignment-based point pairing to optimise the TPC reconstruction process, and negate the use of expensive point-wise attention mechanisms in the architecture. Using these latent features, we pre-train a motion synthesis model capable of performing motion prediction, transition generation, and keyframe interpolation. For these pre-training tasks, PUMPS performs remarkably well even without native dataset supervision, matching state-of-the-art performance. When fine-tuned for motion denoising or estimation, PUMPS outperforms many respective methods without deviating from its generalist architecture.
中文摘要:运动骨架因结构差异难以跨模型传输动画数据,而时序点云提供了兼容性表示,所提出的PUMPS自编码器无需数据集监督即可实现高效运动合成任务。
English Summary: Motion skeletons face challenges in transferring animation data due to structural differences, but Temporal Point Clouds (TPCs) provide a cross-compatible representation, with the proposed PUMPS autoencoder enabling effective motion synthesis tasks without requiring dataset supervision.
Authors:Yi Song, Hao Xu, Kai Wan, Kai-Kit Wong, Giuseppe Caire, Shlomo Shamai
Abstract:
A recent trend in wireless communications considers the migration of traditional monolithic base stations to the so-called disaggregated architecture, where radio units (RUs) implement only the low-level physical layer functionalities such as demodulation, and A/D conversion, while the high-level physical layer, such as channel decoding, is implemented as software-defined functions running on general-purpose hardware in some remote central processing unit (CP). The corresponding information theoretic model for the uplink (from the wireless users to the CP) is a multiaccess-relay channel with primitive oblivious relays. The relays (RUs) are oblivious, as they are agnostic of the users codebooks, and primitive, since the fronthaul links (from RUs to CP) are error-free with limited capacity. This class of networks has been intensely studied in the information theoretic literature, where several approximated or exact (under certain conditions) capacity results have been derived. In particular, in the Gaussian case, the model has been analyzed for fixed and known channel state. This paper is motivated by the fact that, in practice, the channel state is a random process, and it is estimated at the base station side through uplink pilot symbols sent by the users. The pilot dimension may take up a large portion of the channel coherence block, i.e., the number of symbols over which the channel state remains approximately constant. Hence, sending both pilot and data symbols from the relays to the CP may require a significant overhead, especially when the fronthaul capacity is small. As a prototypical problem, we consider the ergodic achievable rate for a diamond network formed by a single user and two relays where the channel state is known at the relays, but not known at the CP.
Chinese: 本文研究具有两个无意识中继的钻石网络的遍历可达速率,重点关注信道状态随机且仅中继已知时,信道估计开销对性能的影响。
English: This paper studies the ergodic achievable rate in a diamond network with two oblivious relays, focusing on the impact of channel state estimation overhead when the channel is random and known only at the relays, not at the central processor.
Authors:Weigang Lu, Ziyu Guan, Wei Zhao, Yaming Yang, Yujie Sun, Zheng Liang, Yibing Zhan, Dapeng Tao
Abstract:
GNN-to-MLP (G2M) methods have emerged as a promising approach to accelerate Graph Neural Networks (GNNs) by distilling their knowledge into simpler Multi-Layer Perceptrons (MLPs). These methods bridge the gap between the expressive power of GNNs and the computational efficiency of MLPs, making them well-suited for resource-constrained environments. However, existing G2M methods are limited by their inability to flexibly adjust inference cost and accuracy dynamically, a critical requirement for real-world applications where computational resources and time constraints can vary significantly. To address this, we introduce a Progressive framework designed to offer flexible and on-demand trade-offs between inference cost and accuracy for GNN-to-MLP knowledge distillation (ProGMLP). ProGMLP employs a Progressive Training Structure (PTS), where multiple MLP students are trained in sequence, each building on the previous one. Furthermore, ProGMLP incorporates Progressive Knowledge Distillation (PKD) to iteratively refine the distillation process from GNNs to MLPs, and Progressive Mixup Augmentation (PMA) to enhance generalization by progressively generating harder mixed samples. Our approach is validated through comprehensive experiments on eight real-world graph datasets, demonstrating that ProGMLP maintains high accuracy while dynamically adapting to varying runtime scenarios, making it highly effective for deployment in diverse application settings.
中文: ProGMLP提出了一种渐进式GNN到MLP知识蒸馏框架,通过顺序训练和迭代优化技术,实现了推理成本与准确性的灵活权衡。
English: ProGMLP introduces a progressive framework for GNN-to-MLP knowledge distillation, enabling flexible trade-offs between inference cost and accuracy through sequential training and iterative refinement techniques.
Authors:Xiaojie Zhang, Yuanfei Wang, Ruihai Wu, Kunqi Xu, Yu Li, Liuyu Xiang, Hao Dong, Zhaofeng He
Abstract:
Articulated objects pose diverse manipulation challenges for robots. Since their internal structures are not directly observable, robots must adaptively explore and refine actions to generate successful manipulation trajectories. While existing works have attempted cross-category generalization in adaptive articulated object manipulation, two major challenges persist: (1) the geometric diversity of real-world articulated objects complicates visual perception and understanding, and (2) variations in object functions and mechanisms hinder the development of a unified adaptive manipulation strategy. To address these challenges, we propose AdaRPG, a novel framework that leverages foundation models to extract object parts, which exhibit greater local geometric similarity than entire objects, thereby enhancing visual affordance generalization for functional primitive skills. To support this, we construct a part-level affordance annotation dataset to train the affordance model. Additionally, AdaRPG utilizes the common knowledge embedded in foundation models to reason about complex mechanisms and generate high-level control codes that invoke primitive skill functions based on part affordance inference. Simulation and real-world experiments demonstrate AdaRPG's strong generalization ability across novel articulated object categories.
中文摘要:提出的AdaRPG框架通过利用基础模型提取几何相似性更高的物体部件来增强功能原语技能的视觉可供性泛化,并运用模型内嵌知识推理复杂机制生成高层控制指令,有效解决了关节物体操作中的几何多样性和功能差异难题。
English Summary: The proposed AdaRPG framework addresses challenges in articulated object manipulation by using foundation models to extract geometrically similar object parts for improved affordance generalization and leveraging embedded knowledge to generate adaptive control strategies.
Authors:Bozhou Zhang, Nan Song, Xiatian Zhu, Li Zhang
Abstract:
Motion forecasting and planning are tasked with estimating the trajectories of traffic agents and the ego vehicle, respectively, to ensure the safety and efficiency of autonomous driving systems in dynamically changing environments. State-of-the-art methods typically adopt a one-query-one-trajectory paradigm, where each query corresponds to a unique trajectory for predicting multi-mode trajectories. While this paradigm can produce diverse motion intentions, it often falls short in modeling the intricate spatiotemporal evolution of trajectories, which can lead to collisions or suboptimal outcomes. To overcome this limitation, we propose DeMo++, a framework that decouples motion estimation into two distinct components: holistic motion intentions to capture the diverse potential directions of movement, and fine spatiotemporal states to track the agent's dynamic progress within the scene and enable a self-refinement capability. Further, we introduce a cross-scene trajectory interaction mechanism to explore the relationships between motions in adjacent scenes. This allows DeMo++ to comprehensively model both the diversity of motion intentions and the spatiotemporal evolution of each trajectory. To effectively implement this framework, we developed a hybrid model combining Attention and Mamba. This architecture leverages the strengths of both mechanisms for efficient scene information aggregation and precise trajectory state sequence modeling. Extensive experiments demonstrate that DeMo++ achieves state-of-the-art performance across various benchmarks, including motion forecasting (Argoverse 2 and nuScenes), motion planning (nuPlan), and end-to-end planning (NAVSIM).
中文: DeMo++框架通过将运动估计解耦为整体意图与精细时空状态,结合跨场景交互机制和混合注意力-Mamba模型,在多个基准测试中实现最优性能,从而提升自动驾驶系统的安全性与效率。
English: DeMo++ is a novel framework that enhances autonomous driving safety and efficiency by decoupling motion estimation into holistic intentions and fine spatiotemporal states, incorporating cross-scene interaction and a hybrid Attention-Mamba model to achieve state-of-the-art performance across multiple benchmarks.
Authors:Zhuokun Chen, Zeren Chen, Jiahao He, Mingkui Tan, Jianfei Cai, Bohan Zhuang
Abstract:
Chain-of-thought (CoT) reasoning enhances the problem-solving capabilities of large language models by encouraging step-by-step intermediate reasoning during inference. While effective, CoT introduces substantial computational overhead due to its reliance on autoregressive decoding over long token sequences. Existing acceleration strategies either reduce sequence length through early stopping or compressive reward designs, or improve decoding speed via speculative decoding with smaller models. However, speculative decoding suffers from limited speedup when the agreement between small and large models is low, and fails to exploit the potential advantages of small models in producing concise intermediate reasoning. In this paper, we present R-Stitch, a token-level, confidence-based hybrid decoding framework that accelerates CoT inference by switching between a small language model (SLM) and a large language model (LLM) along the reasoning trajectory. R-Stitch uses the SLM to generate tokens by default and delegates to the LLM only when the SLM's confidence falls below a threshold. This design avoids full-sequence rollback and selectively invokes the LLM on uncertain steps, preserving both efficiency and answer quality. R-Stitch is model-agnostic, training-free, and compatible with standard decoding pipelines. Experiments on math reasoning benchmarks demonstrate that R-Stitch achieves up to 85\% reduction in inference latency with negligible accuracy drop, highlighting its practical effectiveness in accelerating CoT reasoning.
Chinese: R-Stitch是一种无需训练的混合解码框架,通过基于令牌熵的计算分配机制在小模型与大模型间动态调度,在保持精度的同时显著加速推理过程,实现了高达4.1倍的峰值加速比。
English: R-Stitch is a training-free hybrid decoding framework that uses token-level entropy to delegate computation between small and large language models, achieving substantial inference acceleration with minimal accuracy loss by reducing both per-token complexity and token count.
Authors:Zhuokun Chen, Zeren Chen, Jiahao He, Lu Sheng, Mingkui Tan, Jianfei Cai, Bohan Zhuang
Abstract:
Chain-of-thought (CoT) enhances the problem-solving ability of large language models (LLMs) but incurs substantial inference cost due to long autoregressive trajectories. Existing acceleration strategies either shorten traces via early stopping or compression, or adopt speculative decoding with a smaller model. However, speculative decoding provides limited gains when model agreement is low and rigidly enforces token-level consistency, overlooking the observation that some smaller models, when correct, produce significantly more concise reasoning traces that could reduce inference length. We introduce R-Stitch, a training-free hybrid decoding framework that leverages token-level entropy as an uncertainty proxy to delegate computation between a small language model (SLM) and an LLM. Our analysis shows that high-entropy tokens are more likely to induce errors, motivating an entropy-guided routing strategy that lets the SLM efficiently handle low-entropy tokens while delegating uncertain ones to the LLM, thereby avoiding full rollbacks and preserving answer quality. We further extend this design with R-Stitch$^{+}$, which learns an adaptive routing policy to adjust the token budget dynamically beyond fixed thresholds. By jointly reducing per-token decoding complexity and the number of generated tokens, our method achieves substantial acceleration with negligible accuracy loss. Concretely, it attains peak speedups of 3.00$\times$ on DeepSeek-R1-Distill-Qwen-7B, 3.85$\times$ on 14B, and 4.10$\times$ on QWQ-32B while maintaining accuracy comparable to full LLM decoding. Moreover, it naturally enables adaptive efficiency--accuracy trade-offs that can be tailored to diverse computational budgets without retraining.
Chinese: R-Stitch是一种无需训练的混合解码框架,通过基于令牌熵的计算分配机制在小模型与大模型间动态调度,在保持精度的同时显著加速推理过程,实现了高达4.1倍的峰值加速比。
English: R-Stitch is a training-free hybrid decoding framework that uses token-level entropy to delegate computation between small and large language models, achieving substantial inference acceleration with minimal accuracy loss by reducing both per-token complexity and token count.
Authors:Chengjie Ge, Yufeng Peng, Xueyang Fu, Qiyu Kang, Xuhao Li, Qixin Zhang, Junhao Ren, Zheng-Jun Zha
Abstract:
Spiking Neural Networks (SNNs) draw inspiration from biological neurons to create realistic models for brain-like computation, demonstrating effectiveness in processing temporal information with energy efficiency and biological realism. Most existing SNNs assume a single time constant for neuronal membrane voltage dynamics, modeled by first-order ordinary differential equations (ODEs) with Markovian characteristics. Consequently, the voltage state at any time depends solely on its immediate past value, potentially limiting network expressiveness. Real neurons, however, exhibit complex dynamics influenced by long-term correlations and fractal dendritic structures, suggesting non-Markovian behavior. Motivated by this, we propose the Fractional SPIKE Differential Equation neural network (fspikeDE), which captures long-term dependencies in membrane voltage and spike trains through fractional-order dynamics. These fractional dynamics enable more expressive temporal patterns beyond the capability of integer-order models. For efficient training of fspikeDE, we introduce a gradient descent algorithm that optimizes parameters by solving an augmented fractional-order ODE (FDE) backward in time using adjoint sensitivity methods. Extensive experiments on diverse image and graph datasets demonstrate that fspikeDE consistently outperforms traditional SNNs, achieving superior accuracy, comparable energy efficiency, reduced training memory usage, and enhanced robustness against noise. Our approach provides a novel open-sourced computational toolbox for fractional-order SNNs, widely applicable to various real-world tasks.
中文摘要:提出的fspikeDE模型通过引入分数阶动力学来捕捉神经活动的长期依赖性,在多个数据集上实现了比传统脉冲神经网络更高的精度、能效和鲁棒性。
English Summary: The proposed fspikeDE model introduces fractional-order dynamics to Spiking Neural Networks, capturing long-term dependencies in neural activity and outperforming traditional SNNs in accuracy, energy efficiency, and robustness across multiple datasets.
Authors:Boyu Qiao, Kun Li, Wei Zhou, Songlin Hu
Abstract:
In the human-bot symbiotic information ecosystem, social bots play key roles in spreading and correcting disinformation. Understanding their influence is essential for risk control and better governance. However, current studies often rely on simplistic user and network modeling, overlook the dynamic behavior of bots, and lack quantitative evaluation of correction strategies. To fill these gaps, we propose MADD, a Multi Agent based framework for Disinformation Dissemination. MADD constructs a more realistic propagation network by integrating the Barabasi Albert Model for scale free topology and the Stochastic Block Model for community structures, while designing node attributes based on real world user data. Furthermore, MADD incorporates both malicious and legitimate bots, with their controlled dynamic participation allows for quantitative analysis of correction strategies. We evaluate MADD using individual and group level metrics. We experimentally verify the real world consistency of MADD user attributes and network structure, and we simulate the dissemination of six disinformation topics, demonstrating the differential effects of fact based and narrative based correction strategies.
中文摘要:MADD框架通过模拟真实社交网络和动态机器人行为,填补了现有虚假信息研究的空白,实现了对辟谣策略的定量评估。
English Summary: The MADD framework addresses limitations in current disinformation studies by simulating realistic social networks and dynamic bot behaviors to quantitatively evaluate correction strategies.
Authors:Leixian Shen, Leni Yang, Haotian Li, Yun Wang, Yuyu Luo, Huamin Qu
Abstract:
Empirical research in creative design deepens our theoretical understanding of design principles and perceptual effects, offering valuable guidance for innovating creation tools. However, how these empirical insights currently influence the development of creation tools, and how their integration can be enhanced in the future, remains insufficiently understood. In this paper, we aim to unveil the gap through a case study on data videos, a prominent and wide-spread medium for effective data storytelling. To achieve the goal, we conducted a comprehensive analysis of 46 empirical research papers and 48 creation tool papers on data video, complemented by interviews with 11 experts. Building upon a systematic collection and structured characterization of empirical research by their methodologies (e.g., corpus analysis, comparative evaluations) and component focus (e.g., visuals, motions, narratives, audio), we conducted a context-aware citation analysis and revealed a taxonomy of recurring patterns in how empirical findings inform tool design across citation functions (e.g., problem framing, technical reference). Expert interviews further uncovered researchers' practice patterns in applying empirical findings (e.g., adaptation, synthesis, iteration, etc.) and identified key factors influencing applicability, such as contextual relevance, granularity matching, clarity, credibility, and feasibility. Finally, we derive suggestions and discuss future opportunities to foster closer mutual engagement between empirical and tool research, aiming to reinforce the theoretical grounding of creation tools and enhance the practical impact of empirical research.
中文摘要:本研究通过数据视频案例揭示了实证研究与创作工具开发之间的应用鸿沟,系统分析了引用模式与专家实践,并提出加强两者协同发展的具体路径。
English Summary: This study investigates the gap between empirical design research and creation tool development through data video analysis, revealing citation patterns and expert practices while proposing strategies to strengthen their mutual integration.
Authors:Shikhar Bharadwaj, Samuele Cornell, Kwanghee Choi, Satoru Fukayama, Hye-jin Shim, Soham Deshmukh, Shinji Watanabe
Abstract:
Masked token prediction has emerged as a powerful pre-training objective across language, vision, and speech, offering the potential to unify these diverse modalities through a single pre-training task. However, its application for general audio understanding remains underexplored, with BEATs being the only notable example. BEATs has seen limited modifications due to the absence of open-source pre-training code. Furthermore, BEATs was trained only on AudioSet, restricting its broader downstream applicability. To address these gaps, we present OpenBEATs, an open-source framework that extends BEATs via multi-domain audio pre-training. We conduct comprehensive evaluations across six types of tasks, twenty five datasets, and three audio domains, including audio reasoning tasks such as audio question answering, entailment, and captioning. OpenBEATs achieves state-of-the-art performance on six bioacoustics datasets, two environmental sound datasets and five reasoning datasets, performing better than models exceeding a billion parameters at one-fourth their parameter size. These results demonstrate the effectiveness of multi-domain datasets and masked token prediction task to learn general-purpose audio representations. To promote further research and reproducibility, we release all pre-training and evaluation code, pretrained and fine-tuned checkpoints, and training logs at https://shikhar-s.github.io/OpenBEATs
中文: OpenBEATs是一个开源框架,通过多领域音频预训练扩展了BEATs,在多种音频任务中实现了最先进的性能,同时证明了掩码标记预测在学习通用音频表示方面的有效性。
English: OpenBEATs is an open-source framework that extends BEATs through multi-domain audio pre-training, achieving state-of-the-art performance across diverse audio tasks while demonstrating the effectiveness of masked token prediction for learning general-purpose audio representations.
Authors:Youlin Wu, Yuanyuan Sun, Xiaokun Zhang, Haoxi Zhan, Bo Xu, Liang Yang, Hongfei Lin
Abstract:
News recommender systems aim to provide personalized news reading experiences for users based on their reading history. Behavioral science studies suggest that screen-based news reading contains three successive steps: scanning, title reading, and then clicking. Adhering to these steps, we find that intra-news entity interest dominates the scanning stage, while the inter-news entity interest guides title reading and influences click decisions. Unfortunately, current methods overlook the unique utility of entities in news recommendation. To this end, we propose a novel method called IP2 to probe entity-guided reading interest at both intra- and inter-news levels. At the intra-news level, a Transformer-based entity encoder is devised to aggregate mentioned entities in the news title into one signature entity. Then, a signature entity-title contrastive pre-training is adopted to initialize entities with proper meanings using the news story context, which in the meantime facilitates us to probe for intra-news entity interest. As for the inter-news level, a dual tower user encoder is presented to capture inter-news reading interest from both the title meaning and entity sides. In addition to highlighting the contribution of inter-news entity guidance, a cross-tower attention link is adopted to calibrate title reading interest using inter-news entity interest, thus further aligning with real-world behavior. Extensive experiments on two real-world datasets demonstrate that our IP2 achieves state-of-the-art performance in news recommendation.
Chinese: 本文提出的IP2方法通过在新闻内和新闻间两个层面建模实体引导的阅读兴趣,利用签名实体编码和跨塔注意力机制,实现了新闻推荐领域的最先进性能。
English: The proposed IP2 method enhances news recommendation by modeling entity-guided reading interests at both intra- and inter-news levels, achieving state-of-the-art performance through signature entity encoding and cross-tower attention mechanisms.
Authors:Qihang Li, Jichen Yang, Yaqian Chen, Yuwen Chen, Hanxue Gu, Lars J. Grimm, Maciej A. Mazurowski
Abstract:
Breast MRI provides high-resolution imaging critical for breast cancer screening and preoperative staging. However, existing segmentation methods for breast MRI remain limited in scope, often focusing on only a few anatomical structures, such as fibroglandular tissue or tumors, and do not cover the full range of tissues seen in scans. This narrows their utility for quantitative analysis. In this study, we present BreastSegNet, a multi-label segmentation algorithm for breast MRI that covers nine anatomical labels: fibroglandular tissue (FGT), vessel, muscle, bone, lesion, lymph node, heart, liver, and implant. We manually annotated a large set of 1123 MRI slices capturing these structures with detailed review and correction from an expert radiologist. Additionally, we benchmark nine segmentation models, including U-Net, SwinUNet, UNet++, SAM, MedSAM, and nnU-Net with multiple ResNet-based encoders. Among them, nnU-Net ResEncM achieves the highest average Dice scores of 0.694 across all labels. It performs especially well on heart, liver, muscle, FGT, and bone, with Dice scores exceeding 0.73, and approaching 0.90 for heart and liver. All model code and weights are publicly available, and we plan to release the data at a later date.
中文:BreastSegNet是一种针对乳腺MRI的多标签分割算法,可识别九种解剖结构,其中nnU-Net ResEncM模型实现了最高分割精度,在多个组织上表现出优异性能。
English: BreastSegNet is a multi-label segmentation algorithm for breast MRI that identifies nine anatomical structures, with nnU-Net ResEncM achieving the highest accuracy and demonstrating strong performance across multiple tissues.
Authors:Siqi Shen, Mehar Singh, Lajanugen Logeswaran, Moontae Lee, Honglak Lee, Rada Mihalcea
Abstract:
There has been extensive research on assessing the value orientation of Large Language Models (LLMs) as it can shape user experiences across demographic groups. However, several challenges remain. First, while the Multiple Choice Question (MCQ) setting has been shown to be vulnerable to perturbations, there is no systematic comparison of probing methods for value probing. Second, it is unclear to what extent the probed values capture in-context information and reflect models' preferences for real-world actions. In this paper, we evaluate the robustness and expressiveness of value representations across three widely used probing strategies. We use variations in prompts and options, showing that all methods exhibit large variances under input perturbations. We also introduce two tasks studying whether the values are responsive to demographic context, and how well they align with the models' behaviors in value-related scenarios. We show that the demographic context has little effect on the free-text generation, and the models' values only weakly correlate with their preference for value-based actions. Our work highlights the need for a more careful examination of LLM value probing and awareness of its limitations.
中文: 本研究评估了大语言模型价值表征的稳健性与表达力,发现其对输入扰动高度敏感,且探测出的价值观与模型在现实场景中的行为偏好关联微弱。
English: This study evaluates the robustness and expressiveness of value representations in Large Language Models, revealing significant sensitivity to input perturbations and weak alignment between probed values and real-world behavioral preferences.
Authors:Shijun Guo, Haoran Xu, Yaming Yang, Ziyu Guan, Wei Zhao, Xinyi Zhang, Yishan Song, Jiwei Chen
Abstract:
The openness of social media enables the free exchange of opinions, but it also presents challenges in guiding opinion evolution towards global consensus. Existing methods often directly modify user views or enforce cross-group connections. These intrusive interventions undermine user autonomy, provoke psychological resistance, and reduce the efficiency of global consensus. Additionally, due to the lack of a long-term perspective, promoting local consensus often exacerbates divisions at the macro level. To address these issues, we propose the hierarchical, non-intrusive opinion guidance framework, H-NeiFi. It first establishes a two-layer dynamic model based on social roles, considering the behavioral characteristics of both experts and non-experts. Additionally, we introduce a non-intrusive neighbor filtering method that adaptively controls user communication channels. Using multi-agent reinforcement learning (MARL), we optimize information propagation paths through a long-term reward function, avoiding direct interference with user interactions. Experiments show that H-NeiFi increases consensus speed by 22.0% to 30.7% and maintains global convergence even in the absence of experts. This approach enables natural and efficient consensus guidance by protecting user interaction autonomy, offering a new paradigm for social network governance.
中文摘要:提出的H-NeiFi框架通过基于角色的双层动态建模和非侵入式邻居过滤方法,在不直接干预用户交互的前提下,利用多智能体强化学习优化信息传播路径,使共识形成速度提升22.0%-30.7%并保持全局收敛性。
English Summary: The proposed H-NeiFi framework uses role-based modeling and adaptive neighbor filtering to guide opinion evolution non-intrusively, accelerating consensus by 22.0%-30.7% while preserving user autonomy through multi-agent reinforcement learning.
Authors:Lotfi El Hafi, Kazuma Onishi, Shoichi Hasegawa, Akira Oyama, Tomochika Ishikawa, Masashi Osada, Carl Tornberg, Ryoma Kado, Kento Murata, Saki Hashimoto, Sebastian Carrera Villalobos, Akira Taniguchi, Gustavo Alfonso Garcia Ricardez, Yoshinobu Hagiwara, Tatsuya Aoki, Kensuke Iwata, Takato Horii, Yukiko Horikawa, Takahiro Miyashita, Tadahiro Taniguchi, Hiroshi Ishiguro
Abstract:
Cybernetic avatars (CAs) are key components of an avatar-symbiotic society, enabling individuals to overcome physical limitations through virtual agents and robotic assistants. While semi-autonomous CAs intermittently require human teleoperation and supervision, the deployment of fully autonomous CAs remains a challenge. This study evaluates public perception and potential social impacts of fully autonomous CAs for physical support in daily life. To this end, we conducted a large-scale demonstration and survey during Avatar Land, a 19-day public event in Osaka, Japan, where fully autonomous robotic CAs, alongside semi-autonomous CAs, performed daily object retrieval tasks. Specifically, we analyzed responses from 2,285 visitors who engaged with various CAs, including a subset of 333 participants who interacted with fully autonomous CAs and shared their perceptions and concerns through a survey questionnaire. The survey results indicate interest in CAs for physical support in daily life and at work. However, concerns were raised regarding task execution reliability. In contrast, cost and human-like interaction were not dominant concerns. Project page: https://lotfielhafi.github.io/FACA-Survey/.
中文: 本研究评估了公众对全自主控制论化身在日常物理支持方面的兴趣与担忧,发现人们对其应用抱有热情,但对其任务执行可靠性存在显著忧虑,而成本和拟人化交互则非主要关注点。
English: This study assesses public interest and concerns about fully autonomous cybernetic avatars for daily physical support, finding enthusiasm for their use but significant worries about reliability, while cost and human-like interaction are less prominent issues.
Authors:Zhenyu Liu, Yi Ma, Rahim Tafazolli, Zhi Ding
Abstract:
Deep learning-based implicit channel state information (CSI) feedback has been introduced to enhance spectral efficiency in massive MIMO systems. Existing methods often show performance degradation in ultra-low-rate scenarios and inadaptability across diverse environments. In this paper, we propose Dual-ImRUNet, an efficient uplink-assisted deep implicit CSI feedback framework incorporating two novel plug-in preprocessing modules to achieve ultra-low feedback rates while maintaining high environmental robustness. First, a novel bi-directional correlation enhancement module is proposed to strengthen the correlation between uplink and downlink CSI eigenvector matrices. This module projects highly correlated uplink and downlink channel matrices into their respective eigenspaces, effectively reducing redundancy for ultra-low-rate feedback. Second, an innovative input format alignment module is designed to maintain consistent data distributions at both encoder and decoder sides without extra transmission overhead, thereby enhancing robustness against environmental variations. Finally, we develop an efficient transformer-based implicit CSI feedback network to exploit angular-delay domain sparsity and bi-directional correlation for ultra-low-rate CSI compression. Simulation results demonstrate successful reduction of the feedback overhead by 85% compared with the state-of-the-art method and robustness against unseen environments.
中文:提出的Dual-ImRUNet框架通过新型预处理模块和基于变换器的压缩技术,在保持环境鲁棒性的同时将反馈开销降低了85%。
English: The proposed Dual-ImRUNet framework achieves an 85% reduction in feedback overhead while maintaining environmental robustness through novel preprocessing modules and transformer-based compression.
Authors:Kanghyun Ryu, Minjun Sung, Piyush Gupta, Jovin D'sa, Faizan M. Tariq, David Isele, Sangjae Bae
Abstract:
Motion planning for autonomous vehicles (AVs) in dense traffic is challenging, often leading to overly conservative behavior and unmet planning objectives. This challenge stems from the AVs' limited ability to anticipate and respond to the interactive behavior of surrounding agents. Traditional decoupled prediction and planning pipelines rely on non-interactive predictions that overlook the fact that agents often adapt their behavior in response to the AV's actions. To address this, we propose Interaction-Aware Neural Network-Enhanced Model Predictive Path Integral (IANN-MPPI) control, which enables interactive trajectory planning by predicting how surrounding agents may react to each control sequence sampled by MPPI. To improve performance in structured lane environments, we introduce a spline-based prior for the MPPI sampling distribution, enabling efficient lane-changing behavior. We evaluate IANN-MPPI in a dense traffic merging scenario, demonstrating its ability to perform efficient merging maneuvers. Our project website is available at https://sites.google.com/berkeley.edu/iann-mppi
中文摘要:提出的IANN-MPPI控制方法通过预测周围智能体对采样控制序列的反应,实现自动驾驶车辆的交互式轨迹规划,并借助样条曲线先验提升密集交通场景(如汇入车流)中的性能表现。
English Summary: The proposed IANN-MPPI control method enables interactive trajectory planning for autonomous vehicles by predicting surrounding agents' reactions to sampled control sequences, improving performance in dense traffic scenarios like merging through a spline-based prior.
Authors:Darshan Gadginmath, Farhad Nawaz, Minjun Sung, Faizan M Tariq, Sangjae Bae, David Isele, Fabio Pasqualetti, Jovin D'sa
Abstract:
Navigation in dynamic environments requires autonomous systems to reason about uncertainties in the behavior of other agents. In this paper, we introduce a unified framework that combines trajectory planning with multimodal predictions and active probing to enhance decision-making under uncertainty. We develop a novel risk metric that seamlessly integrates multimodal prediction uncertainties through mixture models. When these uncertainties follow a Gaussian mixture distribution, we prove that our risk metric admits a closed-form solution, and is always finite, thus ensuring analytical tractability. To reduce prediction ambiguity, we incorporate an active probing mechanism that strategically selects actions to improve its estimates of behavioral parameters of other agents, while simultaneously handling multimodal uncertainties. We extensively evaluate our framework in autonomous navigation scenarios using the MetaDrive simulation environment. Results demonstrate that our active probing approach successfully navigates complex traffic scenarios with uncertain predictions. Additionally, our framework shows robust performance across diverse traffic agent behavior models, indicating its broad applicability to real-world autonomous navigation challenges. Code and videos are available at https://darshangm.github.io/papers/active-probing-multimodal-predictions/.
中文摘要:本文提出了一种统一框架,将轨迹规划与多模态预测及主动探测相结合,通过新型风险度量和策略性动作选择来提升自动驾驶在不确定环境中的决策能力,并在仿真实验中验证了其处理复杂交通场景的有效性。
English Summary: This paper presents a unified framework for autonomous navigation that integrates trajectory planning with multimodal predictions and active probing to improve decision-making under uncertainty, demonstrating robust performance in complex traffic scenarios through simulations.
Authors:Qi Feng, Yihong Liu, Hinrich Schütze
Abstract:
Curriculum learning is a widely adopted training strategy in natural language processing (NLP), where models are exposed to examples organized by increasing difficulty to enhance learning efficiency and performance. However, most existing approaches rely on manually defined difficulty metrics -- such as text length -- which may not accurately reflect the model's own perspective. To overcome this limitation, we present a self-adaptive curriculum learning paradigm that prioritizes fine-tuning examples based on difficulty scores predicted by pre-trained language models (PLMs) themselves. Building on these scores, we explore various training strategies that differ in the ordering of examples for the fine-tuning: from easy-to-hard, hard-to-easy, to mixed sampling. We evaluate our method on four natural language understanding (NLU) datasets covering both binary and multi-class classification tasks. Experimental results show that our approach leads to faster convergence and improved performance compared to standard random sampling.
中文摘要:本研究提出了一种自适应课程学习方法,利用预训练语言模型自动评估样本难度,实验证明该方法在多个自然语言理解数据集上相比随机采样能实现更快的收敛速度和更好的性能。
English Summary: This study introduces a self-adaptive curriculum learning method that uses pre-trained language models to automatically assess example difficulty, demonstrating improved convergence speed and performance across multiple NLU datasets compared to random sampling.
Authors:Yuheng Huang, Da Song, Zhenlan Ji, Shuai Wang, Lei Ma
Abstract:
By integrating tools from external APIs, Large Language Models (LLMs) have expanded their promising capabilities in a diverse spectrum of complex real-world tasks. However, testing, evaluation, and analysis of LLM tool use remain in their early stages. Most existing benchmarks rely on manually collected test cases, many of which cannot be automatically checked for semantic correctness and instead depend on static methods such as string matching. Additionally, these benchmarks often overlook the complex interactions that occur between sequential API calls, which are common in real-world applications. To fill the gap, in this paper, we introduce StateGen, an automated framework designed to generate diverse coding tasks involving sequential API interactions. StateGen combines state-machine-based API constraint solving and validation, energy-based sampling, and control-flow injection to generate executable programs. These programs are then translated into human-like natural language task descriptions through a collaboration of two LLM agents. Utilizing StateGen, we construct StateEval, a benchmark encompassing 120 verified test cases spanning across three representative scenarios: Session Service, Tensor Operation, and ElevenLabs MCP. Experimental results confirm that StateGen can effectively generate challenging and realistic API-oriented tasks, highlighting areas for improvement in current LLMs incorporating APIs.
中文摘要:本文提出StateGen自动框架,通过生成涉及顺序API交互的多样化编程任务来弥补当前LLM工具使用评估的不足,并在三个实际场景中通过StateEval基准测试验证了其生成挑战性任务的有效性。
English Summary: This paper introduces StateGen, an automated framework that generates diverse coding tasks involving sequential API interactions to address limitations in current LLM tool-use evaluation, and demonstrates its effectiveness through the StateEval benchmark across three practical scenarios.
Authors:Bingshen Mu, Kun Wei, Pengcheng Guo, Lei Xie
Abstract:
Despite improvements in automatic speech recognition, performance drops with accented speech. Generative error correction (GER) leverages the linguistic knowledge of large language models (LLMs), outperforming typical language model methods. However, it lacks specificity in accented speech scenarios. Accents represent deviations from standard pronunciation, making multi-granularity pronunciation and semantic information essential for accented speech recognition. Moreover, accents exhibit considerable diversity, with each accent possessing distinct characteristics. In this study, we leverage GER to improve transcription accuracy by addressing the two primary features. We propose the multi-modal GER, which integrates pronunciation information from the speech modality, and the multi-granularity GER, which incorporates fine-grained phoneme-level pronunciation information. These methods enable the LLM to utilize the pronunciation information of accented speech and the semantic information from word-level hypotheses for accurate transcription predictions through low-rank adaptation (LoRA) fine-tuning. We employ a three-stage strategy to train separate multi-modal GER models for each accent to obtain mono-accent LoRA experts. By adopting our proposed HDMoLE method, which incorporates hierarchical routing and dynamic thresholds within the mixture of LoRA experts, we effectively merge mono-accent LoRA experts within a single multi-modal GER to overcome accent diversity challenges. Furthermore, multi-granularity GER leverages N-best word-level and phoneme-level hypotheses from the HDMoLE model to predict final transcriptions. Experiments on a multi-accent English dataset show that our methods reduce word error rate by 67.35% compared to the baseline vanilla Whisper-large-v3 model.
中文: 本研究提出了多模态和多粒度生成式纠错方法,通过LoRA微调整合发音与语义信息,相比基线模型,在多口音英语数据集上将词错误率降低了67.35%。
English: This study introduces multi-modal and multi-granularity generative error correction methods that integrate pronunciation and semantic information through LoRA fine-tuning, reducing word error rates by 67.35% for accented speech recognition compared to baseline models.
Authors:Zhongzhang Chen, Miao Fan, Shengtong Xu, Mengmeng Yang, Kun Jiang, Xiangzeng Liu, Haoyi Xiong
Abstract:
High-definition (HD) semantic mapping of complex intersections poses significant challenges for traditional vehicle-based approaches due to occlusions and limited perspectives. This paper introduces a novel camera-LiDAR fusion framework that leverages elevated intelligent roadside units (IRUs). Additionally, we present RS-seq, a comprehensive dataset developed through the systematic enhancement and annotation of the V2X-Seq dataset. RS-seq includes precisely labelled camera imagery and LiDAR point clouds collected from roadside installations, along with vectorized maps for seven intersections annotated with detailed features such as lane dividers, pedestrian crossings, and stop lines. This dataset facilitates the systematic investigation of cross-modal complementarity for HD map generation using IRU data. The proposed fusion framework employs a two-stage process that integrates modality-specific feature extraction and cross-modal semantic integration, capitalizing on camera high-resolution texture and precise geometric data from LiDAR. Quantitative evaluations using the RS-seq dataset demonstrate that our multimodal approach consistently surpasses unimodal methods. Specifically, compared to unimodal baselines evaluated on the RS-seq dataset, the multimodal approach improves the mean Intersection-over-Union (mIoU) for semantic segmentation by 4\% over the image-only results and 18\% over the point cloud-only results. This study establishes a baseline methodology for IRU-based HD semantic mapping and provides a valuable dataset for future research in infrastructure-assisted autonomous driving systems.
Chinese: 本文提出了一种基于路侧单元的相机与激光雷达融合框架,以解决复杂路口高清语义地图构建中的遮挡问题,并发布了RS-seq数据集,实验表明该多模态方法比单模态图像和点云在mIoU上分别提升4%和18%。
English: This paper proposes a camera-LiDAR fusion framework using elevated roadside units to overcome occlusion limitations in HD semantic mapping, introducing the RS-seq dataset which demonstrates a 4% mIoU improvement over image-only and 18% over point cloud-only methods.
Authors:Jingzhao Xie, Zhenglian Li, Gang Sun, Long Luo, Hongfang Yu, Dusit Niyato
Abstract:
Computing Power Network (CPN) unifies wide-area computing resources through coordinated network control, while cloud-native abstractions enable flexible resource orchestration and on-demand service provisioning atop the elastic infrastructure CPN provides. However, current approaches fall short of fully integrating computing resources via network-enabled coordination as envisioned by CPN. In particular, optimally mapping services to an underlying infrastructure to maximize resource efficiency and service satisfaction remains challenging. To overcome this challenge, we formally define the service mapping problem in CPN, establish its theoretical intractability, and identify key challenges in practical optimization. We propose Adaptive Bilevel Search (ABS), a modular framework featuring (1) graph partitioning-based reformulation to capture variable coupling, (2) a bilevel optimization architecture for efficient global exploration with local optimality guarantees, and (3) fragmentation-aware evaluation for global performance guidance. Implemented using distributed particle swarm optimization, ABS is extensively evaluated across diverse CPN scenarios, consistently outperforming existing approaches. Notably, in complex scenarios, ABS achieves up to 73.2% higher computing resource utilization and a 60.2% higher service acceptance ratio compared to the best-performing baseline.
中文: 本文提出自适应双层搜索(ABS)框架,通过优化资源分配和服务布局解决算力网络中的服务映射难题,在资源利用率和服务接受率上显著超越现有方法。
English: The paper introduces Adaptive Bilevel Search (ABS), a modular framework that addresses the service mapping challenge in Computing Power Networks by optimizing resource allocation and service placement, significantly outperforming existing methods in resource utilization and service acceptance.
Authors:Hao Huang, Shuaihang Yuan, Geeta Chandra Raju Bethala, Congcong Wen, Anthony Tzes, Yi Fang
Abstract:
Policy learning focuses on devising strategies for agents in embodied artificial intelligence systems to perform optimal actions based on their perceived states. One of the key challenges in policy learning involves handling complex, long-horizon tasks that require managing extensive sequences of actions and observations with multiple modes. Wavelet analysis offers significant advantages in signal processing, notably in decomposing signals at multiple scales to capture both global trends and fine-grained details. In this work, we introduce a novel wavelet policy learning framework that utilizes wavelet transformations to enhance policy learning. Our approach leverages learnable multi-scale wavelet decomposition to facilitate detailed observation analysis and robust action planning over extended sequences. We detail the design and implementation of our wavelet policy, which incorporates lifting schemes for effective multi-resolution analysis and action generation. This framework is evaluated across multiple complex scenarios, including robotic manipulation, self-driving, and multi-robot collaboration, demonstrating the effectiveness of our method in improving the precision and reliability of the learned policy.
Chinese: 本文提出了一种新颖的小波策略学习框架,利用多尺度小波分解来增强复杂长程任务中的策略学习,在机器人操控、自动驾驶和多机器人协作等场景中证明了该方法能有效提高策略的精确性和可靠性。
English: This paper introduces a novel wavelet policy learning framework that leverages multi-scale wavelet decomposition to enhance policy learning in complex, long-horizon tasks, demonstrating improved precision and reliability across robotic manipulation, self-driving, and multi-robot collaboration scenarios.
Authors:Yuqi Ye, Li You, Hao Xu, Ahmed Elzanaty, Kai-Kit Wong, Xiqi Gao
Abstract:
With the development of the upcoming sixth-generation (6G) wireless networks, there is a pressing need for innovative technologies capable of satisfying heightened performance indicators. Fluid antenna system (FAS) is proposed recently as a promising technique to achieve higher data rates and more diversity gains by dynamically changing the positions of the antennas to form a more desirable channel. However, worries regarding the possibly harmful effects of electromagnetic (EM) radiation emitted by devices have arisen as a result of the rapid evolution of advanced techniques in wireless communication systems. Specific absorption rate (SAR) is a widely adopted metric to quantify EM radiation worldwide. In this paper, we investigate the SAR-aware multiuser multiple-input multiple-output (MIMO) communications assisted by FAS. In particular, a two-layer iterative algorithm is proposed to minimize the SAR value under signal-to-interference-plus-noise ratio (SINR) and FAS constraints. Moreover, the minimum weighted SINR maximization problem under SAR and FAS constraints is studied by finding its relationship with the SAR minimization problem. Simulation results verify that the proposed SAR-aware FAS design outperforms the adaptive backoff and fixed-position antenna designs.
中文摘要:本文提出了一种流体天线系统(FAS),通过SAR感知优化在6G多用户MIMO通信中提升性能,同时解决电磁辐射问题。
English Summary: The paper introduces a fluid antenna system (FAS) to enhance 6G network performance while addressing electromagnetic radiation concerns through SAR-aware optimization in multiuser MIMO communications.
Authors:Zhening Huang, Xiaoyang Wu, Fangcheng Zhong, Hengshuang Zhao, Matthias NieÃner, Joan Lasenby
Abstract:
We propose LiteReality, a novel pipeline that converts RGB-D scans of indoor environments into compact, realistic, and interactive 3D virtual replicas. LiteReality not only reconstructs scenes that visually resemble reality but also supports key features essential for graphics pipelines -- such as object individuality, articulation, high-quality physically based rendering materials, and physically based interaction. At its core, LiteReality first performs scene understanding and parses the results into a coherent 3D layout and objects with the help of a structured scene graph. It then reconstructs the scene by retrieving the most visually similar 3D artist-crafted models from a curated asset database. Next, the Material Painting module enhances realism by recovering high-quality, spatially varying materials. Finally, the reconstructed scene is integrated into a simulation engine with basic physical properties to enable interactive behavior. The resulting scenes are compact, editable, and fully compatible with standard graphics pipelines, making them suitable for applications in AR/VR, gaming, robotics, and digital twins. In addition, LiteReality introduces a training-free object retrieval module that achieves state-of-the-art similarity performance on the Scan2CAD benchmark, along with a robust material painting module capable of transferring appearances from images of any style to 3D assets -- even under severe misalignment, occlusion, and poor lighting. We demonstrate the effectiveness of LiteReality on both real-life scans and public datasets. Project page: https://litereality.github.io; Video: https://www.youtube.com/watch?v=ecK9m3LXg2c
Chinese: LiteReality 是一种创新流程,可将 RGB-D 扫描转换为紧凑、逼真且可交互的 3D 虚拟场景,具备物体独立性、关节运动和高质量材质特性,适用于增强现实、游戏和机器人等领域。
English: LiteReality is a novel pipeline that converts RGB-D scans into compact, realistic, and interactive 3D virtual replicas, supporting object individuality, articulation, and high-quality materials for applications in AR/VR, gaming, and robotics.
Authors:Yunhan Yang, Shuo Chen, Yukun Huang, Xiaoyang Wu, Yuan-Chen Guo, Edmund Y. Lam, Hengshuang Zhao, Tong He, Xihui Liu
Abstract:
Recent advancements in leveraging pre-trained 2D diffusion models achieve the generation of high-quality novel views from a single in-the-wild image. However, existing works face challenges in producing controllable novel views due to the lack of information from multiple views. In this paper, we present DreamComposer++, a flexible and scalable framework designed to improve current view-aware diffusion models by incorporating multi-view conditions. Specifically, DreamComposer++ utilizes a view-aware 3D lifting module to extract 3D representations of an object from various views. These representations are then aggregated and rendered into the latent features of target view through the multi-view feature fusion module. Finally, the obtained features of target view are integrated into pre-trained image or video diffusion models for novel view synthesis. Experimental results demonstrate that DreamComposer++ seamlessly integrates with cutting-edge view-aware diffusion models and enhances their abilities to generate controllable novel views from multi-view conditions. This advancement facilitates controllable 3D object reconstruction and enables a wide range of applications.
中文: DreamComposer++ 框架通过多视角条件整合与三维特征融合,增强了现有视图感知扩散模型的能力,实现了从单张图像生成可控新视角并促进三维物体重建。
English: DreamComposer++ is a framework that enhances view-aware diffusion models by integrating multi-view conditions through 3D lifting and feature fusion, enabling controllable novel view synthesis and 3D reconstruction from single images.
Authors:Kai Chen, Ruiyuan Gao, Lanqing Hong, Hang Xu, Xu Jia, Holger Caesar, Dengxin Dai, Bingbing Liu, Dzmitry Tsishkou, Songcen Xu, Chunjing Xu, Qiang Xu, Huchuan Lu, Dit-Yan Yeung
Abstract:
In this paper, we present details of the 1st W-CODA workshop, held in conjunction with the ECCV 2024. W-CODA aims to explore next-generation solutions for autonomous driving corner cases, empowered by state-of-the-art multimodal perception and comprehension techniques. 5 Speakers from both academia and industry are invited to share their latest progress and opinions. We collect research papers and hold a dual-track challenge, including both corner case scene understanding and generation. As the pioneering effort, we will continuously bridge the gap between frontier autonomous driving techniques and fully intelligent, reliable self-driving agents robust towards corner cases.
中文摘要:首届W-CODA工作坊在ECCV 2024期间举办,聚焦通过多模态AI技术解决自动驾驶极端案例,包含学术研讨和场景理解与生成的双轨挑战。
English Summary: The 1st W-CODA workshop at ECCV 2024 focuses on advancing autonomous driving solutions for corner cases through multimodal AI, featuring expert talks and dual-track challenges on scene understanding and generation.
Authors:Chuhao Xu, Zijun Li, Quan Chen, Han Zhao, Minyi Guo
Abstract:
The rise of LLMs has driven demand for private serverless deployments, characterized by moderate-scale models and infrequent requests. While existing solutions follow exclusive GPU deployment, we take a step back to explore modern platforms and find that: Emerging CPU architectures with built-in accelerators are capable of serving LLMs but remain underutilized, and both CPUs and GPUs can accommodate multiple LLMs simultaneously.
We propose LLM-Mesh, a serverless inference scheme for small-to-mid-sized LLMs that enables elastic sharing across heterogeneous hardware. LLM-Mesh tackles three fundamental challenges: (1) precise, fine-grained compute resource allocation at token-level to handle fluctuating computational demands; (2) a coordinated and forward-looking memory scaling mechanism to detect out-of-memory hazards and reduce operational overhead; and (3) a dual approach that reduces resource fragmentation through proactive preemption and reactive bin-packing. Experimental results on 4 32-core CPUs and 4 A100 GPUs show that LLM-Meshimproves service capacity by 44% - 63% through sharing, while further leveraging CPUs boosts this to 91% - 159%.
中文: 随着大语言模型的兴起,私有无服务器部署需求增长,提出的LLM-Mesh系统实现了跨异构硬件的弹性共享,通过精细资源分配和内存管理将服务能力提升高达159%。
English: The emergence of LLMs has created a need for private serverless deployments, and the proposed LLM-Mesh system enables elastic sharing across heterogeneous hardware, improving service capacity by up to 159% through efficient resource allocation and memory management.
Authors:Deniz Gorur, Antonio Rago, Francesca Toni
Abstract:
Judgmental forecasting employs human opinions to make predictions about future events, rather than exclusively historical data as in quantitative forecasting. When these opinions form an argumentative structure around forecasts, it is useful to study the properties of the forecasts from an argumentative perspective. In this paper, we advocate and formally define a property of argumentative coherence, which, in essence, requires that a forecaster's reasoning is coherent with their forecast. We then conduct three evaluations with our notion of coherence. First, we assess the impact of enforcing coherence on human forecasters as well as on Large Language Model (LLM)-based forecasters, given that they have recently shown to be competitive with human forecasters. In both cases, we show that filtering out incoherent predictions improves forecasting accuracy consistently, supporting the practical value of coherence in both human and LLM-based forecasting. Then, via crowd-sourced user experiments, we show that, despite its apparent intuitiveness and usefulness, users do not generally align with this coherence property. This points to the need to integrate, within argumentation-based judgmental forecasting, mechanisms to filter out incoherent opinions before obtaining group forecasting predictions.
中文: 判断性预测依赖人类观点进行预测,本文提出了论证一致性的概念,证明过滤不一致预测能提高人类和大型语言模型的准确性,尽管用户通常不自然遵循这一特性。
English: Judgmental forecasting relies on human opinions for predictions, and this paper introduces the concept of argumentative coherence, demonstrating that filtering out incoherent forecasts improves accuracy for both humans and LLMs, despite users not naturally aligning with this property.
Authors:Abhishek Sawaika, Swetang Krishna, Tushar Tomar, Durga Pritam Suggisetti, Aditi Lal, Tanmaya Shrivastav, Nouhaila Innan, Muhammad Shafique
Abstract:
Rapid growth of digital transactions has led to a surge in fraudulent activities, challenging traditional detection methods in the financial sector. To tackle this problem, we introduce a specialised federated learning framework that uniquely combines a quantum-enhanced Long Short-Term Memory (LSTM) model with advanced privacy preserving techniques. By integrating quantum layers into the LSTM architecture, our approach adeptly captures complex cross-transactional patters, resulting in an approximate 5% performance improvement across key evaluation metrics compared to conventional models. Central to our framework is "FedRansel", a novel method designed to defend against poisoning and inference attacks, thereby reducing model degradation and inference accuracy by 4-8%, compared to standard differential privacy mechanisms. This pseudo-centralised setup with a Quantum LSTM model, enhances fraud detection accuracy and reinforces the security and confidentiality of sensitive financial data.
中文: 该联邦学习框架结合量子增强LSTM模型,将欺诈检测性能提升约5%,并通过FedRansel方法强化安全防护,相比传统隐私保护机制有效降低了模型退化风险。
English: The proposed federated learning framework integrates quantum-enhanced LSTM to improve fraud detection performance by approximately 5% while introducing FedRansel to enhance security against attacks, outperforming traditional privacy methods.
Authors:Shreyansh Pathak, Sonu Shreshtha, Richa Singh, Mayank Vatsa
Abstract:
The widespread adoption of voice-enabled authentication and audio biometric systems have significantly increased privacy vulnerabilities associated with sensitive speech data. Compliance with privacy regulations such as GDPR's right to be forgotten and India's DPDP Act necessitates targeted and efficient erasure of individual-specific voice signatures from already-trained biometric models. Existing unlearning methods designed for visual data inadequately handle the sequential, temporal, and high-dimensional nature of audio signals, leading to ineffective or incomplete speaker and accent erasure. To address this, we introduce QPAudioEraser, a quantum-inspired audio unlearning framework. Our our-phase approach involves: (1) weight initialization using destructive interference to nullify target features, (2) superposition-based label transformations that obscure class identity, (3) an uncertainty-maximizing quantum loss function, and (4) entanglement-inspired mixing of correlated weights to retain model knowledge. Comprehensive evaluations with ResNet18, ViT, and CNN architectures across AudioMNIST, Speech Commands, LibriSpeech, and Speech Accent Archive datasets validate QPAudioEraser's superior performance. The framework achieves complete erasure of target data (0% Forget Accuracy) while incurring minimal impact on model utility, with a performance degradation on retained data as low as 0.05%. QPAudioEraser consistently surpasses conventional baselines across single-class, multi-class, sequential, and accent-level erasure scenarios, establishing the proposed approach as a robust privacy-preserving solution.
中文:QPAudioEraser框架通过量子启发的技术,有效从已训练的生物特征模型中删除个人语音数据,在确保完全符合隐私法规的同时,对模型性能影响极小。
English: The QPAudioEraser framework effectively removes individual voice data from trained biometric models using quantum-inspired techniques, ensuring full privacy compliance with minimal impact on model performance.
Authors:Le Liang, Hao Ye, Yucheng Sheng, Ouya Wang, Jiacheng Wang, Shi Jin, Geoffrey Ye Li
Abstract:
The emergence of large language models (LLMs) has revolutionized artificial intelligence, offering unprecedented capabilities in reasoning, generalization, and zero-shot learning. These strengths open new frontiers in wireless communications, where increasing complexity and dynamics demand intelligent and adaptive solutions. This article explores the role of LLMs in transforming wireless systems across three key directions: adapting pretrained LLMs for core communication tasks, developing wireless-specific foundation models to balance versatility and efficiency, and enabling agentic LLMs with autonomous reasoning and coordination capabilities. We highlight recent advances, practical case studies, and the unique benefits of LLM-based approaches over traditional methods. Finally, we outline open challenges and research opportunities, including multimodal fusion, collaboration with lightweight models, and self-improving capabilities, charting a path toward intelligent, adaptive, and autonomous wireless networks of the future.
中文: 大语言模型通过预训练模型适配、通信专用基础模型开发以及智能体自主推理三大方向,正在推动无线通信系统向智能自适应网络变革。
English: Large language models are revolutionizing wireless communications by enabling intelligent solutions for complex tasks through adaptation, specialized foundation models, and autonomous reasoning capabilities.
Authors:Chloe Ho, Ishneet Sukhvinder Singh, Diya Sharma, Tanvi Reddy Anumandla, Michael Lu, Vasu Sharma, Kevin Zhu
Abstract:
Search algorithms and user query relevance have given LLMs the ability to return relevant information, but the effect of content phrasing on ad visibility remains underexplored. We investigate how LLM-based rewriting of advertisements can improve their ranking in retrieval systems and inclusion in generated LLM responses, without modifying the retrieval model itself. We introduce a supervised fine-tuning framework with a custom loss balancing semantic relevance and content fidelity. To evaluate effectiveness, we propose two metrics: DeltaMRR@K (ranking improvement) and DeltaDIR@K (inclusion frequency improvement). Our approach presents a scalable method to optimize ad phrasing, enhancing visibility in retrieval-based LLM workflows. Experiments across both instruction-based and few-shot prompting demonstrate that PPO trained models outperform both prompt engineering and supervised fine-tuning in most cases, achieving up to a 2.79 DeltaDIR@5 and 0.0073 DeltaMRR@5 in instruction-based prompting. These results highlight the importance of how the ad is written before retrieval and prompt format and reinforcement learning in effective ad rewriting for LLM integrated retrieval systems.
Chinese Summary: 本研究提出了一种带有自定义损失函数的监督微调框架,通过优化广告措辞,证明强化学习训练的模型能显著提升广告在LLM检索系统中的可见性,有效改善排名和包含频率指标。
English Summary: This study introduces a supervised fine-tuning framework with a custom loss function to optimize ad phrasing, demonstrating that reinforcement learning-trained models significantly enhance ad visibility in LLM retrieval systems by improving both ranking and inclusion metrics.
Authors:Arpan Dasgupta, Sarvesh Gharat, Neha Madhiwalla, Aparna Hegde, Milind Tambe, Aparna Taneja
Abstract:
Automated voice calls with health information are a proven method for disseminating maternal and child health information among beneficiaries and are deployed in several programs around the world. However, these programs often suffer from beneficiary dropoffs and poor engagement. In previous work, through real-world trials, we showed that an AI model, specifically a restless bandit model, could identify beneficiaries who would benefit most from live service call interventions, preventing dropoffs and boosting engagement. However, one key question has remained open so far: does such improved listenership via AI-targeted interventions translate into beneficiaries' improved knowledge and health behaviors? We present a first study that shows not only listenership improvements due to AI interventions, but also simultaneously links these improvements to health behavior changes. Specifically, we demonstrate that AI-scheduled interventions, which enhance listenership, lead to statistically significant improvements in beneficiaries' health behaviors such as taking iron or calcium supplements in the postnatal period, as well as understanding of critical health topics during pregnancy and infancy. This underscores the potential of AI to drive meaningful improvements in maternal and child health.
中文: 人工智能定向语音呼叫不仅能提高听众参与度,还能显著改善母婴健康行为,如补充剂摄入和健康知识掌握。
English: AI-targeted voice calls not only increase listenership but also significantly improve maternal and child health behaviors, such as supplement intake and health knowledge.
Authors:Vidhi Oad, Param Pathak, Nouhaila Innan, Shalini D, Muhammad Shafique
Abstract:
Forecasting in financial markets remains a significant challenge due to their nonlinear and regime-dependent dynamics. Traditional deep learning models, such as long short-term memory networks and multilayer perceptrons, often struggle to generalize across shifting market conditions, highlighting the need for a more adaptive and interpretable approach. To address this, we introduce Kolmogorov-Arnold networks for stock prediction and explainable regimes (KASPER), a novel framework that integrates regime detection, sparse spline-based function modeling, and symbolic rule extraction. The framework identifies hidden market conditions using a Gumbel-Softmax-based mechanism, enabling regime-specific forecasting. For each regime, it employs Kolmogorov-Arnold networks with sparse spline activations to capture intricate price behaviors while maintaining robustness. Interpretability is achieved through symbolic learning based on Monte Carlo Shapley values, which extracts human-readable rules tailored to each regime. Applied to real-world financial time series from Yahoo Finance, the model achieves an $R^2$ score of 0.89, a Sharpe Ratio of 12.02, and a mean squared error as low as 0.0001, outperforming existing methods. This research establishes a new direction for regime-aware, transparent, and robust forecasting in financial markets.
Chinese: KASPER框架通过引入Kolmogorov-Arnold网络、状态检测和符号规则提取,实现了自适应且可解释的股票预测,以0.89的R²和12.02的夏普比率显著优于传统模型。
English: The KASPER framework introduces Kolmogorov-Arnold networks with regime detection and symbolic rule extraction to achieve adaptive, interpretable stock prediction, significantly outperforming traditional models with an R² of 0.89 and a Sharpe Ratio of 12.02.
Authors:Yanzuo Lu, Yuxi Ren, Xin Xia, Shanchuan Lin, Xing Wang, Xuefeng Xiao, Andy J. Ma, Xiaohua Xie, Jian-Huang Lai
Abstract:
Distribution Matching Distillation (DMD) is a promising score distillation technique that compresses pre-trained teacher diffusion models into efficient one-step or multi-step student generators. Nevertheless, its reliance on the reverse Kullback-Leibler (KL) divergence minimization potentially induces mode collapse (or mode-seeking) in certain applications. To circumvent this inherent drawback, we propose Adversarial Distribution Matching (ADM), a novel framework that leverages diffusion-based discriminators to align the latent predictions between real and fake score estimators for score distillation in an adversarial manner. In the context of extremely challenging one-step distillation, we further improve the pre-trained generator by adversarial distillation with hybrid discriminators in both latent and pixel spaces. Different from the mean squared error used in DMD2 pre-training, our method incorporates the distributional loss on ODE pairs collected from the teacher model, and thus providing a better initialization for score distillation fine-tuning in the next stage. By combining the adversarial distillation pre-training with ADM fine-tuning into a unified pipeline termed DMDX, our proposed method achieves superior one-step performance on SDXL compared to DMD2 while consuming less GPU time. Additional experiments that apply multi-step ADM distillation on SD3-Medium, SD3.5-Large, and CogVideoX set a new benchmark towards efficient image and video synthesis.
中文摘要:DMDX作为一种改进的对抗蒸馏框架,通过引入基于扩散的判别器和分布损失克服了DMD2的模式塌陷缺陷,在SDXL上以更少GPU时间实现更优的单步生成性能,同时为高效图像和视频合成树立了新标杆。
English Summary: DMDX is an enhanced adversarial distillation framework that overcomes the mode collapse limitation of DMD2 by incorporating diffusion-based discriminators and distributional loss, achieving superior one-step performance on SDXL with reduced GPU time while setting new benchmarks in efficient image and video synthesis.
Authors:Minxian Xu, Junhan Liao, Jingfeng Wu, Yiyuan He, Kejiang Ye, Chengzhong Xu
Abstract:
Large Language Models (LLMs) are revolutionizing numerous industries, but their substantial computational demands create challenges for efficient deployment, particularly in cloud environments. Traditional approaches to inference serving often struggle with resource inefficiencies, leading to high operational costs, latency issues, and limited scalability. This article explores how Cloud Native technologies, such as containerization, microservices, and dynamic scheduling, can fundamentally improve LLM inference serving. By leveraging these technologies, we demonstrate how a Cloud Native system enables more efficient resource allocation, reduces latency, and enhances throughput in high-demand scenarios. Through real-world evaluations using Kubernetes-based autoscaling, we show that Cloud Native architectures can dynamically adapt to workload fluctuations, mitigating performance bottlenecks while optimizing LLM inference serving performance. This discussion provides a broader perspective on how Cloud Native frameworks could reshape the future of scalable LLM inference serving, offering key insights for researchers, practitioners, and industry leaders in cloud computing and artificial intelligence.
中文: 云原生技术通过容器化和基于Kubernetes的自动扩缩容,能够优化大语言模型推理服务的资源效率、降低延迟并实现动态可扩展性。
English: Cloud Native technologies enhance LLM inference serving by improving resource efficiency, reducing latency, and enabling dynamic scalability through containerization and Kubernetes-based autoscaling.
Authors:Jingfeng Wu, Yiyuan He, Minxian Xu, Xitong Gao, Kejiang Ye, Chengzhong Xu
Abstract:
The rise of large language models (LLMs) has created new opportunities across various fields but has also introduced significant challenges in resource management. Current LLM serving systems face a fundamental tension: balancing serving demands with limited resources while adapting to unpredictable traffic patterns. Static deployments lead to suboptimal resource utilization and performance degradation under dynamic workloads. Furthermore, the high cost of adjusting instances hinders dynamic scaling, limiting the true potential of efficient LLM serving.
To address this, we propose CoCoServe, an elastic system that facilitates dynamic and fine-grained scaling. Its key innovation lies in the module-level operations for the replication and migration of LLM modules, such as decoder layers and projections. Through a comprehensive analysis of the trade-offs associated with these operations, we develop an auto-scaling mechanism that dynamically regulates module-level resource allocation and performance optimization, enabling a more cost-effective deployment of LLMs. Our evaluation demonstrates that the scaling operations employed by CoCoServe exhibit excellent scalability and can reduce costs by 46% while maintaining availability. Compared to state-of-the-art LLM serving systems (e.g., Hugging Face Transformers and vLLM), our approach reduces latency by 14%-75% and achieves 1.16x-4x throughput on average across different model sizes and workloads.
Chinese: CoCoServe提出了一种弹性系统,通过模块级动态扩展优化大型语言模型的资源分配与性能,在降低成本46%的同时,相比现有系统显著提升了响应速度与吞吐量。
English: CoCoServe introduces an elastic system for large language models that enables dynamic, module-level scaling to optimize resource allocation and performance, reducing costs by 46% while improving latency and throughput compared to existing systems.
Authors:Shengye Song, Minxian Xu, Zuowei Zhang, Chengxi Gao, Fansong Zeng, Yu Ding, Kejiang Ye, Chengzhong Xu
Abstract:
Microservices transform traditional monolithic applications into lightweight, loosely coupled application components and have been widely adopted in many enterprises. Cloud platform infrastructure providers enhance the resource utilization efficiency of microservices systems by co-locating different microservices. However, this approach also introduces resource competition and interference among microservices. Designing interference-aware strategies for large-scale, co-located microservice clusters is crucial for enhancing resource utilization and mitigating competition-induced interference. These challenges are further exacerbated by unreliable metrics, application diversity, and node heterogeneity.
In this paper, we first analyze the characteristics of large-scale and co-located microservices clusters at Alibaba and further discuss why cycle per instruction (CPI) is adopted as a metric for interference measurement in large-scale production clusters, as well as how to achieve accurate prediction of CPI through multi-dimensional metrics. Based on CPI interference prediction and analysis, we also present the design of the C-Koordinator platform, an open-source solution utilized in Alibaba cluster, which incorporates co-location and interference mitigation strategies. The interference prediction models consistently achieve over 90.3% accuracy, enabling precise prediction and rapid mitigation of interference in operational environments. As a result, application latency is reduced and stabilized across all percentiles (P50, P90, P99) response time (RT), achieving improvements ranging from 16.7% to 36.1% under various system loads compared with state-of-the-art system. These results demonstrate the system's ability to maintain smooth application performance in co-located environments.
中文摘要:微服务共置虽提升云平台资源效率但引发干扰,C-Koordinator平台通过基于CPI的预测模型实现超90.3%的准确率,使各百分位应用延迟降低16.7%-36.1%,有效保障共置环境性能稳定。
English Summary: Microservices co-location in cloud platforms improves resource efficiency but causes interference, which the C-Koordinator platform addresses through CPI-based prediction models achieving over 90.3% accuracy and reducing application latency by 16.7%-36.1% across all percentiles.
Authors:Paulo Mendes, Eva Maia, Isabel Praça
Abstract:
Phishing emails continue to pose a significant threat to cybersecurity by exploiting human vulnerabilities through deceptive content and malicious payloads. While Machine Learning (ML) models are effective at detecting phishing threats, their performance largely relies on the quality and diversity of the training data. This paper presents MeAJOR (Merged email Assets from Joint Open-source Repositories) Corpus, a novel, multi-source phishing email dataset designed to overcome critical limitations in existing resources. It integrates 135894 samples representing a broad number of phishing tactics and legitimate emails, with a wide spectrum of engineered features. We evaluated the dataset's utility for phishing detection research through systematic experiments with four classification models (RF, XGB, MLP, and CNN) across multiple feature configurations. Results highlight the dataset's effectiveness, achieving 98.34% F1 with XGB. By integrating broad features from multiple categories, our dataset provides a reusable and consistent resource, while addressing common challenges like class imbalance, generalisability and reproducibility.
Chinese Summary: 本文提出MeAJOR语料库,这是一个整合多来源钓鱼邮件和合法邮件的新型数据集,通过综合特征工程有效提升检测性能,在XGBoost分类器中实现了98.34%的F1分数,解决了现有数据集的泛化性和可复现性等关键问题。
English Summary: This paper introduces the MeAJOR Corpus, a comprehensive multi-source phishing email dataset that enhances detection capabilities by integrating diverse samples and engineered features, achieving 98.34% F1 score with XGBoost in evaluations.
Authors:Ao Liang, Lingdong Kong, Dongyue Lu, Youquan Liu, Jian Fang, Huaici Zhao, Wei Tsang Ooi
Abstract:
With the rise of robotics, LiDAR-based 3D object detection has garnered significant attention in both academia and industry. However, existing datasets and methods predominantly focus on vehicle-mounted platforms, leaving other autonomous platforms underexplored. To bridge this gap, we introduce Pi3DET, the first benchmark featuring LiDAR data and 3D bounding box annotations collected from multiple platforms: vehicle, quadruped, and drone, thereby facilitating research in 3D object detection for non-vehicle platforms as well as cross-platform 3D detection. Based on Pi3DET, we propose a novel cross-platform adaptation framework that transfers knowledge from the well-studied vehicle platform to other platforms. This framework achieves perspective-invariant 3D detection through robust alignment at both geometric and feature levels. Additionally, we establish a benchmark to evaluate the resilience and robustness of current 3D detectors in cross-platform scenarios, providing valuable insights for developing adaptive 3D perception systems. Extensive experiments validate the effectiveness of our approach on challenging cross-platform tasks, demonstrating substantial gains over existing adaptation methods. We hope this work paves the way for generalizable and unified 3D perception systems across diverse and complex environments. Our Pi3DET dataset, cross-platform benchmark suite, and annotation toolkit have been made publicly available.
中文: 本文提出了首个多平台LiDAR数据集Pi3DET及基准测试,开发了通过几何与特征对齐实现视角不变检测的跨平台适配框架,在跨平台任务中显著优于现有方法。
English: This paper introduces Pi3DET, the first multi-platform LiDAR dataset and benchmark for 3D object detection, along with a novel cross-platform adaptation framework that achieves perspective-invariant detection through geometric and feature alignment, demonstrating superior performance in cross-platform scenarios.
Authors:Jianmin Hu, Minxian Xu, Kejiang Ye, Chengzhong Xu
Abstract:
In recent years, the Mixture-of-Experts (MoE) architecture has been widely applied to large language models (LLMs), providing a promising solution that activates only a subset of the model's parameters during computation, thereby reducing overall memory requirements and allowing for faster inference compared to dense models. Despite these advantages, existing systems still face issues of low efficiency due to static model placement and lack of dynamic workloads adaptation. This leads to suboptimal resource utilization and increased latency, especially during bursty requests periods.
To address these challenges, this paper introduces BrownoutServe, a novel serving framework designed to optimize inference efficiency and maintain service reliability for MoE-based LLMs under dynamic computational demands and traffic conditions. BrownoutServe introduces "united experts" that integrate knowledge from multiple experts, reducing the times of expert access and inference latency. Additionally, it proposes a dynamic brownout mechanism to adaptively adjust the processing of certain tokens, optimizing inference performance while guaranteeing service level objectives (SLOs) are met. Our evaluations show the effectiveness of BrownoutServe under various workloads: it achieves up to 2.07x throughput improvement compared to vLLM and reduces SLO violations by 90.28%, showcasing its robustness under bursty traffic while maintaining acceptable inference accuracy.
中文摘要:BrownoutServe是一种新颖的服务框架,通过引入联合专家和动态调节机制来优化混合专家大语言模型的推理效率,在不同工作负载下实现了显著的吞吐量提升和延迟降低。
English Summary: BrownoutServe is a novel serving framework that enhances inference efficiency for Mixture-of-Experts LLMs by introducing united experts and dynamic adaptation mechanisms, achieving significant throughput improvements and reduced latency under varying workloads.
Authors:Minxian Xu, Linfeng Wen, Junhan Liao, Huaming Wu, Kejiang Ye, Chengzhong Xu
Abstract:
The interactions within cloud-native applications are complex, with a constantly changing number of services and loads, posing higher demands on auto-scaling approach. This mainly involves several challenges such as microservices dependency analysis, performance profiling, anomaly detection, workload characterization and task co-location. Therefore, some advanced algorithms have been investigated into auto-scaling cloud-native applications to optimize system and application performance. These algorithms can learn from historical data and appropriately adjust resource allocation based on the current environment and load conditions to optimize resource utilization and system performance. In this paper, we systematically review the literature on state-of-the-art auto-scaling approaches for cloud-native applications from 2020, and further explore the technological evolution. Additionally, we propose a detailed taxonomy to categorize current research from five perspectives, including infrastructure, architecture, scaling methods, optimization objectives, and behavior modeling. Then, we provide a comprehensive comparison and in-depth discussion of the key features, advantages, limitations, and application scenarios of each approach, considering their performance in diverse environments and under various conditions. Finally, we summarize the current state of research in this field, identify the gaps and unresolved challenges, and emphasize promising directions for future exploration, particularly in areas such as the application of large models, microservice dependency management, and the use of meta-learning techniques to enhance model applicability and adaptability across different environments.
中文摘要:本文系统综述了2020年以来云原生应用的先进自动扩缩容方法,通过多维度分类比较了不同方案的特征与局限,并指出大模型应用、微服务依赖管理等未来研究方向。
English Summary: This paper systematically reviews and categorizes advanced auto-scaling approaches for cloud-native applications since 2020, analyzing their features and limitations while identifying future research directions including large model applications and dependency management.
Authors:Wanyi Zheng, Minxian Xu, Shengye Song, Kejiang Ye
Abstract:
Large language models (LLMs) have become increasingly popular in various areas, traditional business gradually shifting from rule-based systems to LLM-based solutions. However, the inference of LLMs is resource-intensive or latency-sensitive, posing significant challenges for serving systems. Existing LLM serving systems often use static or continuous batching strategies, which can lead to inefficient GPU memory utilization and increased latency, especially under heterogeneous workloads. These methods may also struggle to adapt to dynamic workload fluctuations, resulting in suboptimal throughput and potential service level objective (SLO) violations. In this paper, we introduce BucketServe, a bucket-based dynamic batching framework designed to optimize LLM inference performance. By grouping requests into size-homogeneous buckets based on sequence length, BucketServe minimizes padding overhead and optimizes GPU memory usage through real-time batch size adjustments preventing out-of-memory (OOM) errors. It introduces adaptive bucket splitting/merging and priority-aware scheduling to mitigate resource fragmentation and ensure SLO compliance. Experiment shows that BucketServe significantly outperforms UELLM in throughput, achieving up to 3.58x improvement. It can also handle 1.93x more request load under the SLO attainment of 80% compared with DistServe and demonstrates 1.975x higher system load capacity compared to the UELLM.
中文: BucketServe是一种基于分桶的动态批处理框架,通过按序列长度将请求分组到同质桶中,减少填充开销并优化GPU内存使用,相比现有系统显著提升了吞吐量并更好地满足了服务等级目标。
English: BucketServe is a dynamic batching framework that optimizes LLM inference by grouping requests into size-homogeneous buckets, reducing padding overhead and improving GPU memory utilization, resulting in significantly higher throughput and better SLO compliance compared to existing systems.
Authors:Shuai Chen, Fanman Meng, Chunjin Yang, Haoran Wei, Chenhao Wu, Qingbo Wu, Hongliang Li
Abstract:
Cross-Domain Few-Shot Segmentation (CD-FSS) remains challenging due to limited data and domain shifts. Recent foundation models like the Segment Anything Model (SAM) have shown remarkable zero-shot generalization capability in general segmentation tasks, making it a promising solution for few-shot scenarios. However, adapting SAM to CD-FSS faces two critical challenges: reliance on manual prompt and limited cross-domain ability. Therefore, we propose the Composable Meta-Prompt (CMP) framework that introduces three key modules: (i) the Reference Complement and Transformation (RCT) module for semantic expansion, (ii) the Composable Meta-Prompt Generation (CMPG) module for automated meta-prompt synthesis, and (iii) the Frequency-Aware Interaction (FAI) module for domain discrepancy mitigation. Evaluations across four cross-domain datasets demonstrate CMP's state-of-the-art performance, achieving 71.8\% and 74.5\% mIoU in 1-shot and 5-shot scenarios respectively.
跨域少样本分割因数据有限和领域偏移而具挑战性,但提出的可组合元提示框架通过自动生成提示和提升跨域能力克服了这些难题,在单样本和五样本场景下分别实现了71.8%和74.5%的mIoU顶尖性能。
Cross-domain few-shot segmentation is challenging due to limited data and domain shifts, but the proposed Composable Meta-Prompt framework overcomes these by automating prompt generation and enhancing cross-domain performance, achieving state-of-the-art results of 71.8% and 74.5% mIoU in 1-shot and 5-shot settings.
Authors:Shuai Chen, Fanman Meng, Xiwei Zhang, Haoran Wei, Chenhao Wu, Qingbo Wu, Hongliang Li
Abstract:
This paper presents DFR (Decompose, Fuse and Reconstruct), a novel framework that addresses the fundamental challenge of effectively utilizing multi-modal guidance in few-shot segmentation (FSS). While existing approaches primarily rely on visual support samples or textual descriptions, their single or dual-modal paradigms limit exploitation of rich perceptual information available in real-world scenarios. To overcome this limitation, the proposed approach leverages the Segment Anything Model (SAM) to systematically integrate visual, textual, and audio modalities for enhanced semantic understanding. The DFR framework introduces three key innovations: 1) Multi-modal Decompose: a hierarchical decomposition scheme that extracts visual region proposals via SAM, expands textual semantics into fine-grained descriptors, and processes audio features for contextual enrichment; 2) Multi-modal Contrastive Fuse: a fusion strategy employing contrastive learning to maintain consistency across visual, textual, and audio modalities while enabling dynamic semantic interactions between foreground and background features; 3) Dual-path Reconstruct: an adaptive integration mechanism combining semantic guidance from tri-modal fused tokens with geometric cues from multi-modal location priors. Extensive experiments across visual, textual, and audio modalities under both synthetic and real settings demonstrate DFR's substantial performance improvements over state-of-the-art methods.
中文: 本文提出的DFR框架通过分层分解、对比融合和双路径重构,系统整合视觉、文本和音频模态,克服了少样本分割中单/双模态方法的局限,在合成与真实场景下均显著优于现有先进方法。
English: This paper introduces DFR, a novel framework that overcomes the limitations of single or dual-modal approaches in few-shot segmentation by systematically integrating visual, textual, and audio modalities through hierarchical decomposition, contrastive fusion, and dual-path reconstruction, achieving significant performance improvements over existing methods.
Authors:Arpan Dasgupta, Mizhaan Maniyar, Awadhesh Srivastava, Sanat Kumar, Amrita Mahale, Aparna Hedge, Arun Suggala, Karthikeyan Shanmugam, Aparna Taneja, Milind Tambe
Abstract:
Mobile health (mHealth) programs utilize automated voice messages to deliver health information, particularly targeting underserved communities, demonstrating the effectiveness of using mobile technology to disseminate crucial health information to these populations, improving health outcomes through increased awareness and behavioral change. India's Kilkari program delivers vital maternal health information via weekly voice calls to millions of mothers. However, the current random call scheduling often results in missed calls and reduced message delivery. This study presents a field trial of a collaborative bandit algorithm designed to optimize call timing by learning individual mothers' preferred call times. We deployed the algorithm with around $6500$ Kilkari participants as a pilot study, comparing its performance to the baseline random calling approach. Our results demonstrate a statistically significant improvement in call pick-up rates with the bandit algorithm, indicating its potential to enhance message delivery and impact millions of mothers across India. This research highlights the efficacy of personalized scheduling in mobile health interventions and underscores the potential of machine learning to improve maternal health outreach at scale.
中文: 本研究在印度Kilkari移动健康项目中测试了一种协同赌博算法来优化呼叫时间,显著提高了接听率,展示了个性化排程如何有效改善孕产妇健康信息的传递效果。
English: This study tested a collaborative bandit algorithm to optimize call timing in India's Kilkari mHealth program, significantly increasing call pick-up rates and demonstrating how personalized scheduling can enhance maternal health message delivery.
Authors:Lingyu Li, Yang Yao, Yixu Wang, Chubo Li, Yan Teng, Yingchun Wang
Abstract:
As Large Language Models (LLMs) continue to advance, they exhibit certain cognitive patterns similar to those of humans that are not directly specified in training data. This study investigates this phenomenon by focusing on temporal cognition in LLMs. Leveraging the similarity judgment task, we find that larger models spontaneously establish a subjective temporal reference point and adhere to the Weber-Fechner law, whereby the perceived distance logarithmically compresses as years recede from this reference point. To uncover the mechanisms behind this behavior, we conducted multiple analyses across neuronal, representational, and informational levels. We first identify a set of temporal-preferential neurons and find that this group exhibits minimal activation at the subjective reference point and implements a logarithmic coding scheme convergently found in biological systems. Probing representations of years reveals a hierarchical construction process, where years evolve from basic numerical values in shallow layers to abstract temporal orientation in deep layers. Finally, using pre-trained embedding models, we found that the training corpus itself possesses an inherent, non-linear temporal structure, which provides the raw material for the model's internal construction. In discussion, we propose an experientialist perspective for understanding these findings, where the LLMs' cognition is viewed as a subjective construction of the external world by its internal representational system. This nuanced perspective implies the potential emergence of alien cognitive frameworks that humans cannot intuitively predict, pointing toward a direction for AI alignment that focuses on guiding internal constructions. Our code is available at https://TheOtherMind.github.io.
中文: 本研究通过神经元、表征和信息层面的分析发现,大语言模型能自发形成主观时间参照点并遵循对数时间压缩规律,这种类人时间认知源于训练语料的非线性时间结构,揭示了模型通过内部表征系统主观建构外部世界的特点。
English: This study reveals that large language models spontaneously develop human-like temporal cognition, including subjective reference points and logarithmic time compression, through analyses showing specialized neurons and hierarchical representations shaped by non-linear training data.
Authors:Kailai Yang, Xiao Liu, Lei Ji, Hao Li, Yeyun Gong, Peng Cheng, Mao Yang
Abstract:
Continual pre-training on small-scale task-specific data is an effective method for improving large language models in new target fields, yet it risks catastrophic forgetting of their original capabilities. A common solution is to re-weight training data mixtures from source and target fields on a domain space to achieve balanced performance. Previous domain reweighting strategies rely on manual designation with certain heuristics based on human intuition or empirical results. In this work, we prove that more general heuristics can be parameterized by proposing Data Mixing Agent, the first model-based, end-to-end framework that learns to re-weight domains. The agent learns generalizable heuristics through reinforcement learning on large quantities of data mixing trajectories with corresponding feedback from an evaluation environment. Experiments in continual pre-training on math reasoning show that Data Mixing Agent outperforms strong baselines in achieving balanced performance across source and target field benchmarks. Furthermore, it generalizes well across unseen source fields, target models, and domain spaces without retraining. Direct application to the code generation field also indicates its adaptability across target domains. Further analysis showcases the agents' well-aligned heuristics with human intuitions and their efficiency in achieving superior model performance with less source-field data.
中文: Data Mixing Agent 是一个基于强化学习的端到端框架,能自动调整源领域和目标领域的训练数据权重,在数学推理和代码生成任务中优于基线方法,有效缓解灾难性遗忘问题。
English: The Data Mixing Agent is an end-to-end framework that uses reinforcement learning to automatically balance training data from source and target fields, outperforming baselines in math reasoning and code generation while mitigating catastrophic forgetting.
Authors:Ruijie Zhu, Mulin Yu, Linning Xu, Lihan Jiang, Yixuan Li, Tianzhu Zhang, Jiangmiao Pang, Bo Dai
Abstract:
3D Gaussian Splatting is renowned for its high-fidelity reconstructions and real-time novel view synthesis, yet its lack of semantic understanding limits object-level perception. In this work, we propose ObjectGS, an object-aware framework that unifies 3D scene reconstruction with semantic understanding. Instead of treating the scene as a unified whole, ObjectGS models individual objects as local anchors that generate neural Gaussians and share object IDs, enabling precise object-level reconstruction. During training, we dynamically grow or prune these anchors and optimize their features, while a one-hot ID encoding with a classification loss enforces clear semantic constraints. We show through extensive experiments that ObjectGS not only outperforms state-of-the-art methods on open-vocabulary and panoptic segmentation tasks, but also integrates seamlessly with applications like mesh extraction and scene editing. Project page: https://ruijiezhu94.github.io/ObjectGS_page
中文: ObjectGS提出了一种对象感知框架,通过将语义理解融入3D高斯泼溅技术,在分割任务和场景编辑等应用中实现了更优的性能。
English: ObjectGS introduces an object-aware framework that enhances 3D Gaussian Splatting by incorporating semantic understanding, enabling superior performance in segmentation tasks and practical applications like scene editing.
Authors:Fan Li, Zanyi Wang, Zeyi Huang, Guang Dai, Jingdong Wang, Mengmeng Wang
Abstract:
3D visual grounding allows an embodied agent to understand visual information in real-world 3D environments based on human instructions, which is crucial for embodied intelligence. Existing 3D visual grounding methods typically rely on separate encoders for different modalities (e.g., RGB images, text, and 3D point clouds), resulting in large and complex models that are inefficient to train. While some approaches use pre-trained 2D multi-modal models like CLIP for 3D tasks, they still struggle with aligning point cloud data to 2D encoders. As a result, these methods continue to depend on 3D encoders for feature extraction, further increasing model complexity and training inefficiency. In this paper, we propose a unified 2D pre-trained multi-modal network to process all three modalities (RGB images, text, and point clouds), significantly simplifying the architecture. By leveraging a 2D CLIP bi-modal model with adapter-based fine-tuning, this framework effectively adapts to the tri-modal setting, improving both adaptability and performance across modalities. Our Geometric-Aware 2D-3D Feature Recovery and Fusion (GARF) module is designed to fuse geometric multi-scale features from point clouds and images. We then integrate textual features for final modality fusion and introduce a multi-modal decoder to facilitate deep cross-modal understanding. Together, our method achieves unified feature extraction and fusion across the three modalities, enabling an end-to-end 3D visual grounding model. Compared to the baseline, our method reduces the number of trainable parameters by approximately 58\%, while achieving a 6.52\% improvement in the 3D detection task and a 6.25\% improvement in the 3D visual grounding task.
中文: 本文提出了一种统一的2D预训练多模态网络,通过单一模型处理RGB图像、文本和点云,简化了3D视觉定位,在减少58%可训练参数的同时,将任务性能提升了6%以上。
English: This paper introduces a unified 2D pre-trained multi-modal network that simplifies 3D visual grounding by processing RGB images, text, and point clouds through a single model, reducing trainable parameters by 58% while improving task performance by over 6%.
Authors:Lilit Grigoryan, Nikolay Karpov, Enas Albasiri, Vitaly Lavrukhin, Boris Ginsburg
Abstract:
Despite Arabic being one of the most widely spoken languages, the development of Arabic Automatic Speech Recognition (ASR) systems faces significant challenges due to the language's complexity, and only a limited number of public Arabic ASR models exist. While much of the focus has been on Modern Standard Arabic (MSA), there is considerably less attention given to the variations within the language. This paper introduces a universal methodology for Arabic speech and text processing designed to address unique challenges of the language. Using this methodology, we train two novel models based on the FastConformer architecture: one designed specifically for MSA and the other, the first unified public model for both MSA and Classical Arabic (CA). The MSA model sets a new benchmark with state-of-the-art (SOTA) performance on related datasets, while the unified model achieves SOTA accuracy with diacritics for CA while maintaining strong performance for MSA. To promote reproducibility, we open-source the models and their training recipes.
中文: 本文提出了一种通用的阿拉伯语语音与文本处理方法,训练出两个基于FastConformer的创新模型——一个专用于现代标准阿拉伯语,另一个是首个统一处理现代标准阿拉伯语和古典阿拉伯语的公共模型,均实现了最优性能并开源以促进可复现性。
English: This paper introduces a universal methodology for Arabic speech and text processing, training two novel FastConformer-based models—one for Modern Standard Arabic and the first unified model for both MSA and Classical Arabic—that achieve state-of-the-art performance and are open-sourced for reproducibility.
Authors:Yuechen Xie, Haobo Jiang, Jian Yang, Yigong Zhang, Jin Xie
Abstract:
In 3D hand-object interaction (HOI) tasks, estimating precise joint poses of hands and objects from monocular RGB input remains highly challenging due to the inherent geometric ambiguity of RGB images and the severe mutual occlusions that occur during interaction.To address these challenges, we propose MaskHOI, a novel Masked Autoencoder (MAE)-driven pretraining framework for enhanced HOI pose estimation. Our core idea is to leverage the masking-then-reconstruction strategy of MAE to encourage the feature encoder to infer missing spatial and structural information, thereby facilitating geometric-aware and occlusion-robust representation learning. Specifically, based on our observation that human hands exhibit far greater geometric complexity than rigid objects, conventional uniform masking fails to effectively guide the reconstruction of fine-grained hand structures. To overcome this limitation, we introduce a Region-specific Mask Ratio Allocation, primarily comprising the region-specific masking assignment and the skeleton-driven hand masking guidance. The former adaptively assigns lower masking ratios to hand regions than to rigid objects, balancing their feature learning difficulty, while the latter prioritizes masking critical hand parts (e.g., fingertips or entire fingers) to realistically simulate occlusion patterns in real-world interactions. Furthermore, to enhance the geometric awareness of the pretrained encoder, we introduce a novel Masked Signed Distance Field (SDF)-driven multimodal learning mechanism. Through the self-masking 3D SDF prediction, the learned encoder is able to perceive the global geometric structure of hands and objects beyond the 2D image plane, overcoming the inherent limitations of monocular input and alleviating self-occlusion issues. Extensive experiments demonstrate that our method significantly outperforms existing state-of-the-art approaches.
中文:MaskHOI提出了一种基于掩码自编码器的预训练框架,通过区域特定掩蔽和符号距离场多模态学习,有效提升了三维手物交互姿态估计的几何感知与遮挡鲁棒性。
English: MaskHOI introduces a masked autoencoder framework with region-specific masking and SDF-based multimodal learning to enhance 3D hand-object pose estimation by improving geometric awareness and occlusion robustness.
Authors:Shiye Cao, Maia Stiber, Amama Mahmood, Maria Teresa Parreira, Wendy Ju, Micol Spitale, Hatice Gunes, Chien-Ming Huang
Abstract:
The integration of large language models (LLMs) into conversational robots has made human-robot conversations more dynamic. Yet, LLM-powered conversational robots remain prone to errors, e.g., misunderstanding user intent, prematurely interrupting users, or failing to respond altogether. Detecting and addressing these failures is critical for preventing conversational breakdowns, avoiding task disruptions, and sustaining user trust. To tackle this problem, the ERR@HRI 2.0 Challenge provides a multimodal dataset of LLM-powered conversational robot failures during human-robot conversations and encourages researchers to benchmark machine learning models designed to detect robot failures. The dataset includes 16 hours of dyadic human-robot interactions, incorporating facial, speech, and head movement features. Each interaction is annotated with the presence or absence of robot errors from the system perspective, and perceived user intention to correct for a mismatch between robot behavior and user expectation. Participants are invited to form teams and develop machine learning models that detect these failures using multimodal data. Submissions will be evaluated using various performance metrics, including detection accuracy and false positive rate. This challenge represents another key step toward improving failure detection in human-robot interaction through social signal analysis.
中文: ERR@HRI 2.0挑战赛通过提供多模态对话机器人故障数据集,推动机器学习模型在检测人机交互故障方面的研究,以增强系统可靠性和用户信任度。
English: The ERR@HRI 2.0 Challenge introduces a multimodal dataset of conversational robot failures to advance machine learning models for detecting errors in human-robot interactions, thereby improving reliability and user trust.
Authors:Shiye Cao, Maia Stiber, Amama Mahmood, Maria Teresa Parreira, Wendy Ju, Micol Spitale, Hatice Gunes, Chien-Ming Huang
Abstract:
The integration of large language models (LLMs) into conversational robots has made human-robot conversations more dynamic. Yet, LLM-powered conversational robots remain prone to errors, e.g., misunderstanding user intent, prematurely interrupting users, or failing to respond altogether. Detecting and addressing these failures is critical for preventing conversational breakdowns, avoiding task disruptions, and sustaining user trust. To tackle this problem, the ERR@HRI 2.0 Challenge provides a multimodal dataset of LLM-powered conversational robot failures during human-robot conversations and encourages researchers to benchmark machine learning models designed to detect robot failures. The dataset includes 16 hours of dyadic human-robot interactions, incorporating facial, speech, and head movement features. Each interaction is annotated with the presence or absence of robot errors from the system perspective, and perceived user intention to correct for a mismatch between robot behavior and user expectation. Participants are invited to form teams and develop machine learning models that detect these failures using multimodal data. Submissions will be evaluated using various performance metrics, including detection accuracy and false positive rate. This challenge represents another key step toward improving failure detection in human-robot interaction through social signal analysis.
中文: ERR@HRI 2.0挑战赛通过提供多模态对话机器人故障数据集,推动机器学习模型在检测人机交互故障方面的研究,以增强系统可靠性和用户信任度。
English: The ERR@HRI 2.0 Challenge introduces a multimodal dataset of conversational robot failures to advance machine learning models for detecting errors in human-robot interactions, thereby improving reliability and user trust.
Authors:Yaowen Hu, Wenxuan Tu, Yue Liu, Xinhang Wan, Junyi Yan, Taichun Zhou, Xinwang Liu
Abstract:
Deep graph clustering (DGC), which aims to unsupervisedly separate the nodes in an attribute graph into different clusters, has seen substantial potential in various industrial scenarios like community detection and recommendation. However, the real-world attribute graphs, e.g., social networks interactions, are usually large-scale and attribute-missing. To solve these two problems, we propose a novel DGC method termed \underline{\textbf{C}}omplementary \underline{\textbf{M}}ulti-\underline{\textbf{V}}iew \underline{\textbf{N}}eighborhood \underline{\textbf{D}}ifferentiation (\textit{CMV-ND}), which preprocesses graph structural information into multiple views in a complete but non-redundant manner. First, to ensure completeness of the structural information, we propose a recursive neighborhood search that recursively explores the local structure of the graph by completely expanding node neighborhoods across different hop distances. Second, to eliminate the redundancy between neighborhoods at different hops, we introduce a neighborhood differential strategy that ensures no overlapping nodes between the differential hop representations. Then, we construct $K+1$ complementary views from the $K$ differential hop representations and the features of the target node. Last, we apply existing multi-view clustering or DGC methods to the views. Experimental results on six widely used graph datasets demonstrate that CMV-ND significantly improves the performance of various methods.
中文: 深度图聚类旨在无监督地将属性图中的节点分组,所提出的CMV-ND方法通过递归邻域搜索和差分策略将结构信息预处理为互补的多视图表示,显著提升了各种方法的性能。
English: Deep graph clustering addresses unsupervised node grouping in attribute graphs, with the proposed CMV-ND method enhancing performance by preprocessing structural information into complementary multi-view representations through recursive neighborhood search and differential strategies.
Authors:Jiadong Chen, Hengyu Ye, Fuxin Jiang, Xiao He, Tieying Zhang, Jianjun Chen, Xiaofeng Gao
Abstract:
Workload forecasting is pivotal in cloud service applications, such as auto-scaling and scheduling, with profound implications for operational efficiency. Although Transformer-based forecasting models have demonstrated remarkable success in general tasks, their computational efficiency often falls short of the stringent requirements in large-scale cloud environments. Given that most workload series exhibit complicated periodic patterns, addressing these challenges in the frequency domain offers substantial advantages. To this end, we propose Fremer, an efficient and effective deep forecasting model. Fremer fulfills three critical requirements: it demonstrates superior efficiency, outperforming most Transformer-based forecasting models; it achieves exceptional accuracy, surpassing all state-of-the-art (SOTA) models in workload forecasting; and it exhibits robust performance for multi-period series. Furthermore, we collect and open-source four high-quality, open-source workload datasets derived from ByteDance's cloud services, encompassing workload data from thousands of computing instances. Extensive experiments on both our proprietary datasets and public benchmarks demonstrate that Fremer consistently outperforms baseline models, achieving average improvements of 5.5% in MSE, 4.7% in MAE, and 8.6% in SMAPE over SOTA models, while simultaneously reducing parameter scale and computational costs. Additionally, in a proactive auto-scaling test based on Kubernetes, Fremer improves average latency by 18.78% and reduces resource consumption by 2.35%, underscoring its practical efficacy in real-world applications.
中文: Fremer是一种高效的深度预测模型,在云工作负载预测中不仅精度超越现有最优模型,还显著提升了计算效率,并在实际应用中表现出强大的性能。
English: Fremer is an efficient deep forecasting model that surpasses state-of-the-art models in accuracy and computational efficiency for cloud workload forecasting, demonstrating robust performance in real-world applications.
Authors:Huakang Chen, Yuepeng Jiang, Guobin Ma, Chunbo Hao, Shuai Wang, Jixun Yao, Ziqian Ning, Meng Meng, Jian Luan, Lei Xie
Abstract:
Songs, as a central form of musical art, exemplify the richness of human intelligence and creativity. While recent advances in generative modeling have enabled notable progress in long-form song generation, current systems for full-length song synthesis still face major challenges, including data imbalance, insufficient controllability, and inconsistent musical quality. DiffRhythm, a pioneering diffusion-based model, advanced the field by generating full-length songs with expressive vocals and accompaniment. However, its performance was constrained by an unbalanced model training dataset and limited controllability over musical style, resulting in noticeable quality disparities and restricted creative flexibility. To address these limitations, we propose DiffRhythm+, an enhanced diffusion-based framework for controllable and flexible full-length song generation. DiffRhythm+ leverages a substantially expanded and balanced training dataset to mitigate issues such as repetition and omission of lyrics, while also fostering the emergence of richer musical skills and expressiveness. The framework introduces a multi-modal style conditioning strategy, enabling users to precisely specify musical styles through both descriptive text and reference audio, thereby significantly enhancing creative control and diversity. We further introduce direct performance optimization aligned with user preferences, guiding the model toward consistently preferred outputs across evaluation metrics. Extensive experiments demonstrate that DiffRhythm+ achieves significant improvements in naturalness, arrangement complexity, and listener satisfaction over previous systems.
中文摘要:DiffRhythm+是一种增强的基于扩散的框架,通过采用平衡的训练数据集和多模态风格调节策略,解决了全长歌曲生成中的质量不一致和可控性不足问题,显著提升了自然度、编曲复杂度和听众满意度。
English Summary: DiffRhythm+ is an enhanced diffusion-based framework that addresses previous limitations in full-length song generation by using a balanced training dataset and multi-modal style conditioning to improve musical quality, controllability, and listener satisfaction.
Authors:Yifei Zhou, Xuchu Huang, Chenyu Ni, Min Zhou, Zheyu Yan, Xunzhao Yin, Cheng Zhuo
Abstract:
Neuro-symbolic artificial intelligence (neuro-symbolic AI) excels in logical analysis and reasoning. Hyperdimensional Computing (HDC), a promising brain-inspired computational model, is integral to neuro-symbolic AI. Various HDC models have been proposed to represent class-instance and class-class relations, but when representing the more complex class-subclass relation, where multiple objects associate different levels of classes and subclasses, they face challenges for factorization, a crucial task for neuro-symbolic AI systems. In this article, we propose FactorHD, a novel HDC model capable of representing and factorizing the complex class-subclass relation efficiently. FactorHD features a symbolic encoding method that embeds an extra memorization clause, preserving more information for multiple objects. In addition, it employs an efficient factorization algorithm that selectively eliminates redundant classes by identifying the memorization clause of the target class. Such model significantly enhances computing efficiency and accuracy in representing and factorizing multiple objects with class-subclass relation, overcoming limitations of existing HDC models such as "superposition catastrophe" and "the problem of 2". Evaluations show that FactorHD achieves approximately 5667x speedup at a representation size of 10^9 compared to existing HDC models. When integrated with the ResNet-18 neural network, FactorHD achieves 92.48% factorization accuracy on the Cifar-10 dataset.
Chinese: 本文提出FactorHD模型,这是一种新型超维度计算模型,通过带记忆子句的符号编码和选择性消除算法,能高效表示和分解复杂的类-子类关系,在神经符号人工智能系统中实现了显著的速度提升和高精度。
English: This paper introduces FactorHD, a novel hyperdimensional computing model that efficiently represents and factorizes complex class-subclass relations through symbolic encoding with memorization clauses and selective elimination algorithms, achieving significant speed improvements and high accuracy in neuro-symbolic AI systems.
Authors:Jiawei Xu, Kai Deng, Zexin Fan, Shenlong Wang, Jin Xie, Jian Yang
Abstract:
Modeling and rendering dynamic urban driving scenes is crucial for self-driving simulation. Current high-quality methods typically rely on costly manual object tracklet annotations, while self-supervised approaches fail to capture dynamic object motions accurately and decompose scenes properly, resulting in rendering artifacts. We introduce AD-GS, a novel self-supervised framework for high-quality free-viewpoint rendering of driving scenes from a single log. At its core is a novel learnable motion model that integrates locality-aware B-spline curves with global-aware trigonometric functions, enabling flexible yet precise dynamic object modeling. Rather than requiring comprehensive semantic labeling, AD-GS automatically segments scenes into objects and background with the simplified pseudo 2D segmentation, representing objects using dynamic Gaussians and bidirectional temporal visibility masks. Further, our model incorporates visibility reasoning and physically rigid regularization to enhance robustness. Extensive evaluations demonstrate that our annotation-free model significantly outperforms current state-of-the-art annotation-free methods and is competitive with annotation-dependent approaches.
中文: AD-GS是一种新型自监督框架,通过结合可学习运动模型和自动场景分解,实现了动态驾驶场景的高质量自由视角渲染,其性能显著优于现有无标注方法,并能与依赖标注的方法相媲美。
English: AD-GS is a novel self-supervised framework that enables high-quality free-viewpoint rendering of dynamic driving scenes by integrating a learnable motion model and automatic scene decomposition, outperforming existing annotation-free methods and competing with annotation-dependent approaches.
Authors:Keke Gai, Haochen Liang, Jing Yu, Liehuang Zhu, Dusit Niyato
Abstract:
Smart contracts play a pivotal role in blockchain ecosystems, and fuzzing remains an important approach to securing smart contracts. Even though mutation scheduling is a key factor influencing fuzzing effectiveness, existing fuzzers have primarily explored seed scheduling and generation, while mutation scheduling has been rarely addressed by prior work. In this work, we propose a Large Language Models (LLMs)-based Multi-feedback Smart Contract Fuzzing framework (LLAMA) that integrates LLMs, evolutionary mutation strategies, and hybrid testing techniques. Key components of the proposed LLAMA include: (i) a hierarchical prompting strategy that guides LLMs to generate semantically valid initial seeds, coupled with a lightweight pre-fuzzing phase to select high-potential inputs; (ii) a multi-feedback optimization mechanism that simultaneously improves seed generation, seed selection, and mutation scheduling by leveraging runtime coverage and dependency feedback; and (iii) an evolutionary fuzzing engine that dynamically adjusts mutation operator probabilities based on effectiveness, while incorporating symbolic execution to escape stagnation and uncover deeper vulnerabilities. Our experiments demonstrate that LLAMA outperforms state-of-the-art fuzzers in both coverage and vulnerability detection. Specifically, it achieves 91% instruction coverage and 90% branch coverage, while detecting 132 out of 148 known vulnerabilities across diverse categories. These results highlight LLAMA's effectiveness, adaptability, and practicality in real-world smart contract security testing scenarios.
中文: 本文提出的LLAMA框架融合大语言模型与进化变异策略,通过多层次反馈优化机制显著提升了智能合约模糊测试的代码覆盖率和漏洞检测能力,实验证明其性能优于现有最优方法。
English: This paper introduces LLAMA, an advanced smart contract fuzzing framework that leverages Large Language Models, evolutionary mutation strategies, and hybrid testing to significantly enhance code coverage and vulnerability detection beyond current state-of-the-art methods.
Authors:Xinhang Wan, Jiyuan Liu, Qian Qu, Suyuan Liu, Chuyu Zhang, Fangdi Wang, Xinwang Liu, En Zhu, Kunlun He
Abstract:
In this paper, we address the problem of novel class discovery (NCD), which aims to cluster novel classes by leveraging knowledge from disjoint known classes. While recent advances have made significant progress in this area, existing NCD methods face two major limitations. First, they primarily focus on single-view data (e.g., images), overlooking the increasingly common multi-view data, such as multi-omics datasets used in disease diagnosis. Second, their reliance on pseudo-labels to supervise novel class clustering often results in unstable performance, as pseudo-label quality is highly sensitive to factors such as data noise and feature dimensionality. To address these challenges, we propose a novel framework named Intra-view and Inter-view Correlation Guided Multi-view Novel Class Discovery (IICMVNCD), which is the first attempt to explore NCD in multi-view setting so far. Specifically, at the intra-view level, leveraging the distributional similarity between known and novel classes, we employ matrix factorization to decompose features into view-specific shared base matrices and factor matrices. The base matrices capture distributional consistency among the two datasets, while the factor matrices model pairwise relationships between samples. At the inter-view level, we utilize view relationships among known classes to guide the clustering of novel classes. This includes generating predicted labels through the weighted fusion of factor matrices and dynamically adjusting view weights of known classes based on the supervision loss, which are then transferred to novel class learning. Experimental results validate the effectiveness of our proposed approach.
中文: 本文提出首个多视图新类发现框架IICMVNCD,通过矩阵分解捕捉视图内分布一致性,并利用已知类视图间关系自适应指导新类聚类,有效解决了传统方法对多视图数据适应性不足和伪标签依赖问题。
English: This paper introduces IICMVNCD, the first framework addressing novel class discovery for multi-view data by leveraging intra-view distributional consistency through matrix factorization and inter-view relationships via adaptive weight transfer from known to novel classes.
Authors:Xiang Yin, Nico Potyka, Antonio Rago, Timotheus Kampik, Francesca Toni
Abstract:
Contestable AI requires that AI-driven decisions align with human preferences. While various forms of argumentation have been shown to support contestability, Edge-Weighted Quantitative Bipolar Argumentation Frameworks (EW-QBAFs) have received little attention. In this work, we show how EW-QBAFs can be deployed for this purpose. Specifically, we introduce the contestability problem for EW-QBAFs, which asks how to modify edge weights (e.g., preferences) to achieve a desired strength for a specific argument of interest (i.e., a topic argument). To address this problem, we propose gradient-based relation attribution explanations (G-RAEs), which quantify the sensitivity of the topic argument's strength to changes in individual edge weights, thus providing interpretable guidance for weight adjustments towards contestability. Building on G-RAEs, we develop an iterative algorithm that progressively adjusts the edge weights to attain the desired strength. We evaluate our approach experimentally on synthetic EW-QBAFs that simulate the structural characteristics of personalised recommender systems and multi-layer perceptrons, and demonstrate that it can solve the problem effectively.
中文摘要:本研究通过开发基于梯度的关系归因解释方法和迭代算法,展示了边缘加权定量双极论证框架如何通过有效调整边权重来实现预期论证强度,从而增强人工智能的可争议性。
English Summary: This study demonstrates how Edge-Weighted Quantitative Bipolar Argumentation Frameworks (EW-QBAFs) can enhance AI contestability by developing gradient-based relation attribution explanations and an iterative algorithm that effectively adjusts edge weights to achieve desired argument strengths.
Authors:Gabriele Formis, Amanda Ericson, Stefan Forsstrom, Kyi Thar, Gianluca Cena, Stefano Scanzio
Abstract:
The increasing need for robustness, reliability, and determinism in wireless networks for industrial and mission-critical applications is the driver for the growth of new innovative methods. The study presented in this work makes use of machine learning techniques to predict channel quality in a Wi-Fi network in terms of the frame delivery ratio. Predictions can be used proactively to adjust communication parameters at runtime and optimize network operations for industrial applications. Methods including convolutional neural networks and long short-term memory were analyzed on datasets acquired from a real Wi-Fi setup across multiple channels. The models were compared in terms of prediction accuracy and computational complexity. Results show that the frame delivery ratio can be reliably predicted, and convolutional neural networks, although slightly less effective than other models, are more efficient in terms of CPU usage and memory consumption. This enhances the model's usability on embedded and industrial systems.
中文: 本研究采用机器学习技术,包括卷积神经网络和长短期记忆网络,通过帧传输率预测Wi-Fi信道质量,可为工业应用实现主动网络优化,其中卷积神经网络在嵌入式系统中展现出最佳效率。
English: This study employs machine learning techniques, including convolutional neural networks and long short-term memory, to predict Wi-Fi channel quality by frame delivery ratio, enabling proactive network optimization for industrial applications with convolutional neural networks proving most efficient for embedded systems.
Authors:Zongtao He, Liuyi Wang, Lu Chen, Chengju Liu, Qijun Chen
Abstract:
Language-guided navigation is a cornerstone of embodied AI, enabling agents to interpret language instructions and navigate complex environments. However, expert-provided instructions are limited in quantity, while synthesized annotations often lack quality, making them insufficient for large-scale research. To address this, we propose NavComposer, a novel framework for automatically generating high-quality navigation instructions. NavComposer explicitly decomposes semantic entities such as actions, scenes, and objects, and recomposes them into natural language instructions. Its modular architecture allows flexible integration of state-of-the-art techniques, while the explicit use of semantic entities enhances both the richness and accuracy of instructions. Moreover, it operates in a data-agnostic manner, supporting adaptation to diverse navigation trajectories without domain-specific training. Complementing NavComposer, we introduce NavInstrCritic, a comprehensive annotation-free evaluation system that assesses navigation instructions on three dimensions: contrastive matching, semantic consistency, and linguistic diversity. NavInstrCritic provides a holistic evaluation of instruction quality, addressing limitations of traditional metrics that rely heavily on expert annotations. By decoupling instruction generation and evaluation from specific navigation agents, our method enables more scalable and generalizable research. Extensive experiments provide direct and practical evidence for the effectiveness of our method.
中文: NavComposer通过分解和重组语义实体来自动生成高质量导航指令,而NavInstrCritic则提供了一个无需标注的评估系统,从多个维度全面评估指令质量。
English: NavComposer is a novel framework that automatically generates high-quality navigation instructions by decomposing and recomposing semantic entities, while NavInstrCritic provides an annotation-free evaluation system to assess instruction quality across multiple dimensions.
Authors:Yaowen Hu, Wenxuan Tu, Yue Liu, Miaomiao Li, Wenpeng Lu, Zhigang Luo, Xinwang Liu, Ping Chen
Abstract:
Deep graph clustering (DGC) for attribute-missing graphs is an unsupervised task aimed at partitioning nodes with incomplete attributes into distinct clusters. Addressing this challenging issue is vital for practical applications. However, research in this area remains underexplored. Existing imputation methods for attribute-missing graphs often fail to account for the varying amounts of information available across node neighborhoods, leading to unreliable results, especially for nodes with insufficient known neighborhood. To address this issue, we propose a novel method named Divide-Then-Rule Graph Completion (DTRGC). This method first addresses nodes with sufficient known neighborhood information and treats the imputed results as new knowledge to iteratively impute more challenging nodes, while leveraging clustering information to correct imputation errors. Specifically, Dynamic Cluster-Aware Feature Propagation (DCFP) initializes missing node attributes by adjusting propagation weights based on the clustering structure. Subsequently, Hierarchical Neighborhood-aware Imputation (HNAI) categorizes attribute-missing nodes into three groups based on the completeness of their neighborhood attributes. The imputation is performed hierarchically, prioritizing the groups with nodes that have the most available neighborhood information. The cluster structure is then used to refine the imputation and correct potential errors. Finally, Hop-wise Representation Enhancement (HRE) integrates information across multiple hops, thereby enriching the expressiveness of node representations. Experimental results on six widely used graph datasets show that DTRGC significantly improves the clustering performance of various DGC methods under attribute-missing graphs.
中文摘要:提出的Divide-Then-Rule图补全方法通过基于聚类结构和邻域完整性的分层补全策略,有效处理属性缺失图的节点聚类问题,在六个基准数据集上显著提升了深度图聚类性能。
English Summary: The proposed Divide-Then-Rule Graph Completion (DTRGC) method addresses attribute-missing graphs by hierarchically imputing missing node attributes using clustering information and neighborhood completeness, significantly improving deep graph clustering performance across six benchmark datasets.
Authors:Daniil Orel, Indraneil Paul, Iryna Gurevych, Preslav Nakov
Abstract:
In this work, we compile $\textbf{$\texttt{DroidCollection}$}$, the most extensive open data suite for training and evaluating machine-generated code detectors, comprising over a million code samples, seven programming languages, outputs from 43 coding models, and over three real-world coding domains. Alongside fully AI-generated samples, our collection includes human-AI co-authored code, as well as adversarial samples explicitly crafted to evade detection. Subsequently, we develop $\textbf{$\texttt{DroidDetect}$}$, a suite of encoder-only detectors trained using a multi-task objective over $\texttt{DroidCollection}$. Our experiments show that existing detectors' performance fails to generalise to diverse coding domains and programming languages outside of their narrow training data. Additionally, we demonstrate that while most detectors are easily compromised by humanising the output distributions using superficial prompting and alignment approaches, this problem can be easily amended by training on a small amount of adversarial data. Finally, we demonstrate the effectiveness of metric learning and uncertainty-based resampling as means to enhance detector training on possibly noisy distributions.
中文: 本研究推出了DroidCollection,这是用于检测机器生成代码的最大开源数据集,包含跨多种语言和模型的百万余样本,并开发了DroidDetect检测器套件,通过多任务训练和数据增强解决了现有检测器泛化能力不足和对抗性规避问题,显著提升了检测性能。
English: This study introduces DroidCollection, the largest open dataset for detecting machine-generated code, featuring over a million samples across multiple languages and models, and DroidDetect, a detector suite that outperforms existing methods by addressing generalization issues and adversarial evasion through multi-task training and data enhancements.
Authors:Anna Rapberger, Fabrizio Russo, Antonio Rago, Francesca Toni
Abstract:
In computational argumentation, gradual semantics are fine-grained alternatives to extension-based and labelling-based semantics . They ascribe a dialectical strength to (components of) arguments sanctioning their degree of acceptability. Several gradual semantics have been studied for abstract, bipolar and quantitative bipolar argumentation frameworks (QBAFs), as well as, to a lesser extent, for some forms of structured argumentation. However, this has not been the case for assumption-based argumentation (ABA), despite it being a popular form of structured argumentation with several applications where gradual semantics could be useful. In this paper, we fill this gap and propose a family of novel gradual semantics for equipping assumptions, which are the core components in ABA frameworks, with dialectical strengths. To do so, we use bipolar set-based argumentation frameworks as an abstraction of (potentially non-flat) ABA frameworks and generalise state-of-the-art modular gradual semantics for QBAFs. We show that our gradual ABA semantics satisfy suitable adaptations of desirable properties of gradual QBAF semantics, such as balance and monotonicity. We also explore an argument-based approach that leverages established QBAF modular semantics directly, and use it as baseline. Finally, we conduct experiments with synthetic ABA frameworks to compare our gradual ABA semantics with its argument-based counterpart and assess convergence.
Chinese: 本文为基于假设的论辩(ABA)提出了一系列新颖的渐进语义,通过量化双极论辩框架(QBAF)为假设赋予辩证强度,并验证了其满足平衡性和单调性等理想特性。
English: This paper introduces a novel family of gradual semantics for assumption-based argumentation (ABA), adapting modular quantitative bipolar argumentation frameworks (QBAFs) to assign dialectical strengths to assumptions and demonstrating their adherence to properties like balance and monotonicity.
Authors:Pengyu Liu, Kun Li, Fei Wang, Yanyan Wei, Junhui She, Dan Guo
Abstract:
In this paper, we introduce the latest solution developed by our team, HFUT-VUT, for the Micro-gesture Online Recognition track of the IJCAI 2025 MiGA Challenge. The Micro-gesture Online Recognition task is a highly challenging problem that aims to locate the temporal positions and recognize the categories of multiple micro-gesture instances in untrimmed videos. Compared to traditional temporal action detection, this task places greater emphasis on distinguishing between micro-gesture categories and precisely identifying the start and end times of each instance. Moreover, micro-gestures are typically spontaneous human actions, with greater differences than those found in other human actions. To address these challenges, we propose hand-crafted data augmentation and spatial-temporal attention to enhance the model's ability to classify and localize micro-gestures more accurately. Our solution achieved an F1 score of 38.03, outperforming the previous state-of-the-art by 37.9%. As a result, our method ranked first in the Micro-gesture Online Recognition track.
Chinese: 我们HFUT-VUT团队针对IJCAI 2025 MiGA挑战赛提出的解决方案,通过手工数据增强和时空注意力机制,在未剪辑视频中显著提升了微手势分类与定位能力,以38.03的F1分数荣获该赛道第一名。
English: Our HFUT-VUT team's solution for the IJCAI 2025 MiGA Challenge, featuring hand-crafted data augmentation and spatial-temporal attention, achieved a top-ranking F1 score of 38.03 by improving micro-gesture classification and localization in untrimmed videos.
Authors:Haoyu Wu, Diankun Wu, Tianyu He, Junliang Guo, Yang Ye, Yueqi Duan, Jiang Bian
Abstract:
Videos inherently represent 2D projections of a dynamic 3D world. However, our analysis suggests that video diffusion models trained solely on raw video data often fail to capture meaningful geometric-aware structure in their learned representations. To bridge this gap between video diffusion models and the underlying 3D nature of the physical world, we propose Geometry Forcing, a simple yet effective method that encourages video diffusion models to internalize latent 3D representations. Our key insight is to guide the model's intermediate representations toward geometry-aware structure by aligning them with features from a pretrained geometric foundation model. To this end, we introduce two complementary alignment objectives: Angular Alignment, which enforces directional consistency via cosine similarity, and Scale Alignment, which preserves scale-related information by regressing unnormalized geometric features from normalized diffusion representation. We evaluate Geometry Forcing on both camera view-conditioned and action-conditioned video generation tasks. Experimental results demonstrate that our method substantially improves visual quality and 3D consistency over the baseline methods. Project page: https://GeometryForcing.github.io.
中文: 视频扩散模型常缺乏几何感知能力,因此我们提出几何强制方法,将其表示与预训练的几何模型对齐,显著提升了视频生成的3D一致性和视觉质量。
English: Video diffusion models often lack geometric awareness, so we propose Geometry Forcing to align their representations with a pretrained geometric model, significantly improving 3D consistency and visual quality in video generation.
Authors:Mikey Elmers, Koji Inoue, Divesh Lala, Tatsuya Kawahara
Abstract:
Turn-taking is a fundamental component of spoken dialogue, however conventional studies mostly involve dyadic settings. This work focuses on applying voice activity projection (VAP) to predict upcoming turn-taking in triadic multi-party scenarios. The goal of VAP models is to predict the future voice activity for each speaker utilizing only acoustic data. This is the first study to extend VAP into triadic conversation. We trained multiple models on a Japanese triadic dataset where participants discussed a variety of topics. We found that the VAP trained on triadic conversation outperformed the baseline for all models but that the type of conversation affected the accuracy. This study establishes that VAP can be used for turn-taking in triadic dialogue scenarios. Future work will incorporate this triadic VAP turn-taking model into spoken dialogue systems.
Chinese: 本研究首次将语音活动预测(VAP)应用于三方对话,通过在日语多人对话数据集上的训练验证了其在话轮转换预测中的有效性,但发现对话类型会影响预测准确率。
English: This study pioneers the application of voice activity projection (VAP) to triadic conversations, demonstrating its effectiveness in predicting turn-taking through training on a Japanese multi-party dialogue dataset, though accuracy varies with conversation type.
Authors:Mikey Elmers, Koji Inoue, Divesh Lala, Tatsuya Kawahara
Abstract:
Turn-taking is a fundamental component of spoken dialogue, however conventional studies mostly involve dyadic settings. This work focuses on applying voice activity projection (VAP) to predict upcoming turn-taking in triadic multi-party scenarios. The goal of VAP models is to predict the future voice activity for each speaker utilizing only acoustic data. This is the first study to extend VAP into triadic conversation. We trained multiple models on a Japanese triadic dataset where participants discussed a variety of topics. We found that the VAP trained on triadic conversation outperformed the baseline for all models but that the type of conversation affected the accuracy. This study establishes that VAP can be used for turn-taking in triadic dialogue scenarios. Future work will incorporate this triadic VAP turn-taking model into spoken dialogue systems.
Chinese: 本研究首次将语音活动预测(VAP)应用于三方对话,通过在日语多人对话数据集上的训练验证了其在话轮转换预测中的有效性,但发现对话类型会影响预测准确率。
English: This study pioneers the application of voice activity projection (VAP) to triadic conversations, demonstrating its effectiveness in predicting turn-taking through training on a Japanese multi-party dialogue dataset, though accuracy varies with conversation type.
Authors:Zongwei Wang, Min Gao, Junliang Yu, Shazia Sadiq, Hongzhi Yin, Ling Liu
Abstract:
Graph Contrastive Learning (GCL) has demonstrated substantial promise in enhancing the robustness and generalization of recommender systems, particularly by enabling models to leverage large-scale unlabeled data for improved representation learning. However, in this paper, we reveal an unexpected vulnerability: the integration of GCL inadvertently increases the susceptibility of a recommender to targeted promotion attacks. Through both theoretical investigation and empirical validation, we identify the root cause as the spectral smoothing effect induced by contrastive optimization, which disperses item embeddings across the representation space and unintentionally enhances the exposure of target items. Building on this insight, we introduce CLeaR, a bi-level optimization attack method that deliberately amplifies spectral smoothness, enabling a systematic investigation of the susceptibility of GCL-based recommendation models to targeted promotion attacks. Our findings highlight the urgent need for robust countermeasures; in response, we further propose SIM, a spectral irregularity mitigation framework designed to accurately detect and suppress targeted items without compromising model performance. Extensive experiments on multiple benchmark datasets demonstrate that, compared to existing targeted promotion attacks, GCL-based recommendation models exhibit greater susceptibility when evaluated with CLeaR, while SIM effectively mitigates these vulnerabilities.
中文: 图对比学习虽能增强推荐系统的鲁棒性,却因谱平滑效应增加了对定向推广攻击的敏感性,为此我们提出了CLeaR攻击方法和SIM防御框架。
English: Graph Contrastive Learning enhances recommender systems but increases vulnerability to targeted promotion attacks due to spectral smoothing, leading to the development of the CLeaR attack method and the SIM mitigation framework.
Authors:Ruiqiang Li, Brian Yecies, Qin Wang, Shiping Chen, Jun Shen
Abstract:
Non-Fungible Tokens (NFTs) offer a promising mechanism to protect Australian and Indigenous artists' copyright. They represent and transfer the value of artwork in digital form. Before adopting NFTs to protect Australian artwork, we in this paper investigate them empericially. We focus on examining the details of NFT structure. We start from the underlying structure of NFTs to show how they represent copyright for both artists and production owners, as well as how they aim to safeguard or secure the value of digital artworks. We then involve data collection from various types of sources with different storage methods, including on-chain, centralized, and decentralized systems. Based on both metadata and artwork content, we present our analysis and discussion on the following key issues: copyright, security and artist identification. The final results of the evaluation, unfortnately, show that the NFT is NOT ready to protect Australian and Indigenous artists' copyright.
中文: 本研究通过结构分析和多源数据评估,实证检验了NFT保护澳大利亚及原住民艺术家版权的潜力,最终结论表明NFT目前尚未具备此项能力。
English: This study empirically examines NFTs' potential to protect Australian and Indigenous artists' copyright through structural analysis and multi-source data evaluation, concluding that NFTs are currently not ready for this purpose.
Authors:Qingsen Yan, Kangbiao Shi, Yixu Feng, Tao Hu, Peng Wu, Guansong Pang, Yanning Zhang
Abstract:
Low-Light Image Enhancement (LLIE) aims to restore vivid content and details from corrupted low-light images. However, existing standard RGB (sRGB) color space-based LLIE methods often produce color bias and brightness artifacts due to the inherent high color sensitivity. While Hue, Saturation, and Value (HSV) color space can decouple brightness and color, it introduces significant red and black noise artifacts. To address this problem, we propose a new color space for LLIE, namely Horizontal/Vertical-Intensity (HVI), defined by the HV color map and learnable intensity. The HV color map enforces small distances for the red coordinates to remove red noise artifacts, while the learnable intensity compresses the low-light regions to remove black noise artifacts. Additionally, we introduce the Color and Intensity Decoupling Network+ (HVI-CIDNet+), built upon the HVI color space, to restore damaged content and mitigate color distortion in extremely dark regions. Specifically, HVI-CIDNet+ leverages abundant contextual and degraded knowledge extracted from low-light images using pre-trained vision-language models, integrated via a novel Prior-guided Attention Block (PAB). Within the PAB, latent semantic priors can promote content restoration, while degraded representations guide precise color correction, both particularly in extremely dark regions through the meticulously designed cross-attention fusion mechanism. Furthermore, we construct a Region Refinement Block that employs convolution for information-rich regions and self-attention for information-scarce regions, ensuring accurate brightness adjustments. Comprehensive results from benchmark experiments demonstrate that the proposed HVI-CIDNet+ outperforms the state-of-the-art methods on 10 datasets.
中文摘要:提出的HVI色彩空间和HVI-CIDNet+模型能有效消除低光图像增强中的红黑噪点伪影,在极暗区域恢复内容并校正色彩,在多个数据集上实现了最优性能。
English Summary: The proposed HVI color space and HVI-CIDNet+ model effectively eliminate red and black noise artifacts in low-light image enhancement while restoring content and correcting colors, particularly in extremely dark regions, achieving state-of-the-art performance across multiple datasets.
Authors:Farahdiba Zarin, Riccardo Oliva, Vinkle Srivastav, Armine Vardazaryan, Andrea Rosati, Alice Zampolini Faustini, Giovanni Scambia, Anna Fagotti, Pietro Mascagni, Nicolas Padoy
Abstract:
Learning from sparse labels is a challenge commonplace in the medical domain. This is due to numerous factors, such as annotation cost, and is especially true for newly introduced tasks. When dense pixel-level annotations are needed, this becomes even more unfeasible. However, being able to learn from just a few annotations at the pixel-level, while extremely difficult and underutilized, can drive progress in studies where perfect annotations are not immediately available. This work tackles the challenge of learning the dense prediction task of keypoint localization from a few point annotations in the context of 2d carcinosis keypoint localization from laparoscopic video frames for diagnostic planning of advanced ovarian cancer patients. To enable this, we formulate the problem as a sparse heatmap regression from a few point annotations per image and propose a new loss function, called Crag and Tail loss, for efficient learning. Our proposed loss function effectively leverages positive sparse labels while minimizing the impact of false negatives or missed annotations. Through an extensive ablation study, we demonstrate the effectiveness of our approach in achieving accurate dense localization of carcinosis keypoints, highlighting its potential to advance research in scenarios where dense annotations are challenging to obtain.
中文摘要:本研究提出Crag and Tail损失函数,通过稀疏热图回归方法解决医学影像中标注稀疏的难题,在仅需少量像素级标注的情况下实现了卵巢癌腹膜转移关键点的精确定位。
English Summary: This study introduces the Crag and Tail loss function to address sparse label challenges in medical imaging, enabling accurate dense keypoint localization for ovarian cancer diagnosis from limited pixel-level annotations.
Authors:Yang Chen, Yueqi Duan, Haowen Sun, Ziwei Wang, Jiwen Lu, Yap-Peng Tan
Abstract:
In this paper, we propose view-dependent projection (VDP) to facilitate point cloud segmentation, designing efficient 3D-to-2D mapping that dynamically adapts to the spatial geometry from view variations. Existing projection-based methods leverage view-independent projection in complex scenes, relying on straight lines to generate direct rays or upward curves to reduce occlusions. However, their view independence provides projection rays that are limited to pre-defined parameters by human settings, restricting point awareness and failing to capture sufficient projection diversity across different view planes. Although multiple projections per view plane are commonly used to enhance spatial variety, the projected redundancy leads to excessive computational overhead and inefficiency in image processing. To address these limitations, we design a framework of VDP to generate data-driven projections from 3D point distributions, producing highly informative single-image inputs by predicting rays inspired by the adaptive behavior of fireworks. In addition, we construct color regularization to optimize the framework, which emphasizes essential features within semantic pixels and suppresses the non-semantic features within black pixels, thereby maximizing 2D space utilization in a projected image. As a result, our approach, PointVDP, develops lightweight projections in marginal computation costs. Experiments on S3DIS and ScanNet benchmarks show that our approach achieves competitive results, offering a resource-efficient solution for semantic understanding.
中文: 本文提出视点依赖投影(VDP)方法,通过数据驱动的投影和色彩正则化技术,动态生成适应点云空间几何的3D-2D映射,在保证语义理解性能的同时显著提升了计算效率。
English: This paper introduces view-dependent projection (VDP) to dynamically generate adaptive 3D-to-2D mappings for point cloud segmentation, optimizing computational efficiency and feature representation through data-driven projections and color regularization.
Authors:Yucheng Sheng, Jiacheng Wang, Xingyu Zhou, Le Liang, Hao Ye, Shi Jin, Geoffrey Ye Li
Abstract:
With the growing complexity and dynamics of the mobile communication networks, accurately predicting key system parameters, such as channel state information (CSI), user location, and network traffic, has become essential for a wide range of physical (PHY)-layer and medium access control (MAC)-layer tasks. Although traditional deep learning (DL)-based methods have been widely applied to such prediction tasks, they often struggle to generalize across different scenarios and tasks. In response, we propose a unified foundation model for multi-task prediction in wireless networks that supports diverse prediction intervals. The proposed model enforces univariate decomposition to unify heterogeneous tasks, encodes granularity for interval awareness, and uses a causal Transformer backbone for accurate predictions. Additionally, we introduce a patch masking strategy during training to support arbitrary input lengths. After trained on large-scale datasets, the proposed foundation model demonstrates strong generalization to unseen scenarios and achieves zero-shot performance on new tasks that surpass traditional full-shot baselines.
中文: 本文提出了一种用于无线网络多任务预测的统一基础模型,通过单变量分解、粒度编码和因果Transformer实现强泛化能力,并在新任务上达到零样本性能。
English: This paper introduces a unified foundation model for multi-task prediction in wireless networks, which employs univariate decomposition, granularity encoding, and a causal Transformer to achieve strong generalization and zero-shot performance on new tasks.
Authors:Yuxin Zhang, Jiahao Yang, Zhe Chen, Wenjun Zhu, Jin Zhao, Yue Gao
Abstract:
Recently, large vision-language models (LVLMs) unleash powerful analysis capabilities for low Earth orbit (LEO) satellite Earth observation images in the data center. However, fast satellite motion, brief satellite-ground station (GS) contact windows, and large size of the images pose a data download challenge. To enable near real-time Earth observation applications (e.g., disaster and extreme weather monitoring), we should explore how to deploy LVLM in LEO satellite networks, and design SpaceVerse, an efficient satellite-ground synergistic LVLM inference system. To this end, firstly, we deploy compact LVLMs on satellites for lightweight tasks, whereas regular LVLMs operate on GSs to handle computationally intensive tasks. Then, we propose a computing and communication co-design framework comprised of a progressive confidence network and an attention-based multi-scale preprocessing, used to identify on-satellite inferring data, and reduce data redundancy before satellite-GS transmission, separately. We implement and evaluate SpaceVerse on real-world LEO satellite constellations and datasets, achieving a 31.2% average gain in accuracy and a 51.2% reduction in latency compared to state-of-the-art baselines.
中文: SpaceVerse是一种高效的星地协同系统,通过在卫星和地面站分别部署精简版和标准版大视觉语言模型,显著提升了地球观测任务的准确性和响应速度。
English: SpaceVerse is an efficient satellite-ground synergistic system that deploys compact and regular large vision-language models on satellites and ground stations respectively, achieving significant improvements in accuracy and latency for Earth observation tasks.
Authors:Yuanzhe Hu, Yu Wang, Julian McAuley
Abstract:
Recent benchmarks for Large Language Model (LLM) agents primarily focus on evaluating reasoning, planning, and execution capabilities, while another critical component-memory, encompassing how agents memorize, update, and retrieve long-term information-is under-evaluated due to the lack of benchmarks. We term agents with memory mechanisms as memory agents. In this paper, we identify four core competencies essential for memory agents: accurate retrieval, test-time learning, long-range understanding, and conflict resolution. Existing datasets either rely on limited context lengths or are tailored for static, long-context settings like book-based QA, which do not reflect the interactive, multi-turn nature of memory agents that incrementally accumulate information. Furthermore, no existing benchmarks cover all four competencies. Therefore, we introduce MemoryAgentBench, a new benchmark specifically designed for memory agents. Our benchmark combines reformulated existing datasets with newly constructed ones, covering the above four memory competencies, providing a systematic and challenging testbed for assessing memory quality. We evaluate a diverse set of memory agents, ranging from simple context-based and retrieval-augmented generation (RAG) systems to advanced agents with external memory modules and tool integration. Empirical results reveal that current methods fall short of mastering all four competencies, underscoring the need for further research into comprehensive memory mechanisms for LLM agents.
中文: 现有大语言模型智能体基准对记忆能力的评估不足,因此我们推出了MemoryAgentBench,通过多轮交互全面评估四项核心记忆能力,实证表明现有方法仍存在明显缺陷。
English: Current benchmarks for LLM agents inadequately assess memory capabilities, prompting the introduction of MemoryAgentBench to evaluate four core memory competencies through multi-turn interactions, revealing significant gaps in existing methods.
Authors:Yuanzhe Hu, Yu Wang, Julian McAuley
Abstract:
Recent benchmarks for Large Language Model (LLM) agents primarily focus on evaluating reasoning, planning, and execution capabilities, while another critical component-memory, encompassing how agents memorize, update, and retrieve long-term information-is under-evaluated due to the lack of benchmarks. We term agents with memory mechanisms as memory agents. In this paper, based on classic theories from memory science and cognitive science, we identify four core competencies essential for memory agents: accurate retrieval, test-time learning, long-range understanding, and selective forgetting. Existing benchmarks either rely on limited context lengths or are tailored for static, long-context settings like book-based QA, which do not reflect the interactive, multi-turn nature of memory agents that incrementally accumulate information. Moreover, no existing benchmarks cover all four competencies. We introduce MemoryAgentBench, a new benchmark specifically designed for memory agents. Our benchmark transforms existing long-context datasets and incorporates newly constructed datasets into a multi-turn format, effectively simulating the incremental information processing characteristic of memory agents. By carefully selecting and curating datasets, our benchmark provides comprehensive coverage of the four core memory competencies outlined above, thereby offering a systematic and challenging testbed for assessing memory quality. We evaluate a diverse set of memory agents, ranging from simple context-based and retrieval-augmented generation (RAG) systems to advanced agents with external memory modules and tool integration. Empirical results reveal that current methods fall short of mastering all four competencies, underscoring the need for further research into comprehensive memory mechanisms for LLM agents.
中文: 现有大语言模型智能体基准对记忆能力的评估不足,因此我们推出了MemoryAgentBench,通过多轮交互全面评估四项核心记忆能力,实证表明现有方法仍存在明显缺陷。
English: Current benchmarks for LLM agents inadequately assess memory capabilities, prompting the introduction of MemoryAgentBench to evaluate four core memory competencies through multi-turn interactions, revealing significant gaps in existing methods.
Authors:Qika Lin, Fangzhi Xu, Hao Lu, Kai He, Rui Mao, Jun Liu, Erik Cambria, Mengling Feng
Abstract:
Knowledge Graph (KG) reasoning has received significant attention in the fields of artificial intelligence and knowledge engineering, owing to its ability to autonomously deduce new knowledge and consequently enhance the availability and precision of downstream applications. However, current methods predominantly concentrate on a single form of neural or symbolic reasoning, failing to effectively integrate the inherent strengths of both approaches. Furthermore, the current prevalent methods primarily focus on addressing a single reasoning scenario, presenting limitations in meeting the diverse demands of real-world reasoning tasks. Unifying the neural and symbolic methods, as well as diverse reasoning scenarios in one model is challenging as there is a natural representation gap between symbolic rules and neural networks, and diverse scenarios exhibit distinct knowledge structures and specific reasoning objectives. To address these issues, we propose a unified neurosymbolic reasoning framework, namely Tunsr, for KG reasoning. Tunsr first introduces a consistent structure of reasoning graph that starts from the query entity and constantly expands subsequent nodes by iteratively searching posterior neighbors. Based on it, a forward logic message-passing mechanism is proposed to update both the propositional representations and attentions, as well as first-order logic (FOL) representations and attentions of each node. In this way, Tunsr conducts the transformation of merging multiple rules by merging possible relations at each step. Finally, the FARI algorithm is proposed to induce FOL rules by constantly performing attention calculations over the reasoning graph. Extensive experimental results on 19 datasets of four reasoning scenarios (transductive, inductive, interpolation, and extrapolation) demonstrate the effectiveness of Tunsr.
中文: 知识图谱推理领域提出了Tunsr这一统一神经符号框架,它通过推理图和前向逻辑消息传递机制融合神经与符号方法,有效应对多种推理场景,并在19个数据集上验证了其优越性能。
English: Knowledge graph reasoning has advanced with the introduction of Tunsr, a unified neurosymbolic framework that integrates neural and symbolic methods to address diverse reasoning scenarios through a reasoning graph and forward logic message-passing, proving effective across 19 datasets.
Authors:Niki van Stein, Haoran Yin, Anna V. Kononova, Thomas Bäck, Gabriela Ochoa
Abstract:
We investigate the behaviour space of meta-heuristic optimisation algorithms automatically generated by Large Language Model driven algorithm discovery methods. Using the Large Language Evolutionary Algorithm (LLaMEA) framework with a GPT o4-mini LLM, we iteratively evolve black-box optimisation heuristics, evaluated on 10 functions from the BBOB benchmark suite. Six LLaMEA variants, featuring different mutation prompt strategies, are compared and analysed. We log dynamic behavioural metrics including exploration, exploitation, convergence and stagnation measures, for each run, and analyse these via visual projections and network-based representations. Our analysis combines behaviour-based
projections, Code Evolution Graphs built from static code features, performance convergence curves, and behaviour-based Search Trajectory Networks. The results reveal clear differences in search dynamics and algorithm structures across LLaMEA configurations. Notably, the variant that employs both a code simplification prompt and a random perturbation prompt in a 1+1 elitist evolution strategy, achieved the best performance, with the highest Area Over the Convergence Curve. Behaviour-space visualisations show that higher-performing algorithms exhibit more intensive exploitation behaviour and faster convergence with less stagnation. Our findings demonstrate how behaviour-space analysis can explain why certain LLM-designed heuristics outperform others and how LLM-driven algorithm discovery navigates the open-ended and complex search space of algorithms. These findings provide insights to guide the future design of adaptive LLM-driven algorithm generators.
中文: 本研究通过行为空间分析揭示了由大语言模型驱动的算法发现方法所生成的元启发式优化算法中,性能更优的变体表现出更强的开发能力、更快的收敛速度和更少的停滞现象,为未来自适应算法生成器的设计提供了重要见解。
English: This study analyzes the behavior space of meta-heuristic optimization algorithms generated by LLM-driven discovery methods, revealing that higher-performing variants exhibit more intensive exploitation and faster convergence with less stagnation, providing insights for future adaptive algorithm generators.
Authors:Samundra Karki, Ming-Chen Hsu, Adarsh Krishnamurthy, Baskar Ganapathysubramanian
Abstract:
Implicit Neural Representations (INRs), characterized by neural network-encoded signed distance fields, provide a powerful means to represent complex geometries continuously and efficiently. While successful in computer vision and generative modeling, integrating INRs into computational analysis workflows, such as finite element simulations, remains underdeveloped. In this work, we propose a computational framework that seamlessly combines INRs with the Shifted Boundary Method (SBM) for high-fidelity linear elasticity simulations without explicit geometry transformations. By directly querying the neural implicit geometry, we obtain the surrogate boundaries and distance vectors essential for SBM, effectively eliminating the meshing step. We demonstrate the efficacy and robustness of our approach through elasticity simulations on complex geometries (Stanford Bunny, Eiffel Tower, gyroids) sourced from triangle soups and point clouds. Our method showcases significant computational advantages and accuracy, underscoring its potential in biomedical, geophysical, and advanced manufacturing applications.
中文: 本研究提出了一种将隐式神经表示与移位边界法相结合的计算框架,可在复杂几何体上实现高保真线性弹性模拟,无需传统网格划分,并在多种应用中展现出卓越的鲁棒性。
English: This study introduces a computational framework integrating Implicit Neural Representations with the Shifted Boundary Method to enable high-fidelity linear elasticity simulations on complex geometries, eliminating traditional meshing while demonstrating robustness across diverse applications.
Authors:Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, Hao Zhou
Abstract:
Despite improvements by length extrapolation, efficient attention and memory modules, handling infinitely long documents with linear complexity without performance degradation during extrapolation remains the ultimate challenge in long-text processing. We directly optimize for long-text tasks in an end-to-end fashion and introduce a novel agent workflow, MemAgent, which reads text in segments and updates the memory using an overwrite strategy. We extend the DAPO algorithm to facilitate training via independent-context multi-conversation generation. MemAgent has demonstrated superb long-context capabilities, being able to extrapolate from an 8K context trained on 32K text to a 3.5M QA task with performance loss < 5% and achieves 95%+ in 512K RULER test.
中文摘要:MemAgent通过分段阅读与记忆覆盖策略的端到端优化,实现了从8K上下文训练向3.5M问答任务的外推性能损失低于5%,并在512K基准测试中达到95%以上的优异表现。
English Summary: MemAgent is an end-to-end optimized agent workflow that processes long texts in segments with memory overwriting, achieving minimal performance loss when extrapolating from 8K to 3.5M contexts and excelling in 512K benchmarks.
Authors:Tianshi Cao, Marie-Julie Rakotosaona, Ben Poole, Federico Tombari, Michael Niemeyer
Abstract:
We present Image2GS, a novel approach that addresses the challenging problem of reconstructing photorealistic 3D scenes from a single image by focusing specifically on the image-to-3D lifting component of the reconstruction process. By decoupling the lifting problem (converting an image to a 3D model representing what is visible) from the completion problem (hallucinating content not present in the input), we create a more deterministic task suitable for discriminative models. Our method employs visibility masks derived from optimized 3D Gaussian splats to exclude areas not visible from the source view during training. This masked training strategy significantly improves reconstruction quality in visible regions compared to strong baselines. Notably, despite being trained only on masked regions, Image2GS remains competitive with state-of-the-art discriminative models trained on full target images when evaluated on complete scenes. Our findings highlight the fundamental struggle discriminative models face when fitting unseen regions and demonstrate the advantages of addressing image-to-3D lifting as a distinct problem with specialized techniques.
中文: Image2GS提出了一种从单张图像重建逼真3D场景的新方法,通过专注于图像到3D的提升过程,在训练中使用可见性掩码来提高可见区域的还原质量,同时与先进模型保持竞争力。
English: Image2GS introduces a novel method for reconstructing photorealistic 3D scenes from a single image by focusing on the image-to-3D lifting process, using visibility masks during training to enhance reconstruction quality in visible areas while remaining competitive with state-of-the-art models.
Authors:Ruixiang Zhang, Shuangfei Zhai, Jiatao Gu, Yizhe Zhang, Huangjie Zheng, Tianrong Chen, Miguel Angel Bautista, Josh Susskind, Navdeep Jaitly
Abstract:
Autoregressive models have driven remarkable progress in language modeling. Their foundational reliance on discrete tokens, unidirectional context, and single-pass decoding, while central to their success, also inspires the exploration of a design space that could offer new axes of modeling flexibility. In this work, we explore an alternative paradigm, shifting language modeling from a discrete token space to a continuous latent space. We propose a novel framework TarFlowLM, that employs transformer-based autoregressive normalizing flows to model these continuous representations. This approach unlocks substantial flexibility, enabling the construction of models that can capture global bi-directional context through stacked, alternating-direction autoregressive transformations, support block-wise generation with flexible token patch sizes, and facilitate a hierarchical multi-pass generation process. We further propose new mixture-based coupling transformations designed to capture complex dependencies within the latent space shaped by discrete data, and demonstrate theoretical connections to conventional discrete autoregressive models. Extensive experiments on language modeling benchmarks demonstrate strong likelihood performance and highlight the flexible modeling capabilities inherent in our framework.
Autoregressive models' success in language modeling inspires the exploration of alternative designs, leading to the proposal of TarFlowLM, a novel framework that shifts modeling to a continuous latent space using transformer-based normalizing flows to enable flexible bidirectional context modeling and hierarchical generation.
English Summary:
Authors:Sanghun Jung, Jingjing Zheng, Ke Zhang, Nan Qiao, Albert Y. C. Chen, Lu Xia, Chi Liu, Yuyin Sun, Xiao Zeng, Hsiang-Wei Huang, Byron Boots, Min Sun, Cheng-Hao Kuo
Abstract:
Unlike closed-vocabulary 3D instance segmentation that is often trained end-to-end, open-vocabulary 3D instance segmentation (OV-3DIS) often leverages vision-language models (VLMs) to generate 3D instance proposals and classify them. While various concepts have been proposed from existing research, we observe that these individual concepts are not mutually exclusive but complementary. In this paper, we propose a new state-of-the-art solution for OV-3DIS by carefully designing a recipe to combine the concepts together and refining them to address key challenges. Our solution follows the two-stage scheme: 3D proposal generation and instance classification. We employ robust 3D tracking-based proposal aggregation to generate 3D proposals and remove overlapped or partial proposals by iterative merging/removal. For the classification stage, we replace the standard CLIP model with Alpha-CLIP, which incorporates object masks as an alpha channel to reduce background noise and obtain object-centric representation. Additionally, we introduce the standardized maximum similarity (SMS) score to normalize text-to-proposal similarity, effectively filtering out false positives and boosting precision. Our framework achieves state-of-the-art performance on ScanNet200 and S3DIS across all AP and AR metrics, even surpassing an end-to-end closed-vocabulary method.
中文: 本文提出了一种先进的开放词汇3D实例分割框架,通过结合鲁棒的3D提案生成与Alpha-CLIP分类器及标准化相似度评分,在多个基准数据集上实现了最优性能。
English: This paper introduces a state-of-the-art open-vocabulary 3D instance segmentation framework that combines robust 3D proposal generation with Alpha-CLIP classification and standardized similarity scoring to achieve superior performance on benchmark datasets.
Authors:David Noever, Forrest McKee
Abstract:
This research work demonstrates that current AI systems fail catastrophically on auditory tasks that humans perform effortlessly. Drawing inspiration from Moravec's paradox (i.e., tasks simple for humans often prove difficult for machines, and vice versa), we introduce an auditory Turing test comprising 917 challenges across seven categories: overlapping speech, speech in noise, temporal distortion, spatial audio, coffee-shop noise, phone distortion, and perceptual illusions. Our evaluation of state-of-the-art audio models including GPT-4's audio capabilities and OpenAI's Whisper reveals a striking failure rate exceeding 93%, with even the best-performing model achieving only 6.9% accuracy on tasks that humans solved at 7.5 times higher success (52%). These results expose focusing failures in how AI systems process complex auditory scenes, particularly in selective attention, noise robustness, and contextual adaptation. Our benchmark not only quantifies the human-machine auditory gap but also provides insights into why these failures occur, suggesting that current architectures lack fundamental mechanisms for human-like auditory scene analysis. The traditional design of audio CAPTCHAs highlights common filters that humans evolved but machines fail to select in multimodal language models. This work establishes a diagnostic framework for measuring progress toward human-level machine listening and highlights the need for novel approaches integrating selective attention, physics-based audio understanding, and context-aware perception into multimodal AI systems.
中文摘要:该研究表明,当前AI系统在人类轻松应对的听觉任务上表现惨败,听觉图灵测试中失败率超93%,暴露了机器在选择性注意和场景分析方面的根本缺陷。
English Summary: This study reveals that current AI systems perform catastrophically on auditory tasks where humans excel, with models failing 93% of challenges in an auditory Turing test, highlighting critical gaps in selective attention and contextual adaptation.
Authors:Yanqing Liu, Ruiqing Xue, Chong Zhang, Yufei Liu, Gang Wang, Bohan Li, Yao Qian, Lei He, Shujie Liu, Sheng Zhao
Abstract:
While diffusion and autoregressive (AR) models have significantly advanced generative modeling, they each present distinct limitations. AR models, which rely on causal attention, cannot exploit future context and suffer from slow generation speeds. Conversely, diffusion models struggle with key-value (KV) caching. To overcome these challenges, we introduce Dragon-FM, a novel text-to-speech (TTS) design that unifies AR and flow-matching. This model processes 48 kHz audio codec tokens in chunks at a compact rate of 12.5 tokens per second. This design enables AR modeling across chunks, ensuring global coherence, while parallel flow-matching within chunks facilitates fast iterative denoising. Thus, the model leverages KV-cache across chunks and utilizes bidirectional context within each chunk. Furthermore, it bridges continuous and discrete feature modeling, demonstrating that continuous AR flow-matching can predict discrete tokens with finite scalar quantizers. This efficient codec and fast chunk-autoregressive architecture also make the model highly effective for generating long-form content, such as podcasts. Experiments on podcast datasets demonstrate its capability to efficiently generate high-quality zero-shot podcasts.
中文摘要:Dragon-FM模型通过结合自回归与流匹配技术,克服了现有生成模型的局限性,能够高效生成具有全局连贯性的48 kHz高质量音频,并实现快速处理长形式内容(如播客)的能力。
English Summary: The Dragon-FM model combines autoregressive and flow-matching techniques to overcome limitations in existing generative models, enabling efficient generation of high-quality 48 kHz audio with global coherence and fast processing for long-form content like podcasts.
Authors:Wenjie Jacky Mo, Qin Liu, Xiaofei Wen, Dongwon Jung, Hadi Askari, Wenxuan Zhou, Zhe Zhao, Muhao Chen
Abstract:
Large Language Models (LLMs) for code generation (i.e., Code LLMs) have demonstrated impressive capabilities in AI-assisted software development and testing. However, recent studies have shown that these models are prone to generating vulnerable or even malicious code under adversarial settings. Existing red-teaming approaches rely on extensive human effort, limiting their scalability and practicality, and generally overlook the interactive nature of real-world AI-assisted programming, which often unfolds over multiple turns. To bridge these gaps, we present RedCoder, a red-teaming agent that engages victim models in multi-turn conversation to elicit vulnerable code. The pipeline to construct RedCoder begins with a multi-agent gaming process that simulates adversarial interactions, yielding a set of prototype conversations and an arsenal of reusable attack strategies. We then fine-tune an LLM on these prototype conversations to serve as the backbone of RedCoder. Once deployed, RedCoder autonomously engages Code LLMs in multi-turn conversations, dynamically retrieving relevant strategies from the arsenal to steer the dialogue toward vulnerability-inducing outputs. Experiments across multiple Code LLMs show that our approach outperforms prior single-turn and multi-turn red-team methods in inducing vulnerabilities in code generation, offering a scalable and effective tool for evaluating the security boundaries of modern code-generation systems.
中文: RedCoder是一个多轮红队代理,通过动态检索攻击策略有效诱导代码大模型生成易受攻击的代码,在安全评估方面优于现有方法。
English: RedCoder is a multi-turn red-teaming agent that uses dynamically retrieved attack strategies to effectively elicit vulnerable code from Code LLMs, outperforming existing methods in security evaluation.
Authors:Vladislav Li, Ilias Siniosoglou, Panagiotis Sarigiannidis, Vasileios Argyriou
Abstract:
In contemporary training for industrial manufacturing, reconciling theoretical knowledge with practical experience continues to be a significant difficulty. As companies transition to more intricate and technology-oriented settings, conventional training methods frequently inadequately equip workers with essential practical skills while maintaining safety and efficiency. Virtual Reality has emerged as a transformational instrument to tackle this issue by providing immersive, interactive, and risk-free teaching experiences. Through the simulation of authentic industrial environments, virtual reality facilitates the acquisition of vital skills for trainees within a regulated and stimulating context, therefore mitigating the hazards linked to experiential learning in the workplace. This paper presents a sophisticated VR-based industrial training architecture aimed at improving learning efficacy via high-fidelity simulations, dynamic and context-sensitive scenarios, and adaptive feedback systems. The suggested system incorporates intuitive gesture-based controls, reducing the learning curve for users across all skill levels. A new scoring metric, namely, VR Training Scenario Score (VRTSS), is used to assess trainee performance dynamically, guaranteeing ongoing engagement and incentive. The experimental assessment of the system reveals promising outcomes, with significant enhancements in information retention, task execution precision, and overall training efficacy. The results highlight the capability of VR as a crucial instrument in industrial training, providing a scalable, interactive, and efficient substitute for conventional learning methods.
中文摘要:虚拟现实通过高保真模拟和自适应反馈系统,为工业制造提供沉浸式安全培训方案,有效提升技能掌握与培训效率。
English Summary: Virtual Reality offers an immersive and safe training solution for industrial manufacturing, enhancing skill acquisition and efficiency through high-fidelity simulations and adaptive feedback systems.
Authors:Yanda Chen, Zihui Ren, Qixiang Gao, Jiale Chen, Si Chen, Xubin Li, Tiezheng Ge, Bo Zheng
Abstract:
Advertising text plays a critical role in determining click-through rates (CTR) in online advertising. Large Language Models (LLMs) offer significant efficiency advantages over manual ad text creation. However, LLM-generated ad texts do not guarantee higher CTR performance compared to human-crafted texts, revealing a gap between generation quality and online performance of ad texts. In this work, we propose a novel ad text generation method which optimizes for CTR through preference optimization from online feedback. Our approach adopts an innovative two-stage framework: (1) diverse ad text sampling via one-shot in-context learning, using retrieval-augmented generation (RAG) to provide exemplars with chain-of-thought (CoT) reasoning; (2) CTR-driven preference optimization from online feedback, which weighs preference pairs according to their CTR gains and confidence levels. Through our method, the resulting model enables end-to-end generation of high-CTR ad texts. Extensive experiments have demonstrated the effectiveness of our method in both offline and online metrics. Notably, we have applied our method on a large-scale online shopping platform and achieved significant CTR improvements, showcasing its strong applicability and effectiveness in advertising systems.
中文摘要:本研究提出了一种新颖的两阶段广告文本生成方法,通过在线反馈的偏好学习优化点击率,在离线和在线评估中均展现出显著的CTR提升效果。
English Summary: This study introduces a novel two-stage ad text generation method that optimizes click-through rates through preference learning from online feedback, demonstrating significant CTR improvements in both offline and online evaluations.
Authors:Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G. Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, Omar Khattab
Abstract:
Large language models (LLMs) are increasingly adapted to downstream tasks via reinforcement learning (RL) methods like Group Relative Policy Optimization (GRPO), which often require thousands of rollouts to learn new tasks. We argue that the interpretable nature of language can often provide a much richer learning medium for LLMs, compared with policy gradients derived from sparse, scalar rewards. To test this, we introduce GEPA (Genetic-Pareto), a prompt optimizer that thoroughly incorporates natural language reflection to learn high-level rules from trial and error. Given any AI system containing one or more LLM prompts, GEPA samples system-level trajectories (e.g., reasoning, tool calls, and tool outputs) and reflects on them in natural language to diagnose problems, propose and test prompt updates, and combine complementary lessons from the Pareto frontier of its own attempts. As a result of GEPA's design, it can often turn even just a few rollouts into a large quality gain. Across four tasks, GEPA outperforms GRPO by 10% on average and by up to 20%, while using up to 35x fewer rollouts. GEPA also outperforms the leading prompt optimizer, MIPROv2, by over 10% across two LLMs, and demonstrates promising results as an inference-time search strategy for code optimization.
中文摘要:GEPA作为一种利用自然语言反思的提示优化器,在效率和效果上显著优于GRPO等强化学习方法,能以极少的尝试次数获得更优的性能表现。
English Summary: GEPA, a prompt optimizer using natural language reflection, significantly outperforms reinforcement learning methods like GRPO in efficiency and effectiveness, achieving better results with far fewer rollouts.
Authors:Jingxuan Wei, Caijun Jia, Qi Chen, Yujun Cai, Linzhuang Sun, Xiangxiang Zhang, Gaowei Wu, Bihui Yu
Abstract:
Multimodal Machine Translation (MMT) enhances translation quality by incorporating visual context, helping to resolve textual ambiguities. While existing MMT methods perform well in bilingual settings, extending them to multilingual translation remains challenging due to cross-lingual interference and ineffective parameter-sharing strategies. To address this, we propose LLaVA-NeuMT, a novel multimodal multilingual translation framework that explicitly models language-specific and language-agnostic representations to mitigate multilingual interference. Our approach consists of a layer selection mechanism that identifies the most informative layers for different language pairs and a neuron-level adaptation strategy that dynamically selects language-specific and agnostic neurons to improve translation quality while reducing redundancy. We conduct extensive experiments on the M3-Multi30K and M3-AmbigCaps datasets, demonstrating that LLaVA-NeuMT, while fine-tuning only 40\% of the model parameters, surpasses full fine-tuning approaches and ultimately achieves SOTA results on both datasets. Our analysis further provides insights into the importance of selected layers and neurons in multimodal multilingual adaptation, offering an efficient and scalable solution to cross-lingual adaptation in multimodal translation.
中文:提出的LLaVA-NeuMT框架通过自适应层级选择与神经元调整机制,显式建模语言相关与无关表征来提升多模态多语言翻译性能,仅需微调40%参数即在两个基准数据集上取得最优结果。
English: The proposed LLaVA-NeuMT framework enhances multimodal multilingual translation by modeling language-specific and agnostic representations with adaptive layer and neuron selection, achieving state-of-the-art results using only 40% fine-tuned parameters.
Authors:Suhang Wu, Jialong Tang, Chengyi Yang, Pei Zhang, Baosong Yang, Junhui Li, Junfeng Yao, Min Zhang, Jinsong Su
Abstract:
Direct speech translation (ST) has garnered increasing attention nowadays, yet the accurate translation of terminology within utterances remains a great challenge. In this regard, current studies mainly concentrate on leveraging various translation knowledge into ST models. However, these methods often struggle with interference from irrelevant noise and can not fully utilize the translation knowledge. To address these issues, in this paper, we propose a novel Locate-and-Focus method for terminology translation. It first effectively locates the speech clips containing terminologies within the utterance to construct translation knowledge, minimizing irrelevant information for the ST model. Subsequently, it associates the translation knowledge with the utterance and hypothesis from both audio and textual modalities, allowing the ST model to better focus on translation knowledge during translation. Experimental results across various datasets demonstrate that our method effectively locates terminologies within utterances and enhances the success rate of terminology translation, while maintaining robust general translation performance.
中文: 本文提出的“定位与聚焦”方法通过首先识别语音中包含术语的片段以减少干扰,然后在音频和文本模态中整合翻译知识,显著提升了术语翻译的准确率,同时保持了整体翻译性能。
English: The proposed Locate-and-Focus method enhances direct speech translation by first identifying terminology-containing speech segments to minimize noise and then integrating translation knowledge across audio and text modalities, significantly improving terminology translation accuracy while preserving overall performance.
Authors:Rayan Mazouz, Luca Laurenti, Morteza Lahijanian
Abstract:
This paper presents a method for the simultaneous synthesis of a barrier certificate and a safe controller for discrete-time nonlinear stochastic systems. Our approach, based on piecewise stochastic control barrier functions, reduces the synthesis problem to a minimax optimization, which we solve exactly using a dual linear program with zero gap. This enables the joint optimization of the barrier certificate and safe controller within a single formulation. The method accommodates stochastic dynamics with additive noise and a bounded continuous control set. The synthesized controllers and barrier certificates provide a formally guaranteed lower bound on probabilistic safety. Case studies on linear and nonlinear stochastic systems validate the effectiveness of our approach.
中文摘要: 本文提出了一种基于分段随机控制屏障函数的方法,通过零间隙对偶线性规划求解极小极大优化,实现了离散时间非线性随机系统的屏障证书与安全控制器协同合成,并提供了概率安全性的形式化保证。
English Summary: This paper introduces a method for jointly synthesizing barrier certificates and safe controllers for discrete-time nonlinear stochastic systems through minimax optimization solved via a dual linear program, ensuring formally guaranteed probabilistic safety.
Authors:Xiaohao Liu, Xiaobo Xia, See-Kiong Ng, Tat-Seng Chua
Abstract:
Multimodal representation learning seeks to create a unified representation space by integrating diverse data modalities to improve multimodal understanding. Traditional methods often depend on pairwise contrastive learning, which relies on a predefined anchor modality, restricting alignment across all modalities. Recent advances have investigated the simultaneous alignment of multiple modalities, yet several challenges remain, such as limitations imposed by fixed anchor points and instability arising from optimizing the product of singular values. To address the challenges, in this paper, we propose Principled Multimodal Representation Learning (PMRL), a novel framework that achieves simultaneous alignment of multiple modalities without anchor dependency in a more stable manner. Specifically, grounded in the theoretical insight that full alignment corresponds to a rank-1 Gram matrix, PMRL optimizes the dominant singular value of the representation matrix to align modalities along a shared leading direction. We propose a softmax-based loss function that treats singular values as logits to prioritize the largest singular value. Besides, instance-wise contrastive regularization on the leading eigenvectors maintains inter-instance separability and prevents representation collapse. Extensive experiments across diverse tasks demonstrate PMRL's superiority compared to baseline methods. The source code will be publicly available.
中文: 本文提出原则性多模态表示学习(PMRL)框架,通过优化表示矩阵的主奇异值实现无需锚点的多模态同步对齐,在多项任务中展现出优越性能。
English: This paper introduces Principled Multimodal Representation Learning (PMRL), a novel framework that achieves simultaneous alignment of multiple modalities without anchor dependency by optimizing the dominant singular value of the representation matrix, demonstrating superior performance across diverse tasks.
Authors:Qifan Zhang, Nuo Chen, Zehua Li, Miao Peng, Jing Tang, Jia Li
Abstract:
Large Language Models (LLMs) have made remarkable strides in reasoning tasks, yet their performance often falters on novel and complex problems. Domain-specific continued pretraining (CPT) methods, such as those tailored for mathematical reasoning, have shown promise but lack transferability to broader reasoning tasks. In this work, we pioneer the use of Graph Problem Reasoning (GPR) to enhance the general reasoning capabilities of LLMs. GPR tasks, spanning pathfinding, network analysis, numerical computation, and topological reasoning, require sophisticated logical and relational reasoning, making them ideal for teaching diverse reasoning patterns. To achieve this, we introduce GraphPile, the first large-scale corpus specifically designed for CPT using GPR data. Spanning 10.9 billion tokens across 23 graph tasks, the dataset includes chain-of-thought, program-of-thought, trace of execution, and real-world graph data. Using GraphPile, we train GraphMind on popular base models Llama 3 and 3.1, as well as Gemma 2, achieving up to 4.9 percent higher accuracy in mathematical reasoning and up to 21.2 percent improvement in non-mathematical reasoning tasks such as logical and commonsense reasoning. By being the first to harness GPR for enhancing reasoning patterns and introducing the first dataset of its kind, our work bridges the gap between domain-specific pretraining and universal reasoning capabilities, advancing the adaptability and robustness of LLMs.
Chinese: 本研究首创性地利用图问题推理构建了GraphPile大规模数据集,通过持续预训练显著提升了大型语言模型在数学与非数学推理任务上的表现,有效弥合了领域专用预训练与通用推理能力之间的鸿沟。
English: This study introduces GraphPile, a pioneering large-scale dataset for graph problem reasoning, and demonstrates that continued pretraining with it significantly enhances both mathematical and non-mathematical reasoning in large language models, bridging domain-specific and universal reasoning capabilities.
Authors:Luchuan Song, Yang Zhou, Zhan Xu, Yi Zhou, Deepali Aneja, Chenliang Xu
Abstract:
We propose StreamME, a method focuses on fast 3D avatar reconstruction. The StreamME synchronously records and reconstructs a head avatar from live video streams without any pre-cached data, enabling seamless integration of the reconstructed appearance into downstream applications. This exceptionally fast training strategy, which we refer to as on-the-fly training, is central to our approach. Our method is built upon 3D Gaussian Splatting (3DGS), eliminating the reliance on MLPs in deformable 3DGS and relying solely on geometry, which significantly improves the adaptation speed to facial expression. To further ensure high efficiency in on-the-fly training, we introduced a simplification strategy based on primary points, which distributes the point clouds more sparsely across the facial surface, optimizing points number while maintaining rendering quality. Leveraging the on-the-fly training capabilities, our method protects the facial privacy and reduces communication bandwidth in VR system or online conference. Additionally, it can be directly applied to downstream application such as animation, toonify, and relighting. Please refer to our project page for more details: https://songluchuan.github.io/StreamME/.
中文: StreamME是一种快速3D虚拟形象重建方法,通过基于3D高斯泼溅的实时训练技术,能在直播视频流中同步构建头像模型,既保护面部隐私又优化传输带宽。
English: StreamME is a fast 3D avatar reconstruction method that uses on-the-fly training with 3D Gaussian Splatting to enable real-time avatar creation from live video streams while ensuring facial privacy and efficient bandwidth usage.
Authors:Bo Hou, Xin Tan, Kai Zheng, Fang Liu, Yinghao Zhu, Li Zhang
Abstract:
Atomic commits, each of which addresses a single development concern, are a best practice in software development. However, developers frequently produce tangled commits that mix unrelated changes due to practical constraints or unclear boundaries, negatively impacting code review and maintenance. Although prior commit untangling approaches: rule-based, feature-based, or graph-based, have made progress, they often rely on shallow signals and fail to distinguish between explicit dependencies (e.g., control/data flow) and implicit ones (e.g., semantic or conceptual relationships). In this paper, we propose ColaUntangle, a new collaborative consultation framework for commit untangling that models both explicit and implicit dependencies among code changes. ColaUntangle integrates Large Language Model (LLM)-driven agents in a multi-agent architecture: one agent specializes in explicit dependencies, another in implicit ones, and a reviewer agent synthesizes their perspectives through iterative consultation. To capture explicit and implicit contextual information, we construct multi-version Program Dependency Graphs (delta-PDG), enabling agents to reason over code relationships with both symbolic and semantic depth. We evaluate ColaUntangle on two widely-used datasets (1,612 C# and 14k Java tangled commits). Experimental results show that ColaUntangle outperforms the best-performing baseline, achieving an improvement of 44% on the C# dataset and 100% on the Java dataset. These findings highlight the potential of LLM-based collaborative frameworks for advancing automated commit untangling tasks.
中文: 本文提出ColaUntangle框架,通过多智能体协同建模代码变更的显性与隐性依赖关系,在两大基准数据集上显著优于现有最佳方法,展现了LLM驱动框架在提交解耦任务中的潜力。
English: This paper introduces ColaUntangle, a collaborative LLM-driven framework that untangles software commits by modeling both explicit and implicit dependencies through specialized agents, demonstrating significant performance improvements over existing methods.
Authors:Elahe Vedadi, David Barrett, Natalie Harris, Ellery Wulczyn, Shashir Reddy, Roma Ruparel, Mike Schaekermann, Tim Strother, Ryutaro Tanno, Yash Sharma, Jihyeon Lee, CÃan Hughes, Dylan Slack, Anil Palepu, Jan Freyberg, Khaled Saab, Valentin Liévin, Wei-Hung Weng, Tao Tu, Yun Liu, Nenad Tomasev, Kavita Kulkarni, S. Sara Mahdavi, Kelvin Guu, Joëlle Barral, Dale R. Webster, James Manyika, Avinatan Hassidim, Katherine Chou, Yossi Matias, Pushmeet Kohli, Adam Rodman, Vivek Natarajan, Alan Karthikesalingam, David Stutz
Abstract:
Recent work has demonstrated the promise of conversational AI systems for diagnostic dialogue. However, real-world assurance of patient safety means that providing individual diagnoses and treatment plans is considered a regulated activity by licensed professionals. Furthermore, physicians commonly oversee other team members in such activities, including nurse practitioners (NPs) or physician assistants/associates (PAs). Inspired by this, we propose a framework for effective, asynchronous oversight of the Articulate Medical Intelligence Explorer (AMIE) AI system. We propose guardrailed-AMIE (g-AMIE), a multi-agent system that performs history taking within guardrails, abstaining from individualized medical advice. Afterwards, g-AMIE conveys assessments to an overseeing primary care physician (PCP) in a clinician cockpit interface. The PCP provides oversight and retains accountability of the clinical decision. This effectively decouples oversight from intake and can thus happen asynchronously. In a randomized, blinded virtual Objective Structured Clinical Examination (OSCE) of text consultations with asynchronous oversight, we compared g-AMIE to NPs/PAs or a group of PCPs under the same guardrails. Across 60 scenarios, g-AMIE outperformed both groups in performing high-quality intake, summarizing cases, and proposing diagnoses and management plans for the overseeing PCP to review. This resulted in higher quality composite decisions. PCP oversight of g-AMIE was also more time-efficient than standalone PCP consultations in prior work. While our study does not replicate existing clinical practices and likely underestimates clinicians' capabilities, our results demonstrate the promise of asynchronous oversight as a feasible paradigm for diagnostic AI systems to operate under expert human oversight for enhancing real-world care.
中文: 提出的g-AMIE系统通过异步监督机制,在避免直接诊断的前提下实现AI辅助病史采集,临床模拟显示其在遵循相同安全规范时表现优于人类医疗提供者。
English: The proposed g-AMIE system enables asynchronous physician oversight of AI-driven medical history collection while abstaining from direct diagnosis, demonstrating superior performance in clinical simulations compared to human providers under equivalent safety constraints.
Authors:Renxi Wang, Rifo Ahmad Genadi, Bilal El Bouardi, Yongxin Wang, Fajri Koto, Zhengzhong Liu, Timothy Baldwin, Haonan Li
Abstract:
Language model (LM) agents have gained significant attention for their ability to autonomously complete tasks through interactions with environments, tools, and APIs. LM agents are primarily built with prompt engineering or supervised finetuning. At the same time, reinforcement learning (RL) has been explored to enhance LM's capabilities, such as reasoning and factuality. However, the combination of the LM agents and reinforcement learning (Agent-RL) remains underexplored and lacks systematic study. To this end, we built AgentFly, a scalable and extensible Agent-RL framework designed to empower LM agents with a variety of RL algorithms. Our framework supports multi-turn interactions by adapting traditional RL methods with token-level masking. It features a decorator-based interface for defining tools and reward functions, enabling seamless extension and ease of use. To support high-throughput training, we implement asynchronous execution of tool calls and reward computations, and design a centralized resource management system for scalable environment coordination. We also provide a suite of prebuilt tools and environments, demonstrating the framework's effectiveness through successful agent training across multiple tasks.
中文: 语言模型代理主要通过提示工程或监督微调构建,强化学习虽能增强其推理与事实性能力,但两者结合的研究尚不充分;为此开发的AgentFly框架具备可扩展性,支持多轮交互、工具定义及高效训练,成功应用于多任务代理训练。
English: Language model agents are increasingly developed through prompt engineering or supervised fine-tuning, with reinforcement learning (RL) enhancing their capabilities, yet the integration of LM agents and RL remains understudied, leading to the creation of AgentFly, a scalable framework that supports multi-turn interactions, tool definition, and high-throughput training for effective agent development.
Authors:Yu Wei, Jiahui Zhang, Xiaoqin Zhang, Ling Shao, Shijian Lu
Abstract:
COLMAP-free 3D Gaussian Splatting (3D-GS) has recently attracted increasing attention due to its remarkable performance in reconstructing high-quality 3D scenes from unposed images or videos. However, it often struggles to handle scenes with complex camera trajectories as featured by drastic rotation and translation across adjacent camera views, leading to degraded estimation of camera poses and further local minima in joint optimization of camera poses and 3D-GS. We propose PCR-GS, an innovative COLMAP-free 3DGS technique that achieves superior 3D scene modeling and camera pose estimation via camera pose co-regularization. PCR-GS achieves regularization from two perspectives. The first is feature reprojection regularization which extracts view-robust DINO features from adjacent camera views and aligns their semantic information for camera pose regularization. The second is wavelet-based frequency regularization which exploits discrepancy in high-frequency details to further optimize the rotation matrix in camera poses. Extensive experiments over multiple real-world scenes show that the proposed PCR-GS achieves superior pose-free 3D-GS scene modeling under dramatic changes of camera trajectories.
Chinese: PCR-GS提出了一种无需COLMAP的3D高斯溅射技术,通过特征重投影和小波频率双重正则化方法,在复杂相机轨迹下实现了优越的三维场景建模与相机姿态估计。
English: PCR-GS introduces a COLMAP-free 3D Gaussian Splatting technique that enhances 3D scene modeling and camera pose estimation through dual regularization methods, effectively handling complex camera trajectories with drastic movements.
Authors:Yikang Liu, Wanyang Zhang, Yiming Wang, Jialong Tang, Pei Zhang, Baosong Yang, Fei Huang, Rui Wang, Hai Hu
Abstract:
Translationese refers to linguistic properties that usually occur in translated texts. Previous works study translationese by framing it as a binary classification between original texts and translated texts. In this paper, we argue that translationese should be graded instead of binary and propose the first measure for translationese -- the translationese-index (T-index), computed from the likelihood ratios of two contrastively fine-tuned language models (LMs). We use synthesized translations and translations in the wild to evaluate T-index's generalizability in cross-domain settings and its validity against human judgments. Our results show that T-index can generalize to unseen genres, authors, and language pairs. Moreover, T-index computed using two 0.5B LMs fine-tuned on only 1-5k pairs of synthetic data can effectively capture translationese, as demonstrated by alignment with human pointwise ratings and pairwise judgments. Additionally, the correlation between T-index and existing machine translation (MT) quality estimation (QE) metrics such as BLEU and COMET is low, suggesting that T-index is not covered by these metrics and can serve as a complementary metric in MT QE.
中文摘要:本文提出翻译文本指数(T-index),通过微调语言模型计算翻译文本的梯度特征,该指标在跨领域场景中展现良好泛化能力,与人工评估高度一致,并能作为机器翻译质量评估的补充指标。
English Summary: This paper introduces the Translationese-Index (T-index), a graded measurement of translationese computed using fine-tuned language models, which demonstrates strong generalizability across domains and aligns well with human judgments while serving as a complementary metric to existing machine translation quality assessments.
Authors:Yao Lu, Hongyu Gao, Zhuangzhi Chen, Dongwei Xu, Yun Lin, Qi Xuan, Guan Gui
Abstract:
Although deep neural networks have made remarkable achievements in the field of automatic modulation recognition (AMR), these models often require a large amount of labeled data for training. However, in many practical scenarios, the available target domain data is scarce and difficult to meet the needs of model training. The most direct way is to collect data manually and perform expert annotation, but the high time and labor costs are unbearable. Another common method is data augmentation. Although it can enrich training samples to a certain extent, it does not introduce new data and therefore cannot fundamentally solve the problem of data scarcity. To address these challenges, we introduce a data expansion framework called Dynamic Uncertainty-driven Sample Expansion (DUSE). Specifically, DUSE uses an uncertainty scoring function to filter out useful samples from relevant AMR datasets and employs an active learning strategy to continuously refine the scorer. Extensive experiments demonstrate that DUSE consistently outperforms 8 coreset selection baselines in both class-balance and class-imbalance settings. Besides, DUSE exhibits strong cross-architecture generalization for unseen models.
中文: 提出的动态不确定性驱动样本扩展(DUSE)框架通过不确定性评分和主动学习从现有数据集中筛选有价值样本,有效解决了自动调制识别中的数据稀缺问题,在多种实验设置下均优于多个基线方法。
English: The proposed Dynamic Uncertainty-driven Sample Expansion (DUSE) framework effectively addresses data scarcity in automatic modulation recognition by filtering valuable samples from existing datasets through uncertainty scoring and active learning, outperforming multiple baselines across various settings.
Authors:Sirui Han, Junqi Zhu, Ruiyuan Zhang, Yike Guo
Abstract:
This paper presents the development of HKGAI-V1, a foundational sovereign large language model (LLM), developed as part of an initiative to establish value-aligned AI infrastructure specifically tailored for Hong Kong. Addressing the region's unique multilingual environment (Cantonese, Mandarin, and English), its distinct socio-legal context under the "one country, two systems" framework, and specific local cultural and value considerations, the model is built upon the DeepSeek architecture and systematically aligned with regional norms through a multifaceted full parameter fine-tuning process. It is further integrated with a retrieval-augmented generation (RAG) system to ensure timely and factually grounded information access. The core contribution lies in the design and implementation of a comprehensive, region-specific AI alignment and safety framework, demonstrated through two key achievements: 1) The successful development of HKGAI-V1 itself - which outper-forms general-purpose models in handling Hong Kong-specific culturally sensitive queries, and embodies a "governance-embedded" approach to digital sovereignty - empowers Hong Kong to exercise control over AI applications in critical sectors including public services, legal systems, and edu-cation. 2) The development of the proprietary Adversarial HK Value Benchmark, a rigorous tool for evaluating model alignment with local ethical and legal stand-ards under challenging conditions. By documenting these achievements, the paper provides not only a technological artifact but also a replicable blueprint for developing advanced, regionally focused AI systems deeply rooted in their local identities.
中文: 本文介绍了专为香港多语言及社会法律环境设计的自主大语言模型HKGAI-V1,该模型在处理本土文化敏感问题上优于通用模型,并为关键领域提供了嵌入式治理框架。
English: This paper introduces HKGAI-V1, a sovereign large language model tailored for Hong Kong's multilingual and socio-legal context, which outperforms general models in local cultural sensitivity and integrates a governance-embedded framework for critical sectors.
Authors:Haoxuan Qu, Yujun Cai, Hossein Rahmani, Ajay Kumar, Junsong Yuan, Jun Liu
Abstract:
Recently, Gaussian Splatting (GS) has received a lot of attention in surface reconstruction. However, while 3D objects can be of complex and diverse shapes in the real world, existing GS-based methods only limitedly use a single type of splatting primitive (Gaussian ellipse or Gaussian ellipsoid) to represent object surfaces during their reconstruction. In this paper, we highlight that this can be insufficient for object surfaces to be represented in high quality. Thus, we propose a novel framework that, for the first time, enables Gaussian Splatting to incorporate multiple types of (geometrical) primitives during its surface reconstruction process. Specifically, in our framework, we first propose a compositional splatting strategy, enabling the splatting and rendering of different types of primitives in the Gaussian Splatting pipeline. In addition, we also design our framework with a mixed-primitive-based initialization strategy and a vertex pruning mechanism to further promote its surface representation learning process to be well executed leveraging different types of primitives. Extensive experiments show the efficacy of our framework and its accurate surface reconstruction performance.
中文: 本文提出了一种新颖的高斯泼溅框架,通过融合多种几何基元来提升表面重建质量,利用组合式泼溅策略和优化的初始化方法克服了单一基元表示的局限性。
English: This paper introduces a novel Gaussian Splatting framework that incorporates multiple geometric primitives to enhance surface reconstruction quality, addressing the limitations of single-primitive methods through compositional splatting and optimized initialization strategies.
Authors:Lingwei Kong, Lu Wang, Changping Peng, Zhangang Lin, Ching Law, Jingping Shao
Abstract:
Click-Through Rate (CTR) prediction models are integral to a myriad of industrial settings, such as personalized search advertising. Current methods typically involve feature extraction from users' historical behavior sequences combined with product information, feeding into a discriminative model that is trained on user feedback to estimate CTR. With the success of models such as GPT, the potential for generative models to enrich expressive power beyond discriminative models has become apparent. In light of this, we introduce a novel model that leverages generative models to enhance the precision of CTR predictions in discriminative models. To reconcile the disparate data aggregation needs of both model types, we design a two-stage training process: 1) Generative pre-training for next-item prediction with the given item category in user behavior sequences; 2) Fine-tuning the well-trained generative model within a discriminative CTR prediction framework. Our method's efficacy is substantiated through extensive experiments on a new dataset, and its significant utility is further corroborated by online A/B testing results. Currently, the model is deployed on one of the world's largest e-commerce platforms, and we intend to release the associated code and dataset in the future.
中文摘要:本文提出了一种新颖的两阶段模型,通过生成式预训练进行下一商品预测,并结合判别式微调来提升点击率预测精度,经实验验证并在全球大型电商平台实际部署,效果显著。
English Summary: This paper introduces a novel two-stage model that integrates generative pre-training for next-item prediction with discriminative fine-tuning to enhance Click-Through Rate prediction accuracy, demonstrating significant improvements through experiments and real-world deployment on a major e-commerce platform.
Authors:Tao Xiao, Youmei Fan, Fabio Calefato, Christoph Treude, Raula Gaikovina Kula, Hideaki Hata, Sebastian Baltes
Abstract:
The widespread adoption of generative AI (GenAI) tools such as GitHub Copilot and ChatGPT is transforming software development. Since generated source code is virtually impossible to distinguish from manually written code, their real-world usage and impact on open-source software development remain poorly understood. In this paper, we introduce the concept of self-admitted GenAI usage, that is, developers explicitly referring to the use of GenAI tools for content creation in software artifacts. Using this concept as a lens to study how GenAI tools are integrated into open-source software projects, we analyze a curated sample of more than 250,000 GitHub repositories, identifying 1,292 such self-admissions across 156 repositories in commit messages, code comments, and project documentation. Using a mixed methods approach, we derive a taxonomy of 32 tasks, 10 content types, and 11 purposes associated with GenAI usage based on 284 qualitatively coded mentions. We then analyze 13 documents with policies and usage guidelines for GenAI tools and conduct a developer survey to uncover the ethical, legal, and practical concerns behind them. Our findings reveal that developers actively manage how GenAI is used in their projects, highlighting the need for project-level transparency, attribution, and quality control practices in the new era of AI-assisted software development. Finally, we examine the impact of GenAI adoption on code churn in 151 repositories with self-admitted GenAI usage and find no general increase, contradicting popular narratives on the impact of GenAI on software development.
中文: 生成式AI工具通过开发者主动声明使用情况正改变软件开发,研究表明开发者通过透明度和质量控制积极管理AI整合,同时发现代码变更并未普遍增加。
English: Generative AI tools are transforming software development by enabling self-admitted usage tracking, with research revealing developers actively manage AI integration through transparency and quality control, while finding no evidence of increased code churn.
Authors:Hongxu Ma, Chenbo Zhang, Lu Zhang, Jiaogen Zhou, Jihong Guan, Shuigeng Zhou
Abstract:
Zero-shot object detection (ZSD) aims to leverage semantic descriptions to localize and recognize objects of both seen and unseen classes. Existing ZSD works are mainly coarse-grained object detection, where the classes are visually quite different, thus are relatively easy to distinguish. However, in real life we often have to face fine-grained object detection scenarios, where the classes are too similar to be easily distinguished. For example, detecting different kinds of birds, fishes, and flowers.
In this paper, we propose and solve a new problem called Fine-Grained Zero-Shot Object Detection (FG-ZSD for short), which aims to detect objects of different classes with minute differences in details under the ZSD paradigm. We develop an effective method called MSHC for the FG-ZSD task, which is based on an improved two-stage detector and employs a multi-level semantics-aware embedding alignment loss, ensuring tight coupling between the visual and semantic spaces. Considering that existing ZSD datasets are not suitable for the new FG-ZSD task, we build the first FG-ZSD benchmark dataset FGZSD-Birds, which contains 148,820 images falling into 36 orders, 140 families, 579 genera and 1432 species. Extensive experiments on FGZSD-Birds show that our method outperforms existing ZSD models.
中文: 本文提出了细粒度零样本目标检测(FG-ZSD)这一新问题,通过改进的双阶段检测器和多级语义感知嵌入对齐损失方法MSHC,在构建的首个FG-ZSD基准数据集FGZSD-Birds上验证了其优于现有模型的性能。
English: This paper introduces Fine-Grained Zero-Shot Object Detection (FG-ZSD), a novel approach that detects visually similar objects using semantic descriptions and presents the MSHC method with a multi-level alignment loss, validated on a new benchmark dataset FGZSD-Birds where it outperforms existing models.
Authors:Xinlong Ding, Hongwei Yu, Jiawei Li, Feifan Li, Yu Shang, Bochao Zou, Huimin Ma, Jiansheng Chen
Abstract:
Camera pose estimation is a fundamental computer vision task that is essential for applications like visual localization and multi-view stereo reconstruction. In the object-centric scenarios with sparse inputs, the accuracy of pose estimation can be significantly influenced by background textures that occupy major portions of the images across different viewpoints. In light of this, we introduce the Kaleidoscopic Background Attack (KBA), which uses identical segments to form discs with multi-fold radial symmetry. These discs maintain high similarity across different viewpoints, enabling effective attacks on pose estimation models even with natural texture segments. Additionally, a projected orientation consistency loss is proposed to optimize the kaleidoscopic segments, leading to significant enhancement in the attack effectiveness. Experimental results show that optimized adversarial kaleidoscopic backgrounds can effectively attack various camera pose estimation models.
中文总结:本文提出万花筒背景攻击(KBA),通过使用对称纹理圆盘干扰相机姿态估计模型,并利用投影方向一致性损失优化攻击效果,有效针对多种姿态估计方法实现对抗攻击。
English Summary: The paper introduces the Kaleidoscopic Background Attack (KBA), which uses symmetrical texture discs to deceive camera pose estimation models by exploiting background interference, and enhances attack effectiveness through a novel consistency loss optimization.
Authors:Yixun Zhang, Lizhi Wang, Junjun Zhao, Wending Zhao, Feng Zhou, Yonghao Dang, Jianqin Yin
Abstract:
Camera-based object detection systems play a vital role in autonomous driving, yet they remain vulnerable to adversarial threats in real-world environments. Existing 2D and 3D physical attacks, due to their focus on texture optimization, often struggle to balance physical realism and attack robustness. In this work, we propose 3D Gaussian-based Adversarial Attack (3DGAA), a novel adversarial object generation framework that leverages the full 14-dimensional parameterization of 3D Gaussian Splatting (3DGS) to jointly optimize geometry and appearance in physically realizable ways. Unlike prior works that rely on patches or texture optimization, 3DGAA jointly perturbs both geometric attributes (shape, scale, rotation) and appearance attributes (color, opacity) to produce physically realistic and transferable adversarial objects. We further introduce a physical filtering module that filters outliers to preserve geometric fidelity, and a physical augmentation module that simulates complex physical scenarios to enhance attack generalization under real-world conditions. We evaluate 3DGAA on both virtual benchmarks and physical-world setups using miniature vehicle models. Experimental results show that 3DGAA achieves to reduce the detection mAP from 87.21\% to 7.38\%, significantly outperforming existing 3D physical attacks. Moreover, our method maintains high transferability across different physical conditions, demonstrating a new state-of-the-art in physically realizable adversarial attacks.
中文: 提出的3D高斯对抗攻击框架通过联合优化几何与外观属性,生成物理真实的对抗物体,在显著降低检测精度的同时保持跨场景的高迁移性,实现了物理可实现的对抗攻击新突破。
English: The proposed 3D Gaussian-based Adversarial Attack (3DGAA) framework jointly optimizes geometry and appearance through 3D Gaussian Splatting to create physically realistic adversarial objects that significantly reduce object detection accuracy while maintaining strong transferability across different conditions.
Authors:Yixun Zhang, Lizhi Wang, Junjun Zhao, Wending Zhao, Feng Zhou, Yonghao Dang, Jianqin Yin
Abstract:
Camera-based object detection systems play a vital role in autonomous driving, yet they remain vulnerable to adversarial threats in real-world environments. Existing 2D and 3D physical attacks, due to their focus on texture optimization, often struggle to balance physical realism and attack robustness. In this work, we propose 3D Gaussian-based Adversarial Attack (3DGAA), a novel adversarial object generation framework that leverages the full 14-dimensional parameterization of 3D Gaussian Splatting (3DGS) to jointly optimize geometry and appearance in physically realizable ways. Unlike prior works that rely on patches or texture optimization, 3DGAA jointly perturbs both geometric attributes (shape, scale, rotation) and appearance attributes (color, opacity) to produce physically realistic and transferable adversarial objects. We further introduce a physical filtering module that filters outliers to preserve geometric fidelity, and a physical augmentation module that simulates complex physical scenarios to enhance attack generalization under real-world conditions. We evaluate 3DGAA on both virtual benchmarks and physical-world setups using miniature vehicle models. Experimental results show that 3DGAA achieves to reduce the detection mAP from 87.21\% to 7.38\%, significantly outperforming existing 3D physical attacks. Moreover, our method maintains high transferability across different physical conditions, demonstrating a new state-of-the-art in physically realizable adversarial attacks.
中文: 提出的3D高斯对抗攻击框架通过联合优化几何与外观属性,生成物理真实的对抗物体,在显著降低检测精度的同时保持跨场景的高迁移性,实现了物理可实现的对抗攻击新突破。
English: The proposed 3D Gaussian-based Adversarial Attack (3DGAA) framework jointly optimizes geometry and appearance through 3D Gaussian Splatting to create physically realistic adversarial objects that significantly reduce object detection accuracy while maintaining strong transferability across different conditions.
Authors:David Noever, Forrest McKee
Abstract:
This paper presents a novel method of executable steganography using the alpha transparency layer of ICO image files to embed and deliver self-decompressing JavaScript payloads within web browsers. By targeting the least significant bit (LSB) of non-transparent alpha layer image values, the proposed method successfully conceals compressed JavaScript code inside a favicon image without affecting visual fidelity. Global web traffic loads 294 billion favicons daily and consume 0.9 petabytes of network bandwidth. A proof-of-concept implementation demonstrates that a 64x64 ICO image can embed up to 512 bytes uncompressed, or 0.8 kilobyte when using lightweight two-fold compression. On page load, a browser fetches the favicon as part of standard behavior, allowing an embedded loader script to extract and execute the payload entirely in memory using native JavaScript APIs and canvas pixel access. This creates a two-stage covert channel requiring no additional network or user requests. Testing across multiple browsers in both desktop and mobile environments confirms successful and silent execution of the embedded script. We evaluate the threat model, relate it to polymorphic phishing attacks that evade favicon-based detection, and analyze evasion of content security policies and antivirus scanners. We map nine example MITRE ATT&CK Framework objectives to single line JavaScript to execute arbitrarily in ICO files. Existing steganalysis and sanitization defenses are discussed, highlighting limitations in detecting or neutralizing alpha-channel exploits. The results demonstrate a stealthy and reusable attack surface that blurs traditional boundaries between static images and executable content. Because modern browsers report silent errors when developers specifically fail to load ICO files, this attack surface offers an interesting example of required web behaviors that in turn compromise security.
中文摘要:本文提出了一种利用ICO图标文件透明度通道隐藏可执行JavaScript代码的新方法,通过浏览器自动加载网站图标的行为创建隐蔽攻击通道,成功绕过了内容安全策略和反病毒检测等传统防护措施。
English Summary: This paper introduces a covert method of embedding executable JavaScript code within ICO favicon images using alpha channel steganography, creating a stealthy attack vector that bypasses traditional security measures by leveraging browsers' automatic favicon loading behavior.
Authors:Preslav Aleksandrov, Meghdad Kurmanji, Fernando Garcia Redondo, David O'Shea, William Shen, Alex Iacob, Lorenzo Sani, Xinchi Qiu, Nicola Cancedda, Nicholas D. Lane
Abstract:
We introduce the Autoregressive Block-Based Iterative Encoder (AbbIE), a novel recursive generalization of the encoder-only Transformer architecture, which achieves better perplexity than a standard Transformer and allows for the dynamic scaling of compute resources at test time. This simple, recursive approach is a complement to scaling large language model (LLM) performance through parameter and token counts. AbbIE performs its iterations in latent space, but unlike latent reasoning models, does not require a specialized dataset or training protocol. We show that AbbIE upward generalizes (ability to generalize to arbitrary iteration lengths) at test time by only using 2 iterations during train time, far outperforming alternative iterative methods. AbbIE's ability to scale its computational expenditure based on the complexity of the task gives it an up to \textbf{12\%} improvement in zero-shot in-context learning tasks versus other iterative and standard methods and up to 5\% improvement in language perplexity. The results from this study open a new avenue to Transformer performance scaling. We perform all of our evaluations on model sizes up to 350M parameters.
中文:AbbIE是一种递归Transformer编码器,在困惑度和零样本任务上优于标准模型,并能动态调整计算资源,为性能提升提供了无需额外训练的新途径。
English: AbbIE is a recursive Transformer encoder that outperforms standard models in perplexity and zero-shot tasks while enabling dynamic compute scaling, offering a new approach to performance enhancement without extra training requirements.
Authors:Liu He, Xiao Zeng, Yizhi Song, Albert Y. C. Chen, Lu Xia, Shashwat Verma, Sankalp Dayal, Min Sun, Cheng-Hao Kuo, Daniel Aliaga
Abstract:
Multimodal Large Language Models (MLLMs) struggle with accurately capturing camera-object relations, especially for object orientation, camera viewpoint, and camera shots. This stems from the fact that existing MLLMs are trained on images with limited diverse camera-object relations and corresponding textual descriptions. To address this, we propose a synthetic generation pipeline to create large-scale 3D visual instruction datasets. Our framework takes 3D assets as input and uses rendering and diffusion-based image generation models to create photorealistic images preserving precise camera-object relations. Additionally, large language models (LLMs) are used to generate text prompts for guiding visual instruction tuning and controlling image generation. We create Ultimate3D, a dataset of 240K VQAs with precise camera-object annotations, and corresponding benchmark. MLLMs fine-tuned on our proposed dataset outperform commercial models by a large margin, achieving an average accuracy improvement of 33.4% on camera-object relation recognition tasks. Our code, dataset, and benchmark will contribute to broad MLLM applications.
中文: 多模态大语言模型在捕捉相机与物体关系方面存在困难,但通过提出的合成生成流程构建的大规模3D视觉指令数据集Ultimate3D,使模型在关系识别任务中的准确率大幅提升了33.4%。
English: Multimodal Large Language Models face challenges in accurately capturing camera-object relations, but the proposed synthetic generation pipeline creates a large-scale 3D visual instruction dataset called Ultimate3D, which significantly improves model performance by 33.4% in recognition tasks.
Authors:Yongjian Zhang, Longguang Wang, Kunhong Li, Ye Zhang, Yun Wang, Liang Lin, Yulan Guo
Abstract:
This work presents PanMatch, a versatile foundation model for robust correspondence matching. Unlike previous methods that rely on task-specific architectures and domain-specific fine-tuning to support tasks like stereo matching, optical flow or feature matching, our key insight is that any two-frame correspondence matching task can be addressed within a 2D displacement estimation framework using the same model weights. Such a formulation eliminates the need for designing specialized unified architectures or task-specific ensemble models. Instead, it achieves multi-task integration by endowing displacement estimation algorithms with unprecedented generalization capabilities. To this end, we highlight the importance of a robust feature extractor applicable across multiple domains and tasks, and propose the feature transformation pipeline that leverage all-purpose features from Large Vision Models to endow matching baselines with zero-shot cross-view matching capabilities. Furthermore, we assemble a cross-domain dataset with near 1.8 million samples from stereo matching, optical flow, and feature matching domains to pretrain PanMatch. We demonstrate the versatility of PanMatch across a wide range of domains and downstream tasks using the same model weights. Our model outperforms UniMatch and Flow-Anything on cross-task evaluations, and achieves comparable performance to most state-of-the-art task-specific algorithms on task-oriented benchmarks. Additionally, PanMatch presents unprecedented zero-shot performance in abnormal scenarios, such as rainy day and satellite imagery, where most existing robust algorithms fail to yield meaningful results.
中文: PanMatch是一种多功能基础模型,通过统一的二维位移估计框架和固定模型权重实现跨任务的鲁棒对应匹配,在交叉任务评估中表现优异,并在恶劣场景下展现出前所未有的零样本适应能力。
English: PanMatch is a versatile foundation model that achieves robust correspondence matching across various tasks using a unified 2D displacement estimation framework with consistent model weights, demonstrating superior cross-task performance and unprecedented zero-shot capabilities in challenging scenarios.
Authors:Zeyang Sha, Hanling Tian, Zhuoer Xu, Shiwen Cui, Changhua Meng, Weiqiang Wang
Abstract:
The emergence of autonomous Large Language Model (LLM) agents capable of tool usage has introduced new safety risks that go beyond traditional conversational misuse. These agents, empowered to execute external functions, are vulnerable to both user-initiated threats (e.g., adversarial prompts) and tool-initiated threats (e.g., malicious outputs from compromised tools). In this paper, we propose the first unified safety-alignment framework for tool-using agents, enabling models to handle both channels of threat via structured reasoning and sandboxed reinforcement learning. We introduce a tri-modal taxonomy, including benign, malicious, and sensitive for both user prompts and tool responses, and define a policy-driven decision model. Our framework employs a custom-designed sandbox environment that simulates real-world tool execution and allows fine-grained reward shaping. Through extensive evaluations on public and self-built benchmarks, including Agent SafetyBench, InjecAgent, and BFCL, we demonstrate that our safety-aligned agents significantly improve resistance to security threats while preserving strong utility on benign tasks. Our results show that safety and effectiveness can be jointly optimized, laying the groundwork for trustworthy deployment of autonomous LLM agents.
中文: 本文提出首个针对工具使用型智能体的统一安全对齐框架,通过结构化推理和沙盒强化学习应对用户与工具的双重威胁,在保持任务效能的同时显著提升安全防护能力。
English: This paper introduces a unified safety-alignment framework for autonomous LLM agents that addresses dual threats from users and tools through structured reasoning and sandboxed reinforcement learning, significantly enhancing security while maintaining task performance.
Authors:Shuang Zhou, Wenya Xie, Jiaxi Li, Zaifu Zhan, Meijia Song, Han Yang, Cheyenna Espinoza, Lindsay Welton, Xinnie Mai, Yanwei Jin, Zidu Xu, Yuen-Hei Chung, Yiyun Xing, Meng-Han Tsai, Emma Schaffer, Yucheng Shi, Ninghao Liu, Zirui Liu, Rui Zhang
Abstract:
As large language models (LLMs) become increasingly integrated into clinical decision-making, ensuring transparent and trustworthy reasoning is essential. However, existing evaluation strategies of LLMs' medical reasoning capability either suffer from unsatisfactory assessment or poor scalability, and a rigorous benchmark remains lacking. To address this, we introduce MedThink-Bench, a benchmark designed for rigorous, explainable, and scalable assessment of LLMs' medical reasoning. MedThink-Bench comprises 500 challenging questions across ten medical domains, each annotated with expert-crafted step-by-step rationales. Building on this, we propose LLM-w-Ref, a novel evaluation framework that leverages fine-grained rationales and LLM-as-a-Judge mechanisms to assess intermediate reasoning with expert-level fidelity while maintaining scalability. Experiments show that LLM-w-Ref exhibits a strong positive correlation with expert judgments. Benchmarking twelve state-of-the-art LLMs, we find that smaller models (e.g., MedGemma-27B) can surpass larger proprietary counterparts (e.g., OpenAI-o3). Overall, MedThink-Bench offers a foundational tool for evaluating LLMs' medical reasoning, advancing their safe and responsible deployment in clinical practice.
中文: MedThink-Bench作为严谨的医学推理评估基准,通过500道跨领域难题和专家解析,结合LLM-w-Ref框架实现可扩展的高精度评估,实验证明较小模型可超越大型商用模型,为临床安全部署提供基础工具。
English: MedThink-Bench is a rigorous benchmark for evaluating large language models' medical reasoning through 500 challenging questions and expert rationales, supported by the LLM-w-Ref framework that ensures scalable, high-fidelity assessment and reveals smaller models can outperform larger ones.
Authors:Youngjoon Jang, Seongtae Hong, Junyoung Son, Sungjin Park, Chanjun Park, Heuiseok Lim
Abstract:
Retrieval-Augmented Generation (RAG) has emerged as a crucial framework in natural language processing (NLP), improving factual consistency and reducing hallucinations by integrating external document retrieval with large language models (LLMs). However, the effectiveness of RAG is often hindered by coreferential complexity in retrieved documents, introducing ambiguity that disrupts in-context learning. In this study, we systematically investigate how entity coreference affects both document retrieval and generative performance in RAG-based systems, focusing on retrieval relevance, contextual understanding, and overall response quality. We demonstrate that coreference resolution enhances retrieval effectiveness and improves question-answering (QA) performance. Through comparative analysis of different pooling strategies in retrieval tasks, we find that mean pooling demonstrates superior context capturing ability after applying coreference resolution. In QA tasks, we discover that smaller models benefit more from the disambiguation process, likely due to their limited inherent capacity for handling referential ambiguity. With these findings, this study aims to provide a deeper understanding of the challenges posed by coreferential complexity in RAG, providing guidance for improving retrieval and generation in knowledge-intensive AI applications.
中文摘要:本研究表明,在检索增强生成系统中,指代消解能显著提升检索效果和问答性能,尤其有利于处理指代歧义能力有限的小型模型。
English Summary: This study demonstrates that coreference resolution significantly enhances retrieval effectiveness and question-answering performance in Retrieval-Augmented Generation systems, particularly benefiting smaller models with limited capacity for handling referential ambiguity.
Authors:Binquan Zhang, Li Zhang, Zhiwen Luo, Yuxin Du, Fang Liu, Song Wang, Lin Shi
Abstract:
Large language models (LLMs) have demonstrated impressive performance in code generation, particularly when augmented with chain-of-thought (CoT) prompting techniques. They break down requirements into intermediate reasoning steps, which act as design rationales to guide LLMs in writing code like human programmers. Thus, the quality of these steps is crucial for ensuring the correctness and reliability of the generated code. However, little is known about the quality of CoT generated by LLMs. To what extent can we trust the thoughts generated by LLMs? How good are they? This paper empirically explores the external and internal factors of why LLMs generate unsatisfactory CoTs by analyzing 1,023 failed code samples on two widely used code generation benchmarks. We also evaluate their impact on code generation performance by analyzing 210 CoT-code pairs and refining the unsatisfied CoTs by prompting LLMs. Our study reveals three key findings: (1) External factors (53.60%), such as unclear requirements and lack of context, mainly affect CoT quality, while internal factors (40.10%) stem from LLMs' misunderstanding prompts. (2) Even when CoTs are correct, 18.5% of the generated code contains errors due to instruction-following issues; conversely, 11.90% of correct code is paired with flawed CoTs. (3) Refining low-quality CoTs is feasible, i.e., LLMs improve when given detailed problem descriptions. These findings highlight key challenges in CoT-based code generation and suggest directions for improving LLM reasoning and reliability.
中文摘要:大语言模型在代码生成中的思维链质量主要受外部需求不明确和内部模型误解影响,但通过细化问题描述的提示方式可有效改进推理过程与代码可靠性。
English Summary: Large language models' chain-of-thought reasoning for code generation faces quality issues primarily from unclear requirements and model misunderstandings, but refining these thoughts through detailed prompts can significantly improve performance.
Authors:Hao Tang, Ling Shao, Zhenyu Zhang, Luc Van Gool, Nicu Sebe
Abstract:
We propose a novel spatial-temporal graph Mamba (STG-Mamba) for the music-guided dance video synthesis task, i.e., to translate the input music to a dance video. STG-Mamba consists of two translation mappings: music-to-skeleton translation and skeleton-to-video translation. In the music-to-skeleton translation, we introduce a novel spatial-temporal graph Mamba (STGM) block to effectively construct skeleton sequences from the input music, capturing dependencies between joints in both the spatial and temporal dimensions. For the skeleton-to-video translation, we propose a novel self-supervised regularization network to translate the generated skeletons, along with a conditional image, into a dance video. Lastly, we collect a new skeleton-to-video translation dataset from the Internet, containing 54,944 video clips. Extensive experiments demonstrate that STG-Mamba achieves significantly better results than existing methods.
中文:提出的STG-Mamba模型通过双重转换过程从音乐合成舞蹈视频,采用创新的时空图Mamba模块生成骨架序列,并结合自监督网络进行视频渲染,在实验中显著优于现有方法。
English: The proposed STG-Mamba model synthesizes dance videos from music through a dual translation process, utilizing a novel spatial-temporal graph Mamba block for skeleton generation and a self-supervised network for video rendering, achieving superior performance over existing methods.
Authors:Zhiyu Tan, Hao Yang, Luozheng Qin, Jia Gong, Mengping Yang, Hao Li
Abstract:
Notable breakthroughs in unified understanding and generation modeling have led to remarkable advancements in image understanding, reasoning, production and editing, yet current foundational models predominantly focus on processing images, creating a gap in the development of unified models for video understanding and generation. This report presents Omni-Video, an efficient and effective unified framework for video understanding, generation, as well as instruction-based editing. Our key insight is to teach existing multimodal large language models (MLLMs) to produce continuous visual clues that are used as the input of diffusion decoders, which produce high-quality videos conditioned on these visual clues. To fully unlock the potential of our system for unified video modeling, we integrate several technical improvements: 1) a lightweight architectural design that respectively attaches a vision head on the top of MLLMs and a adapter before the input of diffusion decoders, the former produce visual tokens for the latter, which adapts these visual tokens to the conditional space of diffusion decoders; and 2) an efficient multi-stage training scheme that facilitates a fast connection between MLLMs and diffusion decoders with limited data and computational resources. We empirically demonstrate that our model exhibits satisfactory generalization abilities across video generation, editing and understanding tasks.
The abstract introduces Omni-Video, a unified framework that bridges the gap in video understanding and generation by enabling multimodal large language models to produce continuous visual clues for diffusion decoders, achieving high-quality video outputs with efficient architectural and training innovations.
English Summary:
Authors:Changfeng Gao, Zhihao Du, Shiliang Zhang
Abstract:
This paper proposes a novel Differentiable Reward Optimization (DiffRO) method aimed at enhancing the performance of neural codec language models based text-to-speech (TTS) systems. In contrast to conventional reinforcement learning from human feedback (RLHF) approaches applied to TTS, DiffRO directly compute the rewards based on neural codec tokens, rather than relying on synthesized audio. Furthermore, we employ the Gumbel-Softmax technique to render the reward function differentiable, thereby streamlining the RLHF training process. Additionally, we introduce a multi-task reward (MTR) model which can provide feedback from different perspectives and find that it can augment the system's capability to follow instructions effectively.Experimental results indicate that DiffRO significantly improves the pronunciation accuracy of the TTS system, achieving state-of-the-art (SOTA) WER results on the seed-tts-eval benchmark. Moreover, with the integration of the MTR model, we demonstrate the ability to control emotional and quality attributes in a zero-shot manner.
中文: 本文提出DiffRO方法,通过直接从神经编解码器令牌计算可微分奖励并引入多任务奖励模型,显著提升了语音合成系统的发音准确性,实现了在零样本条件下对情感与质量属性的有效控制。
English: This paper introduces DiffRO, a novel method that optimizes neural codec-based TTS systems by computing differentiable rewards directly from tokens and incorporating a multi-task reward model, significantly improving pronunciation accuracy and enabling zero-shot control over emotional and quality attributes.
Authors:Zhiwei Chen, Yupeng Hu, Zixu Li, Zhiheng Fu, Xuemeng Song, Liqiang Nie
Abstract:
Composed Image Retrieval (CIR) represents a novel retrieval paradigm that is capable of expressing users' intricate retrieval requirements flexibly. It enables the user to give a multimodal query, comprising a reference image and a modification text, and subsequently retrieve the target image. Notwithstanding the considerable advances made by prevailing methodologies, CIR remains in its nascent stages due to two limitations: 1) inhomogeneity between dominant and noisy portions in visual data is ignored, leading to query feature degradation, and 2) the priority of textual data in the image modification process is overlooked, which leads to a visual focus bias. To address these two limitations, this work presents a focus mapping-based feature extractor, which consists of two modules: dominant portion segmentation and dual focus mapping. It is designed to identify significant dominant portions in images and guide the extraction of visual and textual data features, thereby reducing the impact of noise interference. Subsequently, we propose a textually guided focus revision module, which can utilize the modification requirements implied in the text to perform adaptive focus revision on the reference image, thereby enhancing the perception of the modification focus on the composed features. The aforementioned modules collectively constitute the segmentatiOn-based Focus shiFt reviSion nETwork (\mbox{OFFSET}), and comprehensive experiments on four benchmark datasets substantiate the superiority of our proposed method. The codes and data are available on https://zivchen-ty.github.io/OFFSET.github.io/
中文: 本文提出OFFSET网络,通过焦点映射识别图像主体区域并利用文本引导的焦点修正来增强组合特征感知,有效解决了组合图像检索中的特征退化与视觉焦点偏差问题,在多个基准数据集上验证了其优越性。
English: The paper introduces OFFSET, a novel network that addresses key limitations in Composed Image Retrieval by using focus mapping to identify dominant image portions and text-guided revision to enhance modification focus, demonstrating superior performance across multiple datasets.
Authors:Shiting Xiao, Rishabh Kabra, Yuhang Li, Donghyun Lee, Joao Carreira, Priyadarshini Panda
Abstract:
The ability to segment objects based on open-ended language prompts remains a critical challenge, requiring models to ground textual semantics into precise spatial masks while handling diverse and unseen categories. We present OpenWorldSAM, a framework that extends the prompt-driven Segment Anything Model v2 (SAM2) to open-vocabulary scenarios by integrating multi-modal embeddings extracted from a lightweight vision-language model (VLM). Our approach is guided by four key principles: i) Unified prompting: OpenWorldSAM supports a diverse range of prompts, including category-level and sentence-level language descriptions, providing a flexible interface for various segmentation tasks. ii) Efficiency: By freezing the pre-trained components of SAM2 and the VLM, we train only 4.5 million parameters on the COCO-stuff dataset, achieving remarkable resource efficiency. iii) Instance Awareness: We enhance the model's spatial understanding through novel positional tie-breaker embeddings and cross-attention layers, enabling effective segmentation of multiple instances. iv) Generalization: OpenWorldSAM exhibits strong zero-shot capabilities, generalizing well on unseen categories and an open vocabulary of concepts without additional training. Extensive experiments demonstrate that OpenWorldSAM achieves state-of-the-art performance in open-vocabulary semantic, instance, and panoptic segmentation across multiple benchmarks, including ADE20k, PASCAL, ScanNet, and SUN-RGBD.
中文摘要:OpenWorldSAM通过集成轻量级视觉语言模型扩展了SAM2,利用统一提示、高效训练、实例感知和强大泛化能力实现了开放词汇分割,在多个基准测试中达到了最先进的性能。
English Summary: OpenWorldSAM extends SAM2 with a lightweight vision-language model to enable open-vocabulary segmentation through unified prompting, efficient training, instance awareness, and strong generalization, achieving state-of-the-art performance across multiple benchmarks.
Authors:Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, CÃan Hughes, Charles Lau, Justin Chen, Fereshteh Mahvar, Liron Yatziv, Tiffany Chen, Bram Sterling, Stefanie Anna Baby, Susanna Maria Baby, Jeremy Lai, Samuel Schmidgall, Lu Yang, Kejia Chen, Per Bjornsson, Shashir Reddy, Ryan Brush, Kenneth Philbrick, Mercy Asiedu, Ines Mezerreg, Howard Hu, Howard Yang, Richa Tiwari, Sunny Jansen, Preeti Singh, Yun Liu, Shekoofeh Azizi, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Riviere, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Elena Buchatskaya, Jean-Baptiste Alayrac, Dmitry Lepikhin, Vlad Feinberg, Sebastian Borgeaud, Alek Andreev, Cassidy Hardin, Robert Dadashi, Léonard Hussenot, Armand Joulin, Olivier Bachem, Yossi Matias, Katherine Chou, Avinatan Hassidim, Kavi Goel, Clement Farabet, Joelle Barral, Tris Warkentin, Jonathon Shlens, David Fleet, Victor Cotruta, Omar Sanseviero, Gus Martins, Phoebe Kirk, Anand Rao, Shravya Shetty, David F. Steiner, Can Kirmizibayrak, Rory Pilgrim, Daniel Golden, Lin Yang
Abstract:
Artificial intelligence (AI) has significant potential in healthcare applications, but its training and deployment faces challenges due to healthcare's diverse data, complex tasks, and the need to preserve privacy. Foundation models that perform well on medical tasks and require less task-specific tuning data are critical to accelerate the development of healthcare AI applications. We introduce MedGemma, a collection of medical vision-language foundation models based on Gemma 3 4B and 27B. MedGemma demonstrates advanced medical understanding and reasoning on images and text, significantly exceeding the performance of similar-sized generative models and approaching the performance of task-specific models, while maintaining the general capabilities of the Gemma 3 base models. For out-of-distribution tasks, MedGemma achieves 2.6-10% improvement on medical multimodal question answering, 15.5-18.1% improvement on chest X-ray finding classification, and 10.8% improvement on agentic evaluations compared to the base models. Fine-tuning MedGemma further improves performance in subdomains, reducing errors in electronic health record information retrieval by 50% and reaching comparable performance to existing specialized state-of-the-art methods for pneumothorax classification and histopathology patch classification. We additionally introduce MedSigLIP, a medically-tuned vision encoder derived from SigLIP. MedSigLIP powers the visual understanding capabilities of MedGemma and as an encoder achieves comparable or better performance than specialized medical image encoders. Taken together, the MedGemma collection provides a strong foundation of medical image and text capabilities, with potential to significantly accelerate medical research and development of downstream applications. The MedGemma collection, including tutorials and model weights, can be found at https://goo.gle/medgemma.
中文:MedGemma作为医疗视觉语言基础模型,在医学理解和推理方面展现出卓越性能,显著提升了医疗AI应用水平,同时保持了通用能力。
English: MedGemma is a medical vision-language foundation model that demonstrates superior performance in medical understanding and reasoning, significantly advancing healthcare AI applications while maintaining general capabilities.
Authors:Konstantinos Foteinos, Jorgen Cani, Manousos Linardakis, Panagiotis Radoglou-Grammatikis, Vasileios Argyriou, Panagiotis Sarigiannidis, Iraklis Varlamis, Georgios Th. Papadopoulos
Abstract:
The rapid evolution of deep learning (DL) models and the ever-increasing size of available datasets have raised the interest of the research community in the always important field of vision-based hand gesture recognition (VHGR), and delivered a wide range of applications, such as sign language understanding and human-computer interaction using cameras. Despite the large volume of research works in the field, a structured and complete survey on VHGR is still missing, leaving researchers to navigate through hundreds of papers in order to find the right combination of data, model, and approach for each task. The current survey aims to fill this gap by presenting a comprehensive overview of this aspect of computer vision. With a systematic research methodology that identifies the state-of-the-art works and a structured presentation of the various methods, datasets, and evaluation metrics, this review aims to constitute a useful guideline for researchers, helping them to choose the right strategy for delving into a certain VHGR task. Starting with the methodology used for study selection, literature retrieval, and the analytical framing, the survey identifies and organizes key VHGR approaches using a taxonomy-based format in various dimensions such as input modality and application domain. The core of the survey provides an in-depth analysis of state-of-the-art techniques across three primary VHGR tasks: static gesture recognition, isolated dynamic gestures and continuous gesture recognition. For each task, the architectural trends and learning strategies are listed. Additionally, the study reviews commonly used datasets - emphasizing on annotation schemes - and evaluates standard performance metrics. It concludes by identifying major challenges in VHGR, including both general computer vision issues and domain-specific obstacles, and outlines promising directions for future research.
中文摘要:本综述系统梳理了基于视觉的手势识别领域,通过分类整理前沿方法、数据集和评估指标,为研究人员选择合适策略提供指导,并指出了当前挑战与未来研究方向。
English Summary: This survey provides a comprehensive overview of vision-based hand gesture recognition, systematically organizing state-of-the-art methods, datasets, and metrics to guide researchers in selecting appropriate approaches for various tasks while identifying current challenges and future directions.
Authors:Giuseppe Silano, Daniel Bonilla Licea, Hajar El Hammouti, Martin Saska
Abstract:
This paper introduces a Nonlinear Model Predictive Control (NMPC) framework for communication-aware motion planning of Multi-Rotor Aerial Vehicles (MRAVs) using Free-Space Optical (FSO) links. The scenario involves MRAVs equipped with body-fixed optical transmitters and Unmanned Ground Vehicles (UGVs) acting as mobile relays, each outfitted with fixed conical Field-of-View (FoV) receivers. The controller integrates optical connectivity constraints into the NMPC formulation to ensure beam alignment and minimum link quality, while also enabling UGV tracking and obstacle avoidance. The method supports both coplanar and tilted MRAV configurations. MATLAB simulations demonstrate its feasibility and effectiveness.
中文: 本文提出了一种非线性模型预测控制框架,用于采用自由空间光通信的多旋翼飞行器的通信感知运动规划,该框架集成了光束对准和避障的连接约束,并支持多种飞行器配置。
English: This paper presents a Nonlinear Model Predictive Control framework for communication-aware motion planning of multi-rotor aerial vehicles using free-space optical links, integrating connectivity constraints for beam alignment and obstacle avoidance while supporting various vehicle configurations.
Authors:Tianyao He, Runqi Wang, Yang Chen, Dejia Song, Nemo Chen, Xu Tang, Yao Hu
Abstract:
Text-driven portrait editing holds significant potential for various applications but also presents considerable challenges. An ideal text-driven portrait editing approach should achieve precise localization and appropriate content modification, yet existing methods struggle to balance reconstruction fidelity and editing flexibility. To address this issue, we propose Flux-Sculptor, a flux-based framework designed for precise text-driven portrait editing. Our framework introduces a Prompt-Aligned Spatial Locator (PASL) to accurately identify relevant editing regions and a Structure-to-Detail Edit Control (S2D-EC) strategy to spatially guide the denoising process through sequential mask-guided fusion of latent representations and attention values. Extensive experiments demonstrate that Flux-Sculptor surpasses existing methods in rich-attribute editing and facial information preservation, making it a strong candidate for practical portrait editing applications. Project page is available at https://flux-sculptor.github.io/.
中文: Flux-Sculptor是一种基于通量的创新框架,通过提示对齐空间定位器和结构到细节编辑控制策略,实现了精准的文本驱动人像编辑,在编辑灵活性和面部信息保留方面表现卓越。
English: Flux-Sculptor is a novel framework that achieves precise text-driven portrait editing through a Prompt-Aligned Spatial Locator and a Structure-to-Detail Edit Control strategy, excelling in both editing flexibility and facial preservation.
Authors:Nuo Chen, Pu Yan, Jia Li, Qixuan Zhao
Abstract:
Since its viral emergence in early 2024, Comment Robert-a Weibo-launched social chatbot-has gained widespread attention on the Chinese Internet for its unsolicited and unpredictable comments on user posts. Unlike conventional chatbots that respond only to user prompts, Robert autonomously intervenes in public discourse, representing a novel form of AI-driven social media engagement. This study examines how such autonomous, algorithmic communication reshapes human-AI interaction in everyday online contexts. Using computational linguistics techniques, including topic classification and sentiment analysis, we analyze over 3,900 user-submitted interactions from the "Robert Victims Alliance", a grassroots community documenting their exchanges with the chatbot. Topic modeling reveals six key themes: interpersonal relationships, self-identity, academic and career concerns, subcultures, sensitive topics, and social events. Complementing this, mixed-methods emotional analysis uncovers a complex affective spectrum: Robert's casual remarks can evoke warmth and humor but may also conceal covert hostility beneath neutral or polite language. These ambivalent interactions reveal an emerging emotional divide between humans and socially proactive AI, suggesting that while Robert simulates social presence, it often falls short of users' emotional needs. Our study contributes to human-AI interaction research by offering new insights into the affective dynamics and socio-technical implications of unsolicited AI bots' participation in digital public spheres.
中文:本研究通过计算语言学分析微博聊天机器人Comment Robert的主动评论行为,发现其虽能通过多样主题和复杂情感模拟社交存在,但常无法满足用户情感需求,揭示了数字公共领域中人与主动式AI间日益显现的情感鸿沟。
English: The study analyzes the Comment Robert chatbot's unsolicited interventions on Weibo, revealing through computational linguistics that while it simulates social presence through varied topics and mixed emotions, it often fails to meet users' emotional needs, highlighting an emerging human-AI affective divide in digital public spheres.
Authors:Luca A. Lanzendörfer, Florian Grötschla, Karim Galal, Roger Wattenhofer
Abstract:
We present MaskBeat, a transformer-based approach for loopable drum pattern generation. Rather than predicting drum hits sequentially, our method uses bidirectional attention with iterative refinement, allowing instruments to be generated in parallel while maintaining musical coherence. Additionally, we introduce custom loss functions that capture drum-specific musical relationships. Our experiments show that MaskBeat generates higher quality and more musically coherent drum patterns than baseline approaches.
中文: MaskBeat是一种基于Transformer的方法,通过双向注意力和定制损失函数生成可循环的鼓点模式,相比基线方法能产生质量更高、音乐连贯性更强的节奏。
English: MaskBeat is a transformer-based method that generates loopable drum patterns using bidirectional attention and custom loss functions, producing higher quality and more coherent results than baselines.
Authors:Wanru Zhao, Yihong Chen, Royson Lee, Xinchi Qiu, Yan Gao, Hongxiang Fan, Nicholas D. Lane
Abstract:
Pre-trained large language models (LLMs) have become a cornerstone of modern natural language processing, with their capabilities extending across a wide range of applications and languages. However, the fine-tuning of multilingual LLMs, especially for low-resource languages, faces significant challenges arising from data-sharing restrictions (the physical border) and inherent linguistic differences (the linguistic border). These barriers hinder users of various languages, particularly those in low-resource regions, from fully benefiting from the advantages of LLMs. To address these challenges, we propose the Federated Prompt Tuning Paradigm for multilingual scenarios, which utilizes parameter-efficient fine-tuning while adhering to data sharing restrictions. We design a comprehensive set of experiments and analyze them using a novel notion of language distance to highlight the strengths of our paradigm: Even under computational constraints, our method not only improves data efficiency but also facilitates mutual enhancements across languages, particularly benefiting low-resource ones. Compared to traditional local cross-lingual transfer tuning methods, our approach achieves 6.9\% higher accuracy with improved data efficiency, and demonstrates greater stability and generalization. These findings underscore the potential of our approach to promote social equality and champion linguistic diversity, ensuring that no language is left behind.
中文: 联邦提示调优范式通过提升数据效率和跨语言协作,解决了多语言大语言模型调优的难题,在低资源语言上实现了6.9%的准确率提升和更强的泛化能力。
English: The Federated Prompt Tuning Paradigm addresses multilingual LLM fine-tuning challenges by enhancing data efficiency and cross-lingual collaboration, achieving 6.9% higher accuracy and better generalization for low-resource languages.
Authors:Yuxuan Zhu, Tengjun Jin, Yada Pruksachatkun, Andy Zhang, Shu Liu, Sasha Cui, Sayash Kapoor, Shayne Longpre, Kevin Meng, Rebecca Weiss, Fazl Barez, Rahul Gupta, Jwala Dhamala, Jacob Merizian, Mario Giulianelli, Harry Coppock, Cozmin Ududec, Jasjeet Sekhon, Jacob Steinhardt, Antony Kellermann, Sarah Schwettmann, Matei Zaharia, Ion Stoica, Percy Liang, Daniel Kang
Abstract:
Benchmarks are essential for quantitatively tracking progress in AI. As AI agents become increasingly capable, researchers and practitioners have introduced agentic benchmarks to evaluate agents on complex, real-world tasks. These benchmarks typically measure agent capabilities by evaluating task outcomes via specific reward designs. However, we show that many agentic benchmarks have issues in task setup or reward design. For example, SWE-bench Verified uses insufficient test cases, while TAU-bench counts empty responses as successful. Such issues can lead to under- or overestimation of agents' performance by up to 100% in relative terms. To make agentic evaluation rigorous, we introduce the Agentic Benchmark Checklist (ABC), a set of guidelines that we synthesized from our benchmark-building experience, a survey of best practices, and previously reported issues. When applied to CVE-Bench, a benchmark with a particularly complex evaluation design, ABC reduces the performance overestimation by 33%.
中文: 许多智能体基准存在任务设置或奖励设计的缺陷,可能导致AI性能评估偏差高达100%,为此我们提出了智能体基准检查表(ABC),通过规范评估设计提升严谨性,如在CVE-Bench中应用该检查表使性能高估降低了33%。
English: Many agentic benchmarks suffer from flawed task setups or reward designs that can misrepresent AI performance by up to 100%, prompting the introduction of the Agentic Benchmark Checklist (ABC) to enhance evaluation rigor, as demonstrated by its 33% reduction in performance overestimation in CVE-Bench.
Authors:Zhaoyu Zhang, Lingyi Wang, Wei Wu, Fuhui Zhou, Qihui Wu
Abstract:
Data-driven semantic communication is based on superficial statistical patterns, thereby lacking interpretability and generalization, especially for applications with the presence of unseen data. To address these challenges, we propose a novel knowledge graph-enhanced zero-shot semantic communication (KGZS-SC) network. Guided by the structured semantic information from a knowledge graph-based semantic knowledge base (KG-SKB), our scheme provides generalized semantic representations and enables reasoning for unseen cases. Specifically, the KG-SKB aligns the semantic features in a shared category semantics embedding space and enhances the generalization ability of the transmitter through aligned semantic features, thus reducing communication overhead by selectively transmitting compact visual semantics. At the receiver, zero-shot learning (ZSL) is leveraged to enable direct classification for unseen cases without the demand for retraining or additional computational overhead, thereby enhancing the adaptability and efficiency of the classification process in dynamic or resource-constrained environments. The simulation results conducted on the APY datasets show that the proposed KGZS-SC network exhibits robust generalization and significantly outperforms existing SC frameworks in classifying unseen categories across a range of SNR levels.
中文: 提出的知识图谱增强零样本语义通信网络利用结构化语义信息,提升了对未知数据的泛化与推理能力,在不同信噪比水平下对未见类别的分类性能显著优于现有框架。
English: The proposed knowledge graph-enhanced zero-shot semantic communication network leverages structured semantic information to improve generalization and reasoning for unseen data, significantly outperforming existing frameworks in classification tasks across various SNR levels.
Authors:Florian Grötschla, Luca A. Lanzendörfer, Longxiang Jiao, Roger Wattenhofer
Abstract:
We introduce PANAMA, an active learning framework for the training of end-to-end parametric guitar amp models using a WaveNet-like architecture. With \model, one can create a virtual amp by recording samples that are determined by an active learning strategy to use a minimum amount of datapoints (i.e., amp knob settings). We show that gradient-based optimization algorithms can be used to determine the optimal datapoints to sample, and that the approach helps under a constrained number of samples.
中文: PANAMA是一种主动学习框架,采用类似WaveNet的架构训练参数化吉他放大器模型,通过基于梯度的优化算法选择最佳采样点,显著减少了所需数据量。
English: PANAMA is an active learning framework that efficiently trains parametric guitar amp models using a WaveNet-like architecture, minimizing data requirements through gradient-based optimization for optimal sample selection.
Authors:Hao Liu, Bo Yang, Zhiwen Yu, Xuelin Cao, George C. Alexandropoulos, Yan Zhang, Chau Yuen
Abstract:
The Internet of Vehicles (IoV) transforms the transportation ecosystem promising pervasive connectivity and data-driven approaches. Deep learning and generative Artificial Intelligence (AI) have the potential to significantly enhance the operation of applications within IoV by facilitating efficient decision-making and predictive capabilities, including intelligent navigation, vehicle safety monitoring, accident prevention, and intelligent traffic management. Nevertheless, efficiently transmitting and processing the massive volumes of data generated by the IoV in real-time remains a significant challenge, particularly in dynamic and unpredictable wireless channel conditions. To address these challenges, this paper proposes a semantic communication framework based on channel perception to improve the accuracy and efficiency of data transmission. The semantic communication model extracts and compresses the information to be transmitted. In addition, the wireless channel is estimated by using a generative diffusion model, which is employed to predict the dynamic channel states, thereby improving the quality of IoV service. In dynamic scenarios, however, the channel estimation performance may be degraded when substantially new scenarios take place, which will adversely affect user experience. To mitigate this limitation, we employ a large model to fine-tune the channel generation model to enhance its adaptability for varying scenarios. The performance and reliability of the proposed framework are evaluated on the two public datasets.
中文摘要:本文提出了一种基于信道感知的语义通信框架,通过生成扩散模型进行信道估计并利用大模型微调,以提高车联网动态场景下的数据传输准确性和适应性。
English Summary: This paper introduces a channel-aware semantic communication framework for the Internet of Vehicles (IoV), utilizing generative diffusion models for channel estimation and large models for fine-tuning to enhance data transmission accuracy and adaptability across dynamic scenarios.
Authors:Wenbo Yu, Anirbit Ghosh, Tobias Sebastian Finn, Rossella Arcucci, Marc Bocquet, Sibo Cheng
Abstract:
Thanks to recent advances in generative AI, computers can now simulate realistic and complex natural processes. We apply this capability to predict how wildfires spread, a task made difficult by the unpredictable nature of fire and the variety of environmental conditions it depends on. In this study, We present the first denoising diffusion model for predicting wildfire spread, a new kind of AI framework that learns to simulate fires not just as one fixed outcome, but as a range of possible scenarios. By doing so, it accounts for the inherent uncertainty of wildfire dynamics, a feature that traditional models typically fail to represent. Unlike deterministic approaches that generate a single prediction, our model produces ensembles of forecasts that reflect physically meaningful distributions of where fire might go next. This technology could help us develop smarter, faster, and more reliable tools for anticipating wildfire behavior, aiding decision-makers in fire risk assessment and response planning.
中文: 本研究首次提出用于野火蔓延预测的去噪扩散模型,通过模拟多种可能情景来捕捉火灾动态的内在不确定性,生成能反映物理合理分布的集合预报,从而提升风险评估和响应决策的可靠性。
English: This study introduces the first denoising diffusion model for wildfire spread prediction, which simulates a range of possible scenarios to capture inherent uncertainties, producing ensemble forecasts that improve risk assessment and response planning.
Authors:Ziyang Chen, Huimu Yu, Xing Wu, Dongqin Liu, Songlin Hu
Abstract:
Large language models (LLMs) excel in text understanding and generation but raise significant safety and ethical concerns in high-stakes applications. To mitigate these risks, we present Libra-Guard, a cutting-edge safeguard system designed to enhance the safety of Chinese-based LLMs. Leveraging a two-stage curriculum training pipeline, Libra-Guard enhances data efficiency by employing guard pretraining on synthetic samples, followed by fine-tuning on high-quality, real-world data, thereby significantly reducing reliance on manual annotations. To enable rigorous safety evaluations, we also introduce Libra-Test, the first benchmark specifically designed to evaluate the effectiveness of safeguard systems for Chinese content. It covers seven critical harm scenarios and includes over 5,700 samples annotated by domain experts. Experiments show that Libra-Guard achieves 86.79% accuracy, outperforming Qwen2.5-14B-Instruct (74.33%) and ShieldLM-Qwen-14B-Chat (65.69%), and nearing closed-source models like Claude-3.5-Sonnet and GPT-4o. These contributions establish a robust framework for advancing the safety governance of Chinese LLMs and represent a tentative step toward developing safer, more reliable Chinese AI systems.
中文: Libra-Guard是一种先进的防护系统,通过两阶段训练流程提升中文大模型的安全性并实现高准确率,而Libra-Test则是首个专门评估中文防护系统效能的基准测试。
English: Libra-Guard is an advanced safeguard system that enhances the safety of Chinese LLMs through a two-stage training pipeline and achieves high accuracy, while Libra-Test serves as the first benchmark for evaluating such systems in Chinese contexts.
Authors:Rongyao Cai, Ming Jin, Qingsong Wen, Kexin Zhang
Abstract:
Domain shift poses a fundamental challenge in time series analysis, where models trained on source domain often fail dramatically when applied in target domain with different yet similar distributions. While current unsupervised domain adaptation (UDA) methods attempt to align cross-domain feature distributions, they typically treat features as indivisible entities, ignoring their intrinsic compositions that govern domain adaptation. We introduce DARSD, a novel UDA framework with theoretical explainability that explicitly realizes UDA tasks from the perspective of representation space decomposition. Our core insight is that effective domain adaptation requires not just alignment, but principled disentanglement of transferable knowledge from mixed representations. DARSD consists of three synergistic components: (I) An adversarial learnable common invariant basis that projects original features into a domain-invariant subspace while preserving semantic content; (II) A prototypical pseudo-labeling mechanism that dynamically separates target features based on confidence, hindering error accumulation; (III) A hybrid contrastive optimization strategy that simultaneously enforces feature clustering and consistency while mitigating emerging distribution gaps. Comprehensive experiments conducted on four benchmarks (WISDM, HAR, HHAR, and MFD) demonstrate DARSD's superiority against 12 UDA algorithms, achieving optimal performance in 35 out of 53 scenarios and ranking first across all benchmarks.
中文摘要:本文提出DARSD框架,通过表示空间分解视角实现无监督领域自适应,利用对抗学习、原型伪标记和混合对比优化来分离可迁移知识,在四个基准测试中显著优于现有方法。
English Summary: The paper introduces DARSD, a novel unsupervised domain adaptation framework that addresses domain shift in time series by decomposing representations into transferable components through adversarial learning, prototypical pseudo-labeling, and contrastive optimization, achieving state-of-the-art performance across multiple benchmarks.
Authors:Xinyi Li, Zongyi Li, Nikola Kovachki, Anima Anandkumar
Abstract:
We propose integrating optimal transport (OT) into operator learning for partial differential equations (PDEs) on complex geometries. Classical geometric learning methods typically represent domains as meshes, graphs, or point clouds. Our approach generalizes discretized meshes to mesh density functions, formulating geometry embedding as an OT problem that maps these functions to a uniform density in a reference space. Compared to previous methods relying on interpolation or shared deformation, our OT-based method employs instance-dependent deformation, offering enhanced flexibility and effectiveness. For 3D simulations focused on surfaces, our OT-based neural operator embeds the surface geometry into a 2D parameterized latent space. By performing computations directly on this 2D representation of the surface manifold, it achieves significant computational efficiency gains compared to volumetric simulation. Experiments with Reynolds-averaged Navier-Stokes equations (RANS) on the ShapeNet-Car and DrivAerNet-Car datasets show that our method achieves better accuracy and also reduces computational expenses in terms of both time and memory usage compared to existing machine learning models. Additionally, our model demonstrates significantly improved accuracy on the FlowBench dataset, underscoring the benefits of employing instance-dependent deformation for datasets with highly variable geometries.
中文: 本研究提出了一种基于最优传输的算子学习方法,用于复杂几何上的偏微分方程,通过实例依赖变形将网格密度函数映射到均匀参考空间,在三维模拟中显著提升了精度和计算效率。
English: This study introduces an optimal transport-based operator learning method for PDEs on complex geometries, which maps mesh density functions to a uniform reference space via instance-dependent deformation, improving accuracy and computational efficiency in 3D simulations.
Authors:Ziyin Xiong, Yinghan Chen, Puhao Li, Yixin Zhu, Tengyu Liu, Siyuan Huang
Abstract:
Bimanual manipulation, fundamental to human daily activities, remains a challenging task due to its inherent complexity of coordinated control. Recent advances have enabled zero-shot learning of single-arm manipulation skills through agent-agnostic visual representations derived from human videos; however, these methods overlook crucial agent-specific information necessary for bimanual coordination, such as end-effector positions. We propose Ag2x2, a computational framework for bimanual manipulation through coordination-aware visual representations that jointly encode object states and hand motion patterns while maintaining agent-agnosticism. Extensive experiments demonstrate that Ag2x2 achieves a 73.5% success rate across 13 diverse bimanual tasks from Bi-DexHands and PerAct2, including challenging scenarios with deformable objects like ropes. This performance outperforms baseline methods and even surpasses the success rate of policies trained with expert-engineered rewards. Furthermore, we show that representations learned through Ag2x2 can be effectively leveraged for imitation learning, establishing a scalable pipeline for skill acquisition without expert supervision. By maintaining robust performance across diverse tasks without human demonstrations or engineered rewards, Ag2x2 represents a step toward scalable learning of complex bimanual robotic skills.
中文: Ag2x2提出了一种协调感知的视觉表征框架,在无需专家监督或人工设计奖励的情况下,在多种双手操作任务中实现了卓越性能,标志着向可扩展机器人技能学习迈出了重要一步。
English: Ag2x2 introduces a coordination-aware visual representation framework that achieves superior performance in diverse bimanual manipulation tasks without requiring expert supervision or engineered rewards, marking progress toward scalable robotic skill acquisition.
Authors:Yue Wu, Xiaolan Chen, Weiyi Zhang, Shunming Liu, Wing Man Rita Sum, Xinyuan Wu, Xianwen Shang, Chea-su Kee, Mingguang He, Danli Shi
Abstract:
Large language models (LLMs) show promise for tailored healthcare communication but face challenges in interpretability and multi-task integration particularly for domain-specific needs like myopia, and their real-world effectiveness as patient education tools has yet to be demonstrated. Here, we introduce ChatMyopia, an LLM-based AI agent designed to address text and image-based inquiries related to myopia. To achieve this, ChatMyopia integrates an image classification tool and a retrieval-augmented knowledge base built from literature, expert consensus, and clinical guidelines. Myopic maculopathy grading task, single question examination and human evaluations validated its ability to deliver personalized, accurate, and safe responses to myopia-related inquiries with high scalability and interpretability. In a randomized controlled trial (n=70, NCT06607822), ChatMyopia significantly improved patient satisfaction compared to traditional leaflets, enhancing patient education in accuracy, empathy, disease awareness, and patient-eyecare practitioner communication. These findings highlight ChatMyopia's potential as a valuable supplement to enhance patient education and improve satisfaction with medical services in primary eye care settings.
Chinese: ChatMyopia作为基于大语言模型的AI助手,通过整合图像分类工具和知识库,在近视相关咨询中提供个性化精准回复,显著提升了患者教育效果和眼科护理满意度,优于传统宣教方式。
English: ChatMyopia, an AI agent using large language models, effectively enhances patient education by providing personalized and accurate responses to myopia-related inquiries, significantly improving satisfaction and communication in eye care compared to traditional methods.
Authors:Niklas Bubeck, Yundi Zhang, Suprosanna Shit, Daniel Rueckert, Jiazhen Pan
Abstract:
In medical imaging, generative models are increasingly relied upon for two distinct but equally critical tasks: reconstruction, where the goal is to restore medical imaging (usually inverse problems like inpainting or superresolution), and generation, where synthetic data is created to augment datasets or carry out counterfactual analysis. Despite shared architecture and learning frameworks, they prioritize different goals: generation seeks high perceptual quality and diversity, while reconstruction focuses on data fidelity and faithfulness. In this work, we introduce a "generative model zoo" and systematically analyze how modern latent diffusion models and autoregressive models navigate the reconstruction-generation spectrum. We benchmark a suite of generative models across representative cardiac medical imaging tasks, focusing on image inpainting with varying masking ratios and sampling strategies, as well as unconditional image generation. Our findings show that diffusion models offer superior perceptual quality for unconditional generation but tend to hallucinate as masking ratios increase, whereas autoregressive models maintain stable perceptual performance across masking levels, albeit with generally lower fidelity.
Chinese Summary: 医学影像中生成模型承担重建与生成双重任务,扩散模型在无条件生成上感知质量更优但高掩蔽下易产生幻觉,而自回归模型虽保真度较低却能在不同掩蔽水平下保持稳定的感知性能。
English Summary: Generative models in medical imaging serve dual purposes of reconstruction and generation, with diffusion models excelling in perceptual quality for generation but prone to hallucination under high masking, while autoregressive models offer stable performance across masking levels at the cost of fidelity.
Authors:Xingyu Miao, Haoran Duan, Quanhao Qian, Jiuniu Wang, Yang Long, Ling Shao, Deli Zhao, Ran Xu, Gongjie Zhang
Abstract:
Spatial intelligence is emerging as a transformative frontier in AI, yet it remains constrained by the scarcity of large-scale 3D datasets. Unlike the abundant 2D imagery, acquiring 3D data typically requires specialized sensors and laborious annotation. In this work, we present a scalable pipeline that converts single-view images into comprehensive, scale- and appearance-realistic 3D representations - including point clouds, camera poses, depth maps, and pseudo-RGBD - via integrated depth estimation, camera calibration, and scale calibration. Our method bridges the gap between the vast repository of imagery and the increasing demand for spatial scene understanding. By automatically generating authentic, scale-aware 3D data from images, we significantly reduce data collection costs and open new avenues for advancing spatial intelligence. We release two generated spatial datasets, i.e., COCO-3D and Objects365-v2-3D, and demonstrate through extensive experiments that our generated data can benefit various 3D tasks, ranging from fundamental perception to MLLM-based reasoning. These results validate our pipeline as an effective solution for developing AI systems capable of perceiving, understanding, and interacting with physical environments.
中文: 本研究提出了一种可扩展的流程,将单视角图像转换为逼真的三维表征,通过自动生成真实数据集解决三维数据稀缺问题,为推进人工智能的空间智能开辟了新途径。
English: This work introduces a scalable pipeline that converts single-view images into realistic 3D representations, addressing the scarcity of large-scale 3D data by automatically generating authentic datasets to advance spatial intelligence in AI.
Authors:Grace Su, Sheng-Yu Wang, Aaron Hertzmann, Eli Shechtman, Jun-Yan Zhu, Richard Zhang
Abstract:
A common and controversial use of text-to-image models is to generate pictures by explicitly naming artists, such as "in the style of Greg Rutkowski". We introduce a benchmark for prompted-artist recognition: predicting which artist names were invoked in the prompt from the image alone. The dataset contains 1.95M images covering 110 artists and spans four generalization settings: held-out artists, increasing prompt complexity, multiple-artist prompts, and different text-to-image models. We evaluate feature similarity baselines, contrastive style descriptors, data attribution methods, supervised classifiers, and few-shot prototypical networks. Generalization patterns vary: supervised and few-shot models excel on seen artists and complex prompts, whereas style descriptors transfer better when the artist's style is pronounced; multi-artist prompts remain the most challenging. Our benchmark reveals substantial headroom and provides a public testbed to advance the responsible moderation of text-to-image models. We release the dataset and benchmark to foster further research: https://graceduansu.github.io/IdentifyingPromptedArtists/
Chinese Summary: 本研究提出了一个从文本到图像模型生成的作品中识别艺术家风格的基准,通过评估多种方法在不同泛化场景下的表现,揭示了负责任AI内容审核方面仍有巨大提升空间。
English Summary: This study introduces a benchmark for recognizing artists from images generated by text-to-image models, evaluating various methods across different generalization settings and revealing significant room for improvement in responsible AI moderation.
Authors:Zezhou Yang, Ting Peng, Cuiyun Gao, Chaozheng Wang, Hailiang Huang, Yuetang Deng
Abstract:
Code completion, a crucial task in software engineering that enhances developer productivity, has seen substantial improvements with the rapid advancement of large language models (LLMs). In recent years, retrieval-augmented generation (RAG) has emerged as a promising method to enhance the code completion capabilities of LLMs, which leverages relevant context from codebases without requiring model retraining. While existing studies have demonstrated the effectiveness of RAG on public repositories and benchmarks, the potential distribution shift between open-source and closed-source codebases presents unique challenges that remain unexplored. To mitigate the gap, we conduct an empirical study to investigate the performance of widely-used RAG methods for code completion in the industrial-scale codebase of WeChat, one of the largest proprietary software systems. Specifically, we extensively explore two main types of RAG methods, namely identifier-based RAG and similarity-based RAG, across 26 open-source LLMs ranging from 0.5B to 671B parameters. For a more comprehensive analysis, we employ different retrieval techniques for similarity-based RAG, including lexical and semantic retrieval. Based on 1,669 internal repositories, we achieve several key findings: (1) both RAG methods demonstrate effectiveness in closed-source repositories, with similarity-based RAG showing superior performance, (2) the effectiveness of similarity-based RAG improves with more advanced retrieval techniques, where BM25 (lexical retrieval) and GTE-Qwen (semantic retrieval) achieve superior performance, and (3) the combination of lexical and semantic retrieval techniques yields optimal results, demonstrating complementary strengths. Furthermore, we conduct a developer survey to validate the practical utility of RAG methods in real-world development environments.
中文: 检索增强生成方法显著提升了微信等闭源工业代码库的代码补全能力,其中基于相似性的方法优于基于标识符的方法,而结合词汇与语义检索的策略效果最佳。
English: Retrieval-augmented generation (RAG) methods significantly enhance code completion in closed-source industrial codebases like WeChat, with similarity-based RAG outperforming identifier-based approaches and combined lexical-semantic retrieval yielding optimal results.
Authors:Minxi Ouyang, Lianghui Zhu, Yaqing Bao, Qiang Huang, Jingli Ouyang, Tian Guan, Xitong Ling, Jiawen Li, Song Duan, Wenbin Dai, Li Zheng, Xuemei Zhang, Yonghong He
Abstract:
Multimodal large models have shown great potential in automating pathology image analysis. However, current multimodal models for gastrointestinal pathology are constrained by both data quality and reasoning transparency: pervasive noise and incomplete annotations in public datasets predispose vision language models to factual hallucinations when generating diagnostic text, while the absence of explicit intermediate reasoning chains renders the outputs difficult to audit and thus less trustworthy in clinical practice. To address these issues, we construct a large scale gastrointestinal pathology dataset containing both microscopic descriptions and diagnostic conclusions, and propose a prompt argumentation strategy that incorporates lesion classification and anatomical site information. This design guides the model to better capture image specific features and maintain semantic consistency in generation. Furthermore, we employ a post training pipeline that combines supervised fine tuning with Group Relative Policy Optimization (GRPO) to improve reasoning quality and output structure. Experimental results on real world pathology report generation tasks demonstrate that our approach significantly outperforms state of the art open source and proprietary baselines in terms of generation quality, structural completeness, and clinical relevance. Our solution outperforms state of the art models with 18.7% higher clinical relevance, 32.4% improved structural completeness, and 41.2% fewer diagnostic errors, demonstrating superior accuracy and clinical utility compared to existing solutions.
中文: 本研究提出了一种新型多模态胃肠道病理分析框架,通过整合大规模标注数据集、提示论证策略和GRPO强化训练,有效解决了数据噪声和推理不透明问题,在诊断准确性和临床相关性上显著优于现有模型。
English: This study introduces a novel multimodal framework for gastrointestinal pathology analysis that overcomes data noise and reasoning opacity by integrating a large-scale annotated dataset with a prompt argumentation strategy and GRPO-enhanced training, achieving superior diagnostic accuracy and clinical relevance over existing models.
Authors:Haonan An, Guang Hua, Hangcheng Cao, Zhengru Fang, Guowen Xu, Susanto Rahardja, Yuguang Fang
Abstract:
The intellectual property of deep generative networks (GNets) can be protected using a cascaded hiding network (HNet) which embeds watermarks (or marks) into GNet outputs, known as box-free watermarking. Although both GNet and HNet are encapsulated in a black box (called operation network, or ONet), with only the generated and marked outputs from HNet being released to end users and deemed secure, in this paper, we reveal an overlooked vulnerability in such systems. Specifically, we show that the hidden GNet outputs can still be reliably estimated via query-based reverse engineering, leaking the generated and unmarked images, despite the attacker's limited knowledge of the system. Our first attempt is to reverse-engineer an inverse model for HNet under the stringent black-box condition, for which we propose to exploit the query process with specially curated input images. While effective, this method yields unsatisfactory image quality. To improve this, we subsequently propose an alternative method leveraging the equivalent additive property of box-free model watermarking and reverse-engineering a forward surrogate model of HNet, with better image quality preservation. Extensive experimental results on image processing and image generation tasks demonstrate that both attacks achieve impressive watermark removal success rates (100%) while also maintaining excellent image quality (reaching the highest PSNR of 34.69 dB), substantially outperforming existing attacks, highlighting the urgent need for robust defensive strategies to mitigate the identified vulnerability in box-free model watermarking.
Chinese: 本文揭示了深度生成网络无盒水印技术中的安全漏洞,表明攻击者能够在保持图像高质量的同时有效逆向工程并移除水印,强调了开发更强大防御措施的迫切性。
English: This paper exposes a vulnerability in box-free watermarking for deep generative networks, demonstrating that attackers can effectively reverse-engineer and remove watermarks while maintaining high image quality, thus highlighting the need for stronger defenses.
Authors:Taha Ceritli, Ondrej Bohdal, Mete Ozay, Jijoong Moon, Kyeng-Hun Lee, Hyeonmok Ko, Umberto Michieli
Abstract:
Large language models (LLMs) often leverage adapters, such as low-rank-based adapters, to achieve strong performance on downstream tasks. However, storing a separate adapter for each task significantly increases memory requirements, posing a challenge for resource-constrained environments such as mobile devices. Although model merging techniques can reduce storage costs, they typically result in substantial performance degradation. In this work, we introduce HydraOpt, a new model merging technique that capitalizes on the inherent similarities between the matrices of low-rank adapters. Unlike existing methods that produce a fixed trade-off between storage size and performance, HydraOpt allows us to navigate this spectrum of efficiency and performance. Our experiments show that HydraOpt significantly reduces storage size (48% reduction) compared to storing all adapters, while achieving competitive performance (0.2-1.8% drop). Furthermore, it outperforms existing merging techniques in terms of performance at the same or slightly worse storage efficiency.
中文: HydraOpt是一种新颖的模型融合技术,通过利用低秩适配器矩阵间的相似性,在保持性能仅下降0.2-1.8%的同时显著减少48%的存储需求,在效率与性能的平衡上优于现有方法。
English: HydraOpt is a novel model merging technique that leverages the similarities between low-rank adapter matrices to significantly reduce storage requirements by 48% while maintaining competitive performance with only a 0.2-1.8% drop, outperforming existing methods in balancing efficiency and effectiveness.
Authors:Yi Guo, Wei Wang, Zhihang Yuan, Rong Cao, Kuan Chen, Zhengyang Chen, Yuanyuan Huo, Yang Zhang, Yuping Wang, Shouda Liu, Yuxuan Wang
Abstract:
Generative models like Flow Matching have achieved state-of-the-art performance but are often hindered by a computationally expensive iterative sampling process. To address this, recent work has focused on few-step or one-step generation by learning the average velocity field, which directly maps noise to data. MeanFlow, a leading method in this area, learns this field by enforcing a differential identity that connects the average and instantaneous velocities. In this work, we argue that this differential formulation is a limiting special case of a more fundamental principle. We return to the first principles of average velocity and leverage the additivity property of definite integrals. This leads us to derive a novel, purely algebraic identity we term Interval Splitting Consistency. This identity establishes a self-referential relationship for the average velocity field across different time intervals without resorting to any differential operators. Based on this principle, we introduce SplitMeanFlow, a new training framework that enforces this algebraic consistency directly as a learning objective. We formally prove that the differential identity at the core of MeanFlow is recovered by taking the limit of our algebraic consistency as the interval split becomes infinitesimal. This establishes SplitMeanFlow as a direct and more general foundation for learning average velocity fields. From a practical standpoint, our algebraic approach is significantly more efficient, as it eliminates the need for JVP computations, resulting in simpler implementation, more stable training, and broader hardware compatibility. One-step and two-step SplitMeanFlow models have been successfully deployed in large-scale speech synthesis products (such as Doubao), achieving speedups of 20x.
中文摘要:SplitMeanFlow提出了一种代数化的区间分割一致性原理,将MeanFlow的微分方法推广为更通用的基础框架,通过消除JVP计算实现了更高效的训练,并在语音合成等实际应用中获得了20倍的加速效果。
English Summary: SplitMeanFlow introduces an algebraic Interval Splitting Consistency principle that generalizes MeanFlow's differential approach, enabling more efficient and stable training of one-step generative models while achieving 20x speedup in real-world applications like speech synthesis.
Authors:Hongyin Luo, Nathaniel Morgan, Tina Li, Derek Zhao, Ai Vy Ngo, Philip Schroeder, Lijie Yang, Assaf Ben-Kish, Jack O'Brien, James Glass
Abstract:
To break the context limits of large language models (LLMs) that bottleneck reasoning accuracy and efficiency, we propose the Thread Inference Model (TIM), a family of LLMs trained for recursive and decompositional problem solving, and TIMRUN, an inference runtime enabling long-horizon structured reasoning beyond context limits. Together, TIM hosted on TIMRUN supports virtually unlimited working memory and multi-hop tool calls within a single language model inference, overcoming output limits, positional-embedding constraints, and GPU-memory bottlenecks. Performance is achieved by modeling natural language as reasoning trees measured by both length and depth instead of linear sequences. The reasoning trees consist of tasks with thoughts, recursive subtasks, and conclusions based on the concept we proposed in Schroeder et al, 2025. During generation, we maintain a working memory that retains only the key-value states of the most relevant context tokens, selected by a rule-based subtask-pruning mechanism, enabling reuse of positional embeddings and GPU memory pages throughout reasoning. Experimental results show that our system sustains high inference throughput, even when manipulating up to 90% of the KV cache in GPU memory. It also delivers accurate reasoning on mathematical tasks and handles information retrieval challenges that require long-horizon reasoning and multi-hop tool use.
中文: 线程推理模型(TIM)及其运行时TIMRUN通过将推理构建为可重用内存的树状结构,使大语言模型突破上下文限制,在复杂任务中实现高效准确的推理。
English: The Thread Inference Model (TIM) and its runtime TIMRUN enable large language models to overcome context limitations by structuring reasoning into trees with reusable memory, achieving high efficiency and accuracy in complex tasks.
Authors:Karoline Nylænder, Aitor Arrieta, Shaukat Ali, Paolo Arcaini
Abstract:
Self-adaptation in maritime autonomous vessels (AVs) enables them to adapt their behaviors to address unexpected situations while maintaining dependability requirements. During the design of such AVs, it is crucial to understand and identify the settings that should trigger adaptations, enabling validation of their implementation. To this end, we focus on the navigation software of AVs, which must adapt their behavior during operation through adaptations. AVs often rely on predefined waypoints to guide them along designated routes, ensuring safe navigation. We propose a multiobjective search-based approach, called WPgen, to generate minor modifications to the predefined set of waypoints, keeping them as close as possible to the original waypoints, while causing the AV to navigate inappropriately when navigating with the generated waypoints. WPgen uses NSGA-II as the multi-objective search algorithm with three seeding strategies for its initial population, resulting in three variations of WPgen. We evaluated these variations on three AVs (one overwater tanker and two underwater). We compared the three variations of WPgen with Random Search as the baseline and with each other. Experimental results showed that the effectiveness of these variations varied depending on the AV. Based on the results, we present the research and practical implications of WPgen.
中文: 本研究提出WPgen这一多目标搜索方法,通过对预设航路点进行微调来测试海上自主船舶的适应性导航行为,其有效性因船舶类型而异。
English: The study introduces WPgen, a multiobjective search-based method that generates slight modifications to predefined waypoints in maritime autonomous vessels to test their adaptive navigation behavior, with its effectiveness varying across different vessel types.
Authors:Paul R. B. Houssel, Siamak Layeghy, Priyanka Singh, Marius Portmann
Abstract:
This paper introduces eX-NIDS, a framework designed to enhance interpretability in flow-based Network Intrusion Detection Systems (NIDS) by leveraging Large Language Models (LLMs). In our proposed framework, flows labelled as malicious by NIDS are initially processed through a module called the Prompt Augmenter. This module extracts contextual information and Cyber Threat Intelligence (CTI)-related knowledge from these flows. This enriched, context-specific data is then integrated with an input prompt for an LLM, enabling it to generate detailed explanations and interpretations of why the flow was identified as malicious by NIDS. We compare the generated interpretations against a Basic-Prompt Explainer baseline, which does not incorporate any contextual information into the LLM's input prompt. Our framework is quantitatively evaluated using the Llama 3 and GPT-4 models, employing a novel evaluation method tailored for natural language explanations, focusing on their correctness and consistency. The results demonstrate that augmented LLMs can produce accurate and consistent explanations, serving as valuable complementary tools in NIDS to explain the classification of malicious flows. The use of augmented prompts enhances performance by over 20% compared to the Basic-Prompt Explainer.
中文: 本文提出eX-NIDS框架,通过利用大语言模型增强基于流量的网络入侵检测系统的可解释性,结合上下文和网络威胁情报为恶意流量分类生成详细解释,相比基础提示方法性能提升超过20%。
English: This paper presents eX-NIDS, a framework that improves the interpretability of flow-based Network Intrusion Detection Systems by using Large Language Models to generate detailed explanations for malicious flow classifications, achieving over 20% better performance than basic prompts through enriched context and Cyber Threat Intelligence integration.
Authors:Zipeng Ling, Yuehao Tang, Shuliang Liu, Junqi Yang, Shenghong Fu, Chen Huang, Kejia Huang, Yao Wan, Zhichao Hou, Xuming Hu
Abstract:
Large Language Models (LLMs) frequently output the label Unknown in reasoning tasks, where two scenarios may appear: (i) an input sample is genuinely unverifiable, but the model cannot understand why; and (ii) a verifiable problem that the model fails to solve, thus outputs Unknown. We refer to these cases collectively as the Vague Perception phenomenon. Current evaluations focus on whether such answers are honest, rather than analyzing the limits of LLM reasoning.
To address this, we introduce WakenLLM, a framework that quantifies the portion of Unknown output attributable to model incapacity and evaluates whether stimulation can convert them into either correct answers (verifiable) or justified (unverifiable) responses with valid reasoning. Our method offers a clearer picture of the limits of LLM reasoning and the potential for corrections across various datasets. Comprehensive experiments on six LLMs suggest that, without any training or parameter revision, LLMs can achieve up to a 68.53% accuracy improvement on Vague Perception samples through guided understanding.
Our work reveals that current baseline methods only activate a small portion of LLMs' reasoning potential, indicating considerable unexplored capacity. This extends the theoretical upper bounds of reasoning accuracy in LLMs. Consequently, this study deepens our understanding of the latent reasoning capacity of LLMs and offers a new perspective on addressing the Vague Perception phenomenon.
中文: 本研究提出WakenLLM框架,用于量化和解决大语言模型中的模糊感知现象——即模型因问题真正不可验证或自身推理失败而输出“未知”,实验表明通过引导理解可在不重新训练模型的情况下显著提升推理准确率。
English: The study introduces WakenLLM, a framework that quantifies and addresses the Vague Perception phenomenon in LLMs, where models output Unknown due to either genuine unverifiability or reasoning failures, demonstrating through experiments that guided understanding can significantly improve accuracy without model retraining.
Authors:Zipeng Ling, Yuehao Tang, Shuliang Liu, Junqi Yang, Shenghong Fu, Chen Huang, Kejia Huang, Yao Wan, Zhichao Hou, Xuming Hu
Abstract:
Large Language Models (LLMs) frequently output the label Unknown in reasoning tasks, where two scenarios may appear: (i) an input sample is genuinely unverifiable, but the model cannot understand why; and (ii) a verifiable problem that the model fails to solve, thus outputs Unknown. We refer to these cases collectively as the Vague Perception phenomenon. Current evaluations focus on whether such answers are honest, rather than analyzing the limits of LLM reasoning. To address this, we introduce WakenLLM, a framework that quantifies the portion of Unknown output attributable to model incapacity and evaluates whether stimulation can convert them into either correct answers (verifiable) or justified (unverifiable) responses with valid reasoning. Our method offers a clearer picture of the limits of LLM reasoning and the potential for corrections across various datasets. Comprehensive experiments on six LLMs suggest that, without any training or parameter revision, LLMs can achieve up to a 68.53% accuracy improvement on Vague Perception samples through guided understanding. Our work reveals that current baseline methods only activate a small portion of LLMs' reasoning potential, indicating considerable unexplored capacity. This extends the theoretical upper bounds of reasoning accuracy in LLMs. Consequently, this study deepens our understanding of the latent reasoning capacity of LLMs and offers a new perspective on addressing the Vague Perception phenomenon.
中文: 本研究提出WakenLLM框架,用于量化和解决大语言模型中的模糊感知现象——即模型因问题真正不可验证或自身推理失败而输出“未知”,实验表明通过引导理解可在不重新训练模型的情况下显著提升推理准确率。
English: The study introduces WakenLLM, a framework that quantifies and addresses the Vague Perception phenomenon in LLMs, where models output Unknown due to either genuine unverifiability or reasoning failures, demonstrating through experiments that guided understanding can significantly improve accuracy without model retraining.
Authors:Ondrej Bohdal, Mete Ozay, Jijoong Moon, Kyeng-Hun Lee, Hyeonmok Ko, Umberto Michieli
Abstract:
Adapter parameters provide a mechanism to modify the behavior of machine learning models and have gained significant popularity in the context of large language models (LLMs) and generative AI. These parameters can be merged to support multiple tasks via a process known as task merging. However, prior work on merging in LLMs, particularly in natural language processing, has been limited to scenarios where each test example addresses only a single task. In this paper, we focus on on-device settings and study the problem of text-based compositional multi-tasking, where each test example involves the simultaneous execution of multiple tasks. For instance, generating a translated summary of a long text requires solving both translation and summarization tasks concurrently. To facilitate research in this setting, we propose a benchmark comprising four practically relevant compositional tasks. We also present an efficient method (Learnable Calibration) tailored for on-device applications, where computational resources are limited, emphasizing the need for solutions that are both resource-efficient and high-performing. Our contributions lay the groundwork for advancing the capabilities of LLMs in real-world multi-tasking scenarios, expanding their applicability to complex, resource-constrained use cases.
中文: 本文针对设备端大语言模型提出了一种组合多任务基准和高效的“可学习校准”方法,实现在资源受限环境下同时执行翻译与摘要等任务。
English: This paper introduces a benchmark and an efficient Learnable Calibration method for compositional multi-tasking in on-device LLMs, enabling concurrent execution of tasks like translation and summarization under resource constraints.
Authors:Nicolas Lanzetti, Sylvain Fricker, Saverio Bolognani, Florian Dörfler, Dario Paccagnan
Abstract:
In many game-theoretic settings, agents are challenged with taking decisions against the uncertain behavior exhibited by others. Often, this uncertainty arises from multiple sources, e.g., incomplete information, limited computation, bounded rationality. While it may be possible to guide the agents' decisions by modeling each source, their joint presence makes this task particularly daunting. Toward this goal, it is natural for agents to seek protection against deviations around the emergent behavior itself, which is ultimately impacted by all the above sources of uncertainty. To do so, we propose that each agent takes decisions in face of the worst-case behavior contained in an ambiguity set of tunable size, centered at the emergent behavior so implicitly defined. This gives rise to a novel equilibrium notion, which we call strategically robust equilibrium. Building on its definition, we show that, when judiciously operationalized via optimal transport, strategically robust equilibria (i) are guaranteed to exist under the same assumptions required for Nash equilibria; (ii) interpolate between Nash and security strategies; (iii) come at no additional computational cost compared to Nash equilibria. Through a variety of experiments, including bi-matrix games, congestion games, and Cournot competition, we show that strategic robustness protects against uncertainty in the opponents' behavior and, surprisingly, often results in higher equilibrium payoffs - an effect we refer to as coordination via robustification.
This paper introduces a strategically robust equilibrium concept where agents make decisions against worst-case behaviors within an adjustable ambiguity set centered on emergent behavior, ensuring existence under Nash equilibrium assumptions while providing protection against behavioral uncertainty and often yielding higher payoffs.
English Summary:
Authors:Lucian Cristian Iacob, Roland Tóth, Maarten Schoukens
Abstract:
The challenge of finding exact and finite-dimensional Koopman embeddings of nonlinear systems has been largely circumvented by employing data-driven techniques to learn models of different complexities (e.g., linear, bilinear, input affine). Although these models may provide good accuracy, selecting the model structure and dimension is still ad-hoc and it is difficult to quantify the error that is introduced. In contrast to the general trend of data-driven learning, in this paper, we develop a systematic technique for nonlinear systems that produces a finite-dimensional and exact embedding. If the nonlinear system is represented as a network of series and parallel linear and nonlinear (polynomial) blocks, one can derive an associated Koopman model that has constant state and output matrices and the input influence is polynomial. Furthermore, if the linear blocks do not have feedthrough, the Koopman representation simplifies to a bilinear model.
Chinese: 本文提出了一种系统方法,可从由线性和多项式模块组成的非线性系统中推导出精确的有限维Koopman嵌入,与依赖随意模型选择且难以量化误差的数据驱动方法形成鲜明对比。
English: This paper introduces a systematic method for deriving exact, finite-dimensional Koopman embeddings from nonlinear systems structured as networks of linear and polynomial blocks, contrasting with data-driven approaches that often involve arbitrary model selection and unquantified errors.
Authors:Yuanhan Zhang, Yunice Chew, Yuhao Dong, Aria Leo, Bo Hu, Ziwei Liu
Abstract:
Human intelligence requires correctness and robustness, with the former being foundational for the latter. In video understanding, correctness ensures the accurate interpretation of visual content, and robustness maintains consistent performance in challenging conditions. Despite advances in video large language models (video LLMs), existing benchmarks inadequately reflect the gap between these models and human intelligence in maintaining correctness and robustness in video interpretation. We introduce the Video Thinking Test (Video-TT), to assess if video LLMs can interpret real-world videos as effectively as humans. Video-TT reflects genuine gaps in understanding complex visual narratives, and evaluates robustness against natural adversarial questions. Video-TT comprises 1,000 YouTube Shorts videos, each with one open-ended question and four adversarial questions that probe visual and narrative complexity. Our evaluation shows a significant gap between video LLMs and human performance.
中文: 视频思维测试(Video-TT)旨在评估视频大语言模型在理解复杂视觉叙事和应对对抗性问题时的表现,结果显示其与人类能力存在显著差距。
English: The Video Thinking Test (Video-TT) is introduced to evaluate video large language models' ability to interpret complex visual narratives and maintain robustness against adversarial questions, revealing a significant performance gap compared to humans.
Authors:Ruoyu Su, Noman Ahmad, Matteo Esposito, Andrea Janes, Davide Taibi, Valentina Lenarduzzi
Abstract:
Software architecture plays a central role in the design, development, and maintenance of software systems. With the rise of cloud computing, microservices, and containers, architectural practices have diversified. Understanding these shifts is vital. This study analyzes software architecture trends across eight leading industry conferences over five years. We investigate the evolution of software architecture by analyzing talks from top practitioner conferences, focusing on the motivations and contexts driving technology adoption. We analyzed 5,677 talks from eight major industry conferences, using large language models and expert validation to extract technologies, their purposes, and usage contexts. We also explored how technologies interrelate and fit within DevOps and deployment pipelines. Among 450 technologies, Kubernetes, Cloud Native, Serverless, and Containers dominate by frequency and centrality. Practitioners present technology mainly related to deployment, communication, AI, and observability. We identify five technology communities covering automation, coordination, cloud AI, monitoring, and cloud-edge. Most technologies span multiple DevOps stages and support hybrid deployment. Our study reveals that a few core technologies, like Kubernetes and Serverless, dominate the contemporary software architecture practice. These are mainly applied in later DevOps stages, with limited focus on early phases like planning and coding. We also show how practitioners frame technologies by purpose and context, reflecting evolving industry priorities. Finally, we observe how only research can provide a more holistic lens on architectural design, quality, and evolution.
中文摘要:本研究通过分析5,677个会议演讲发现,以Kubernetes和无服务器架构为核心的技术主导着现代软件架构实践,这些技术主要应用于DevOps后期阶段,对早期开发环节关注有限。
English Summary: This study analyzes software architecture trends from 5,677 conference talks, revealing that core technologies like Kubernetes and Serverless dominate modern practices, primarily applied in later DevOps stages with limited early-phase focus.
Authors:Noman Ahmad, Ruoyu Su, Matteo Esposito, Andrea Janes, Valentina Lenarduzzi, Davide Taibi
Abstract:
Architectural degradation, also known as erosion, decay, or aging, impacts system quality, maintainability, and adaptability. Although widely acknowledged, current literature shows fragmented definitions, metrics, and remediation strategies. Our study aims to unify understanding of architectural degradation by identifying its definitions, causes, metrics, tools, and remediation approaches across academic and gray literature. We conducted a multivocal literature review of 108 studies extracting definitions, causes, metrics, measurement approaches, tools, and remediation strategies. We developed a taxonomy encompassing architectural, code, and process debt to explore definition evolution, methodological trends, and research gaps. Architectural degradation has shifted from a low-level issue to a socio-technical concern. Definitions now address code violations, design drift, and structural decay. Causes fall under architectural (e.g., poor documentation), code (e.g., hasty fixes), and process debt (e.g., knowledge loss). We identified 54 metrics and 31 measurement techniques, focused on smells, cohesion/coupling, and evolution. Yet, most tools detect issues but rarely support ongoing or preventive remediation. Degradation is both technical and organizational. While detection is well-studied, continuous remediation remains lacking. Our study reveals missed integration between metrics, tools, and repair logic, urging holistic, proactive strategies for sustainable architecture.
中文摘要:本研究通过文献综述整合了架构退化研究,揭示其已发展为涵盖技术与社会因素的问题,虽已识别成因、度量方法和工具,但在持续修复策略方面仍存在明显不足,与成熟的检测能力形成反差。
English Summary: This study synthesizes architectural degradation research through a literature review, revealing its evolution into a socio-technical issue with identified causes, metrics, and tools, yet highlighting a critical gap in continuous remediation strategies despite extensive detection capabilities.
Authors:Junho Choi, Kihwan Ryoo, Jeewon Kim, Taeyun Kim, Eungchang Lee, Myeongwoo Jeong, Kevin Christiansen Marsim, Hyungtae Lim, Hyun Myung
Abstract:
Multi-robot localization is a crucial task for implementing multi-robot systems. Numerous researchers have proposed optimization-based multi-robot localization methods that use camera, IMU, and UWB sensors. Nevertheless, characteristics of individual robot odometry estimates and distance measurements between robots used in the optimization are not sufficiently considered. In addition, previous researches were heavily influenced by the odometry accuracy that is estimated from individual robots. Consequently, long-term drift error caused by error accumulation is potentially inevitable. In this paper, we propose a novel visual-inertial-range-based multi-robot localization method, named SaWa-ML, which enables geometric structure-aware pose correction and weight adaptation-based robust multi-robot localization. Our contributions are twofold: (i) we leverage UWB sensor data, whose range error does not accumulate over time, to first estimate the relative positions between robots and then correct the positions of each robot, thus reducing long-term drift errors, (ii) we design adaptive weights for robot pose correction by considering the characteristics of the sensor data and visual-inertial odometry estimates. The proposed method has been validated in real-world experiments, showing a substantial performance increase compared with state-of-the-art algorithms.
中文: 本文提出了一种名为SaWa-ML的新型多机器人定位方法,通过视觉-惯性-距离数据实现位姿校正和自适应权重调整,有效减少长期漂移误差,并在实际实验中显著优于现有算法。
English: This paper introduces SaWa-ML, a novel multi-robot localization method that uses visual-inertial-range data to correct poses and adapt weights, effectively reducing long-term drift errors and outperforming existing algorithms in real-world tests.
Authors:Shihao Wang, Guo Chen, De-an Huang, Zhiqi Li, Minghan Li, Guilin Li, Jose M. Alvarez, Lei Zhang, Zhiding Yu
Abstract:
Recent studies have revealed that selecting informative and relevant video frames can significantly improve the performance of Video Large Language Models (Video-LLMs). Current methods, such as reducing inter-frame redundancy, employing separate models for image-text relevance assessment, or utilizing temporal video grounding for event localization, substantially adopt unsupervised learning paradigms, whereas they struggle to address the complex scenarios in long video understanding. We propose Instructed Temporal Grounding for Videos (VideoITG), featuring customized frame sampling aligned with user instructions. The core of VideoITG is the VidThinker pipeline, an automated annotation framework that explicitly mimics the human annotation process. First, it generates detailed clip-level captions conditioned on the instruction; then, it retrieves relevant video segments through instruction-guided reasoning; finally, it performs fine-grained frame selection to pinpoint the most informative visual evidence. Leveraging VidThinker, we construct the VideoITG-40K dataset, containing 40K videos and 500K instructed temporal grounding annotations. We then design a plug-and-play VideoITG model, which takes advantage of visual language alignment and reasoning capabilities of Video-LLMs, for effective frame selection in a discriminative manner. Coupled with Video-LLMs, VideoITG achieves consistent performance improvements across multiple multimodal video understanding benchmarks, showing its superiority and great potentials for video understanding.
中文: 提出的VideoITG方法通过指令引导的VidThinker框架选择关键视频帧,构建大规模标注数据集,并以判别式采样提升视频大语言模型在多项理解任务中的性能表现。
English: The proposed VideoITG method enhances Video-LLMs by using an instruction-guided frame selection pipeline called VidThinker, which creates a large annotated dataset and improves performance on video understanding benchmarks through discriminative frame sampling.
Authors:Chao Feng, Alberto Huertas Celdran, Jing Han, Heqing Ren, Xi Cheng, Zien Zeng, Lucas Krauter, Gerome Bovet, Burkhard Stiller
Abstract:
This paper introduces a dataset and experimental study for decentralized federated learning (DFL) applied to IoT crowdsensing malware detection. The dataset comprises behavioral records from benign and eight malware families. A total of 21,582,484 original records were collected from system calls, file system activities, resource usage, kernel events, input/output events, and network records. These records were aggregated into 30-second windows, resulting in 342,106 features used for model training and evaluation. Experiments on the DFL platform compare traditional machine learning (ML), centralized federated learning (CFL), and DFL across different node counts, topologies, and data distributions. Results show that DFL maintains competitive performance while preserving data locality, outperforming CFL in most settings. This dataset provides a solid foundation for studying the security of IoT crowdsensing environments.
中文摘要:本文提出了用于物联网恶意软件检测的去中心化联邦学习数据集和实验框架,研究表明DFL在保持数据本地化的同时,在多数场景下性能优于中心化联邦学习方案。
English Summary: This paper presents a decentralized federated learning dataset and framework for IoT malware detection, demonstrating that DFL achieves competitive performance while preserving data privacy across various network conditions.
Authors:Wangzheng Shi, Yinglin Zheng, Yuxin Lin, Jianmin Bao, Ming Zeng, Dong Chen
Abstract:
Hair transfer is increasingly valuable across domains such as social media, gaming, advertising, and entertainment. While significant progress has been made in single-image hair transfer, video-based hair transfer remains challenging due to the need for temporal consistency, spatial fidelity, and dynamic adaptability. In this work, we propose HairShifter, a novel "Anchor Frame + Animation" framework that unifies high-quality image hair transfer with smooth and coherent video animation. At its core, HairShifter integrates a Image Hair Transfer (IHT) module for precise per-frame transformation and a Multi-Scale Gated SPADE Decoder to ensure seamless spatial blending and temporal coherence. Our method maintains hairstyle fidelity across frames while preserving non-hair regions. Extensive experiments demonstrate that HairShifter achieves state-of-the-art performance in video hairstyle transfer, combining superior visual quality, temporal consistency, and scalability. The code will be publicly available. We believe this work will open new avenues for video-based hairstyle transfer and establish a robust baseline in this field.
中文: HairShifter提出创新的"锚定帧+动画"框架,将高质量图像发型迁移与流畅视频动画相统一,在视频发型迁移中实现了时序一致性与视觉保真度的最优性能。
English: HairShifter introduces an innovative "Anchor Frame + Animation" framework that unifies high-quality image hair transfer with smooth video animation, achieving state-of-the-art performance in temporal consistency and visual fidelity for video hairstyle transfer.
Authors:Manuel Barusco, Francesco Borsatti, Arianna Stropeni, Davide Dalle Pezze, Gian Antonio Susto
Abstract:
VAD is a critical field in machine learning focused on identifying deviations from normal patterns in images, often challenged by the scarcity of anomalous data and the need for unsupervised training. To accelerate research and deployment in this domain, we introduce MoViAD, a comprehensive and highly modular library designed to provide fast and easy access to state-of-the-art VAD models, trainers, datasets, and VAD utilities. MoViAD supports a wide array of scenarios, including continual, semi-supervised, few-shots, noisy, and many more. In addition, it addresses practical deployment challenges through dedicated Edge and IoT settings, offering optimized models and backbones, along with quantization and compression utilities for efficient on-device execution and distributed inference. MoViAD integrates a selection of backbones, robust evaluation VAD metrics (pixel-level and image-level) and useful profiling tools for efficiency analysis. The library is designed for fast, effortless deployment, enabling machine learning engineers to easily use it for their specific setup with custom models, datasets, and backbones. At the same time, it offers the flexibility and extensibility researchers need to develop and experiment with new methods.
中文: MoViAD是一个模块化库,通过提供先进的模型、数据集和工具,加速视频异常检测的研究与部署,同时支持多种场景并解决边缘和物联网等实际部署挑战。
English: MoViAD is a modular library that accelerates video anomaly detection research and deployment by providing access to state-of-the-art models, datasets, and utilities, while supporting diverse scenarios and addressing practical challenges like Edge and IoT deployment.
Authors:Rongtao Xu, Han Gao, Mingming Yu, Dong An, Shunpeng Chen, Changwei Wang, Li Guo, Xiaodan Liang, Shibiao Xu
Abstract:
With the growing need for diverse and scalable data in indoor scene tasks, such as question answering and dense captioning, we propose 3D-MoRe, a novel paradigm designed to generate large-scale 3D-language datasets by leveraging the strengths of foundational models. The framework integrates key components, including multi-modal embedding, cross-modal interaction, and a language model decoder, to process natural language instructions and 3D scene data. This approach facilitates enhanced reasoning and response generation in complex 3D environments. Using the ScanNet 3D scene dataset, along with text annotations from ScanQA and ScanRefer, 3D-MoRe generates 62,000 question-answer (QA) pairs and 73,000 object descriptions across 1,513 scenes. We also employ various data augmentation techniques and implement semantic filtering to ensure high-quality data. Experiments on ScanQA demonstrate that 3D-MoRe significantly outperforms state-of-the-art baselines, with the CIDEr score improving by 2.15\%. Similarly, on ScanRefer, our approach achieves a notable increase in CIDEr@0.5 by 1.84\%, highlighting its effectiveness in both tasks. Our code and generated datasets will be publicly released to benefit the community, and both can be accessed on the https://3D-MoRe.github.io.
中文:3D-MoRe框架利用基础模型生成大规模3D语言数据集,在ScanNet数据集上的问答和密集描述任务中显著超越了现有最优基准。
English: The 3D-MoRe framework generates large-scale 3D-language datasets using foundational models, significantly outperforming state-of-the-art baselines in tasks like question answering and dense captioning on the ScanNet dataset.
Authors:Robby Hoover, Nelly Elsayed, Zag ElSayed, Chengcheng Li
Abstract:
Medical Imagings are considered one of the crucial diagnostic tools for different bones-related diseases, especially bones fractures. This paper investigates the robustness of pre-trained deep learning models for classifying bone fractures in X-ray images and seeks to address global healthcare disparity through the lens of technology. Three deep learning models have been tested under varying simulated equipment quality conditions. ResNet50, VGG16 and EfficientNetv2 are the three pre-trained architectures which are compared. These models were used to perform bone fracture classification as images were progressively degraded using noise. This paper specifically empirically studies how the noise can affect the bone fractures detection and how the pre-trained models performance can be changes due to the noise that affect the quality of the X-ray images. This paper aims to help replicate real world challenges experienced by medical imaging technicians across the world. Thus, this paper establishes a methodological framework for assessing AI model degradation using transfer learning and controlled noise augmentation. The findings provide practical insight into how robust and generalizable different pre-trained deep learning powered computer vision models can be when used in different contexts.
中文摘要:本研究通过模拟噪声条件评估了ResNet50、VGG16和EfficientNetv2三种预训练深度学习模型在X射线图像中骨折分类的鲁棒性,旨在通过测试模型性能退化建立应对现实医疗挑战的评估框架,以解决全球医疗资源不均问题。
English Summary: This study evaluates the robustness of three pre-trained deep learning models—ResNet50, VGG16, and EfficientNetv2—for classifying bone fractures in X-ray images under simulated noise conditions, aiming to address healthcare disparities by testing model performance degradation and establishing a framework for real-world application assessment.
Authors:Lucian Cristian Iacob, Máté Szécsi, Gerben Izaak Beintema, Maarten Schoukens, Roland Tóth
Abstract:
This paper presents a novel identification approach of Koopman models of nonlinear systems with inputs under rather general noise conditions. The method uses deep state-space encoders based on the concept of state reconstructability and an efficient multiple-shooting formulation of the squared loss of the prediction error to estimate the dynamics and the lifted state from input-output data. Furthermore, the Koopman model structure includes an innovation noise term that is used to handle process and measurement noise. It is shown that the proposed approach is statistically consistent and computationally efficient due to the multiple-shooting formulation where, on subsections of the data, multi-step prediction errors can be calculated in parallel. The latter allows for efficient batch optimization of the network parameters and, at the same time, excellent long-term prediction capabilities of the obtained models. The performance of the approach is illustrated by nonlinear benchmark examples.
中文: 本文提出了一种新颖的含输入非线性系统的Koopman模型辨识方法,采用基于状态可重构性的深度状态空间编码器和多重打靶法,在通用噪声条件下实现了统计一致性与计算效率。
English: This paper introduces a novel Koopman model identification method for nonlinear systems with inputs, utilizing deep state-space encoders and a multiple-shooting formulation to achieve statistical consistency and computational efficiency under general noise conditions.
Authors:Shota Horiguchi, Naohiro Tawara, Takanori Ashihara, Atsushi Ando, Marc Delcroix
Abstract:
Neural speaker diarization is widely used for overlap-aware speaker diarization, but it requires large multi-speaker datasets for training. To meet this data requirement, large datasets are often constructed by combining multiple corpora, including those originally designed for multi-speaker automatic speech recognition (ASR). However, ASR datasets often feature loosely defined segment boundaries that do not align with the stricter conventions of diarization benchmarks. In this work, we show that such boundary looseness significantly impacts the diarization error rate, reducing evaluation reliability. We also reveal that models trained on data with varying boundary precision tend to learn dataset-specific looseness, leading to poor generalization across out-of-domain datasets. Training with standardized tight boundaries via forced alignment improves not only diarization performance, especially in streaming scenarios, but also ASR performance when combined with simple post-processing.
中文: 神经说话人日志系统需要大量多人语音数据进行训练,但使用边界松散的自动语音识别数据集会影响评估可靠性和模型泛化能力,而通过强制对齐采用标准化紧边界训练可有效改善性能。
English: Neural speaker diarization requires large multi-speaker datasets for training, but using ASR datasets with loose segment boundaries harms evaluation reliability and model generalization, which can be improved by training with standardized tight boundaries via forced alignment.
Authors:Leiyu Pan, Bojian Xiong, Lei Yang, Renren Jin, Shaowei Zhang, Yue Chen, Ling Shi, Jiang Zhou, Junru Wu, Zhen Wang, Jianxiang Peng, Juesi Xiao, Tianyu Dong, Zhuowen Han, Zhuo Chen, Yuqi Ren, Deyi Xiong
Abstract:
Large language models have achieved remarkable progress across many languages. However, Tibetan, as a representative low-resource language, is particularly underrepresented in existing models due to the scarcity of high-quality training corpora. To address this gap, we curate the largest Tibetan pre-training corpus to date, aggregating data from diverse sources and applying a dedicated data cleaning and processing pipeline tailored for Tibetan. With the curated data, we continue pre/post-training a multilingual base model to enhance its generative capabilities in Tibetan. To evaluate the Tibetan capabilities of the model, we create new high-quality Tibetan benchmarks, and complement them with existing public benchmarks. Experimental results demonstrate that our model consistently and significantly outperforms both open-source models of similar scale and Tibetan-tailored models across a wide range of tasks.
中文: 研究人员构建了迄今最大的藏语预训练语料库,并通过增强多语言模型,使其在多种藏语任务上显著超越现有模型。
English: Researchers have compiled the largest Tibetan pre-training corpus to date and enhanced a multilingual model, which significantly outperforms existing models across various Tibetan language tasks.
Authors:Wuxin Wang, Weicheng Ni, Lilan Huang, Tao Hao, Ben Fei, Shuo Ma, Taikang Yuan, Yanlai Zhao, Kefeng Deng, Xiaoyong Li, Boheng Duan, Lei Bai, Kaijun Ren
Abstract:
Recent advancements in Artificial Intelligence (AI) demonstrate significant potential to revolutionize weather forecasting. However, most AI-driven models rely on Numerical Weather Prediction (NWP) systems for initial condition preparation, which often consumes hours on supercomputers. Here we introduce XiChen, the first observation-scalable fully AI-driven global weather forecasting system, whose entire pipeline, from Data Assimilation (DA) to medium-range forecasting, can be accomplished within only 17 seconds. XiChen is built upon a foundation model that is pre-trained for weather forecasting. Meanwhile, this model is subsequently fine-tuned to serve as both observation operators and DA models, thereby scalably assimilating conventional and raw satellite observations. Furthermore, the integration of four-dimensional variational knowledge ensures that XiChen's DA and medium-range forecasting accuracy rivals that of operational NWP systems, amazingly achieving a skillful forecasting lead time exceeding 8.25 days. These findings demonstrate that XiChen holds strong potential toward fully AI-driven weather forecasting independent of NWP systems.
中文摘要:曦辰是全球首个全AI驱动的天气预测系统,仅用17秒即可完成从数据同化到中期预报的全流程,其预报精度媲美业务数值天气预报系统,且有效预报时长突破8.25天。
English Summary: XiChen is the first fully AI-driven global weather forecasting system that completes data assimilation to medium-range forecasting in just 17 seconds, achieving accuracy comparable to operational NWP systems with over 8.25 days of skillful forecasting lead time.
Authors:Wasi Uddin Ahmad, Somshubra Majumdar, Aleksander Ficek, Sean Narenthiran, Mehrzad Samadi, Jocelyn Huang, Siddhartha Jain, Vahid Noroozi, Boris Ginsburg
Abstract:
Recent advancements in reasoning-based Large Language Models (LLMs), particularly their potential through test-time scaling, have created significant opportunities for distillation in code generation and critique. However, progress in both areas fundamentally depends on large-scale, high-quality datasets. In this work, we introduce OpenCodeReasoning-II, a dataset consists of 2.5M question-solution-critique triples (approx. 35K unique programming questions), making it nearly twice the size of the previous largest publicly available code reasoning dataset. In this work, we employ a two-stage supervised fine-tuning strategy. The first stage focuses on fine-tuning for code generation, while the second stage involves the joint training of models for both code generation and critique. Our resulting finetuned Qwen2.5-Instruct models achieve performance in code generation that either exceeds or equals the best prior open-weight distilled models. Notably, the integration of our code generation and critique models leads to significant improvements in competitive coding performance. Furthermore, we present an extension of the LiveCodeBench benchmark to specifically support the C++ programming language, thereby facilitating more comprehensive LLM evaluation using this benchmark.
中文:OpenCodeReasoning-II数据集包含250万组问答-解决方案-评价三元组,通过两阶段微调策略训练的Qwen2.5-Instruct模型在代码生成方面达到最优性能,并结合代码评价功能显著提升了竞技编程表现。
English: The OpenCodeReasoning-II dataset, with 2.5M question-solution-critique triples, enables a two-stage fine-tuning approach that produces Qwen2.5-Instruct models achieving state-of-the-art code generation performance and enhanced competitive coding through integrated critique capabilities.
Authors:Marina Ceccon, Giandomenico Cornacchia, Davide Dalle Pezze, Alessandro Fabris, Gian Antonio Susto
Abstract:
Undesirable biases encoded in the data are key drivers of algorithmic discrimination. Their importance is widely recognized in the algorithmic fairness literature, as well as legislation and standards on anti-discrimination in AI. Despite this recognition, data biases remain understudied, hindering the development of computational best practices for their detection and mitigation. In this work, we present three common data biases and study their individual and joint effect on algorithmic discrimination across a variety of datasets, models, and fairness measures. We find that underrepresentation of vulnerable populations in training sets is less conducive to discrimination than conventionally affirmed, while combinations of proxies and label bias can be far more critical. Consequently, we develop dedicated mechanisms to detect specific types of bias, and combine them into a preliminary construct we refer to as the Data Bias Profile (DBP). This initial formulation serves as a proof of concept for how different bias signals can be systematically documented. Through a case study with popular fairness datasets, we demonstrate the effectiveness of the DBP in predicting the risk of discriminatory outcomes and the utility of fairness-enhancing interventions. Overall, this article bridges algorithmic fairness research and anti-discrimination policy through a data-centric lens.
中文: 本研究识别了三种常见数据偏见,发现代理偏见与标签偏见结合比弱势群体代表性不足更具歧视风险,并提出数据偏见档案以系统检测偏见并预测歧视性结果。
English: This study identifies three common data biases, revealing that proxy and label biases combined pose greater discrimination risks than underrepresented groups, and introduces a Data Bias Profile to systematically detect biases and predict discriminatory outcomes.
Authors:Haitao Lin, Junjie Wang, Zhifeng Gao, Xiaohong Ji, Rong Zhu, Linfeng Zhang, Guolin Ke, Weinan E
Abstract:
The essence of a chemical reaction lies in the redistribution and reorganization of electrons, which is often manifested through electron transfer or the migration of electron pairs. These changes are inherently discrete and abrupt in the physical world, such as alterations in the charge states of atoms or the formation and breaking of chemical bonds. To model the transition of states, we propose SynBridge, a bidirectional flow-based generative model to achieve multi-task reaction prediction. By leveraging a graph-to-graph transformer network architecture and discrete flow bridges between any two discrete distributions, SynBridge captures bidirectional chemical transformations between graphs of reactants and products through the bonds' and atoms' discrete states. We further demonstrate the effectiveness of our method through extensive experiments on three benchmark datasets (USPTO-50K, USPTO-MIT, Pistachio), achieving state-of-the-art performance in both forward and retrosynthesis tasks. Our ablation studies and noise scheduling analysis reveal the benefits of structured diffusion over discrete spaces for reaction prediction.
中文: SynBridge是一种基于双向流的生成模型,通过图到图的转换网络和离散流桥技术,在多个基准数据集上实现了前向和逆合成反应预测的最先进性能。
English: SynBridge is a bidirectional flow-based generative model that employs a graph-to-graph transformer and discrete flow bridges to achieve state-of-the-art performance in both forward and retrosynthesis reaction prediction across multiple benchmark datasets.
Authors:Mathias Zinnen, Prathmesh Madhu, Inger Leemans, Peter Bell, Azhar Hussian, Hang Tran, Ali HürriyetoÄlu, Andreas Maier, Vincent Christlein
Abstract:
Real-world applications of computer vision in the humanities require algorithms to be robust against artistic abstraction, peripheral objects, and subtle differences between fine-grained target classes. Existing datasets provide instance-level annotations on artworks but are generally biased towards the image centre and limited with regard to detailed object classes. The proposed ODOR dataset fills this gap, offering 38,116 object-level annotations across 4712 images, spanning an extensive set of 139 fine-grained categories. Conducting a statistical analysis, we showcase challenging dataset properties, such as a detailed set of categories, dense and overlapping objects, and spatial distribution over the whole image canvas. Furthermore, we provide an extensive baseline analysis for object detection models and highlight the challenging properties of the dataset through a set of secondary studies. Inspiring further research on artwork object detection and broader visual cultural heritage studies, the dataset challenges researchers to explore the intersection of object recognition and smell perception.
中文摘要:ODOR数据集通过提供大量细粒度物体标注,弥补了现有艺术品数据集的不足,支持人文学科中稳健的计算机视觉应用,其复杂视觉特性对模型提出挑战,并推动物体识别与嗅觉感知交叉领域的研究。
English Summary: The ODOR dataset addresses limitations in existing artwork datasets by providing extensive, fine-grained object annotations to support robust computer vision applications in the humanities, challenging models with complex visual properties and encouraging research at the intersection of object recognition and olfactory perception.
Authors:Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Takafumi Moriya, Takanori Ashihara, Ryo Masumura
Abstract:
Single-channel speech enhancement is utilized in various tasks to mitigate the effect of interfering signals. Conventionally, to ensure the speech enhancement performs optimally, the speech enhancement has needed to be tuned for each task. Thus, generalizing speech enhancement models to unknown downstream tasks has been challenging. This study aims to construct a generic speech enhancement front-end that can improve the performance of back-ends to solve multiple downstream tasks. To this end, we propose a novel training criterion that minimizes the distance between the enhanced and the ground truth clean signal in the feature representation domain of self-supervised learning models. Since self-supervised learning feature representations effectively express high-level speech information useful for solving various downstream tasks, the proposal is expected to make speech enhancement models preserve such information. Experimental validation demonstrates that the proposal improves the performance of multiple speech tasks while maintaining the perceptual quality of the enhanced signal.
中文: 本研究通过最小化增强语音与纯净语音在自监督学习特征域中的距离,构建了一种通用语音增强前端,能在保持感知质量的同时提升多种下游任务的性能。
English: This study develops a universal speech enhancement front-end by minimizing the distance between enhanced and clean signals in self-supervised learning feature domains, improving performance across multiple downstream tasks while preserving perceptual quality.
Authors:Mihir Parmar, Palash Goyal, Xin Liu, Yiwen Song, Mingyang Ling, Chitta Baral, Hamid Palangi, Tomas Pfister
Abstract:
Recently, decomposing complex problems into simple subtasks--a crucial part of human-like natural planning--to solve the given problem has significantly boosted the performance of large language models (LLMs). However, leveraging such planning structures during post-training to boost the performance of smaller open-source LLMs remains underexplored. Motivated by this, we introduce PLAN-TUNING, a unified post-training framework that (i) distills synthetic task decompositions (termed "planning trajectories") from large-scale LLMs and (ii) fine-tunes smaller models via supervised and reinforcement-learning objectives designed to mimic these planning processes to improve complex reasoning. On GSM8k and the MATH benchmarks, plan-tuned models outperform strong baselines by an average $\sim7\%$. Furthermore, plan-tuned models show better generalization capabilities on out-of-domain datasets, with average $\sim10\%$ and $\sim12\%$ performance improvements on OlympiadBench and AIME 2024, respectively. Our detailed analysis demonstrates how planning trajectories improves complex reasoning capabilities, showing that PLAN-TUNING is an effective strategy for improving task-specific performance of smaller LLMs.
中文摘要:PLAN-TUNING是一种新颖的后训练框架,通过从大型模型提取规划轨迹并结合监督与强化学习微调小型模型,显著提升了数学推理能力并在跨领域数据集上展现出更强的泛化性能。
English Summary: PLAN-TUNING is a novel post-training framework that enhances smaller LLMs' complex reasoning by distilling planning trajectories from larger models and fine-tuning through supervised and reinforcement learning, achieving significant performance gains on mathematical benchmarks and improved generalization on out-of-domain datasets.
Authors:Pat Pataranutaporn, Nattavudh Powdthavee, Chayapatr Archiwaranguprok, Pattie Maes
Abstract:
Subjective well-being is a key metric in economic, medical, and policy decision-making. As artificial intelligence provides scalable tools for modelling human outcomes, it is crucial to evaluate whether large language models (LLMs) can accurately predict well-being across diverse global populations. We evaluate four leading LLMs using data from 64,000 individuals in 64 countries. While LLMs capture broad correlates such as income and health, their predictive accuracy decreases in countries underrepresented in the training data, highlighting systematic biases rooted in global digital and economic inequality. A pre-registered experiment demonstrates that LLMs rely on surface-level linguistic similarity rather than conceptual understanding, leading to systematic misestimations in unfamiliar or resource-limited settings. Injecting findings from underrepresented contexts substantially enhances performance, but a significant gap remains. These results highlight both the promise and limitations of LLMs in predicting global well-being, underscoring the importance of robust validation prior to their implementation across these areas.
Chinese: 大型语言模型在预测主观幸福感方面展现出潜力,但对数据代表性不足的群体存在系统性偏差,实际应用前需进行严格验证。
English: Large language models show promise in predicting subjective well-being but exhibit systematic biases against underrepresented populations, requiring robust validation before implementation.
Authors:Nelly Elsayed, Lily Dzamesi, Zag ElSayed, Murat Ozer
Abstract:
The Internet of Medical Things (IoMT) represents a paradigm shift in the healthcare sector, enabling the interconnection of medical devices, sensors, and systems to enhance patient monitoring, diagnosis, and management. The rapid evolution of IoMT presents significant benefits to the healthcare domains. However, there is a rapid increase in distributed denial of service (DDoS) attacks on the IoMT networks due to several vulnerabilities in the IoMT-connected devices, which negatively impact patients' health and can even lead to deaths. Thus, in this paper, we aim to save lives via investigating an extreme learning machine for detecting DDoS attacks on IoMT devices. The proposed approach achieves a high accuracy at a low implementation budget. Thus, it can reduce the implementation cost of the DDoS detection system, making the model capable of executing on the fog level.
中文: 医疗物联网(IoMT)通过设备互联提升医疗水平,但面临分布式拒绝服务攻击增多的威胁,本研究采用极限学习机开发低成本、高精度的攻击检测系统,以加强安全保障并挽救生命。
English: The Internet of Medical Things (IoMT) improves healthcare through interconnected devices but faces increasing DDoS attacks, prompting this study to develop a cost-effective, high-accuracy detection system using extreme learning machines to enhance security and save lives.
Authors:Junru Wu, Le Yan, Zhen Qin, Honglei Zhuang, Paul Suganthan G. C., Tianqi Liu, Zhe Dong, Xuanhui Wang, Harrie Oosterhuis
Abstract:
While Pairwise Ranking Prompting (PRP) with Large Language Models (LLMs) is one of the most effective zero-shot document ranking methods, it has a quadratic computational complexity with respect to the number of documents to be ranked, as it requires an enumeration over all possible document pairs. Consequently, the outstanding ranking performance of PRP has remained unreachable for most real-world ranking applications.
In this work, we propose to harness the effectiveness of PRP through pairwise distillation. Specifically, we distill a pointwise student ranker from pairwise teacher labels generated by PRP, resulting in an efficient student model that retains the performance of PRP with substantially lower computational costs. Furthermore, we find that the distillation process can be made sample-efficient: with only 2% of pairs, we are able to obtain the same performance as using all pairs for teacher labels. Thus, our novel approach provides a solution to harness the ranking performance of PRP without incurring high computational costs during both distillation and serving.
Chinese: 本研究提出了一种通过成对蒸馏的方法,将PRP(成对排序提示)的有效性转化为高效的点式学生排序器,在保持其优异排序性能的同时大幅降低了计算成本。
English: Pairwise Ranking Prompting (PRP) is a powerful but computationally expensive zero-shot document ranking method, and this work introduces a pairwise distillation approach to create an efficient pointwise student ranker that retains PRP's performance with significantly lower costs.
Authors:Jizhou Han, Shaokun Wang, Yuhang He, Chenhao Ding, Qiang Wang, Xinyuan Gao, SongLin Dong, Yihong Gong
Abstract:
Generalized Category Discovery (GCD) focuses on classifying known categories while simultaneously discovering novel categories from unlabeled data. However, previous GCD methods face challenges due to inconsistent optimization objectives and category confusion. This leads to feature overlap and ultimately hinders performance on novel categories. To address these issues, we propose the Neural Collapse-inspired Generalized Category Discovery (NC-GCD) framework. By pre-assigning and fixing Equiangular Tight Frame (ETF) prototypes, our method ensures an optimal geometric structure and a consistent optimization objective for both known and novel categories. We introduce a Consistent ETF Alignment Loss that unifies supervised and unsupervised ETF alignment and enhances category separability. Additionally, a Semantic Consistency Matcher (SCM) is designed to maintain stable and consistent label assignments across clustering iterations. Our method achieves strong performance on multiple GCD benchmarks, significantly enhancing novel category accuracy and demonstrating its effectiveness.
中文: 提出的NC-GCD框架通过固定ETF原型和一致性ETF对齐损失,解决了广义类别发现中的优化不一致问题,显著提升了新类别的识别性能。
English: The proposed NC-GCD framework addresses challenges in Generalized Category Discovery by utilizing fixed ETF prototypes and a Consistent ETF Alignment Loss to enhance category separability and achieve superior performance on novel categories.
Authors:Mingdong Wu, Lehong Wu, Yizhuo Wu, Weiyao Huang, Hongwei Fan, Zheyuan Hu, Haoran Geng, Jinzhou Li, Jiahe Ying, Long Yang, Yuanpei Chen, Hao Dong
Abstract:
Autonomous learning of dexterous, long-horizon robotic skills has been a longstanding pursuit of embodied AI. Recent advances in robotic reinforcement learning (RL) have demonstrated remarkable performance and robustness in real-world visuomotor control tasks. However, applying RL in the real world faces challenges such as low sample efficiency, slow exploration, and significant reliance on human intervention. In contrast, simulators offer a safe and efficient environment for extensive exploration and data collection, while the visual sim-to-real gap, often a limiting factor, can be mitigated using real-to-sim techniques. Building on these, we propose SimLauncher, a novel framework that combines the strengths of real-world RL and real-to-sim-to-real approaches to overcome these challenges. Specifically, we first pre-train a visuomotor policy in the digital twin simulation environment, which then benefits real-world RL in two ways: (1) bootstrapping target values using extensive simulated demonstrations and real-world demonstrations derived from pre-trained policy rollouts, and (2) Incorporating action proposals from the pre-trained policy for better exploration. We conduct comprehensive experiments across multi-stage, contact-rich, and dexterous hand manipulation tasks. Compared to prior real-world RL approaches, SimLauncher significantly improves sample efficiency and achieves near-perfect success rates. We hope this work serves as a proof of concept and inspires further research on leveraging large-scale simulation pre-training to benefit real-world robotic RL.
中文摘要:SimLauncher是一种创新框架,通过在仿真环境中预训练策略并应用于现实世界,显著提升了机器人强化学习的样本效率和探索能力,在复杂任务中实现了接近完美的成功率。
English Summary: SimLauncher is a novel framework that enhances real-world robotic reinforcement learning by pre-training policies in simulation and using them to improve sample efficiency and exploration, achieving near-perfect success rates in complex tasks.
Authors:Jiaming Zhang, Yuyuan Li, Yiqun Xu, Li Zhang, Xiaohua Feng, Zhifei Ren, Chaochao Chen
Abstract:
Large Language Model-enhanced Recommender Systems (LLM-enhanced RSs) have emerged as a powerful approach to improving recommendation quality by leveraging LLMs to generate item representations. Despite these advancements, the integration of LLMs raises severe fairness concerns. Existing studies reveal that LLM-based RSs exhibit greater unfairness than traditional RSs, yet fairness issues in LLM-enhanced RSs remain largely unexplored. In this paper, our empirical study reveals that while LLM-enhanced RSs improve fairness across item groups, a significant fairness gap persists. Further enhancement remains challenging due to the architectural differences and varying sources of unfairness inherent in LLM-enhanced RSs. To bridge this gap, we first decompose unfairness into i) \textit{prior unfairness} in LLM-generated representations and ii) \textit{training unfairness} in recommendation models. Then, we propose BiFair, a bi-level optimization-based fairness-aware training framework designed to mitigate both prior and training unfairness simultaneously. BiFair optimizes two sets of learnable parameters: LLM-generated representations and a trainable projector in the recommendation model, using a two-level nested optimization process. Additionally, we introduce an adaptive inter-group balancing mechanism, leveraging multi-objective optimization principles to dynamically balance fairness across item groups. Extensive experiments on three real-world datasets demonstrate that BiFair significantly mitigates unfairness and outperforms previous state-of-the-art methods.
中文: 大语言模型增强的推荐系统虽提升了推荐质量,却引发了公平性问题;本文提出的BiFair框架通过双层优化同时缓解先验与训练不公平,显著优于现有方法。
English: Large Language Model-enhanced Recommender Systems improve recommendation quality but introduce fairness concerns, which the proposed BiFair framework effectively addresses by mitigating both prior and training unfairness through bi-level optimization.
Authors:Jinwei Hu, Yi Dong, Zhengtao Ding, Xiaowei Huang
Abstract:
This paper presents a defense framework for enhancing the safety of large language model (LLM) empowered multi-agent systems (MAS) in safety-critical domains such as aerospace. We apply randomized smoothing, a statistical robustness certification technique, to the MAS consensus context, enabling probabilistic guarantees on agent decisions under adversarial influence. Unlike traditional verification methods, our approach operates in black-box settings and employs a two-stage adaptive sampling mechanism to balance robustness and computational efficiency. Simulation results demonstrate that our method effectively prevents the propagation of adversarial behaviors and hallucinations while maintaining consensus performance. This work provides a practical and scalable path toward safe deployment of LLM-based MAS in real-world, high-stakes environments.
中文: 本文提出了一种基于随机平滑的防御框架,为关键领域的大型语言模型多智能体系统提供概率安全保证,通过自适应采样方法在保持共识的同时有效防止对抗行为传播。
English: This paper introduces a randomized smoothing-based defense framework that provides probabilistic safety guarantees for large language model-powered multi-agent systems in critical domains, effectively preventing adversarial propagation while maintaining consensus through an adaptive sampling approach.
Authors:Jinwei Hu, Zezhi Tang, Xin Jin, Benyuan Zhang, Yi Dong, Xiaowei Huang
Abstract:
This paper presents HERO (Hierarchical Testing with Rabbit Optimization), a novel black-box adversarial testing framework for evaluating the robustness of deep learning-based Prognostics and Health Management systems in Industrial Cyber-Physical Systems. Leveraging Artificial Rabbit Optimization, HERO generates physically constrained adversarial examples that align with real-world data distributions via global and local perspective. Its generalizability ensures applicability across diverse ICPS scenarios. This study specifically focuses on the Proton Exchange Membrane Fuel Cell system, chosen for its highly dynamic operational conditions, complex degradation mechanisms, and increasing integration into ICPS as a sustainable and efficient energy solution. Experimental results highlight HERO's ability to uncover vulnerabilities in even state-of-the-art PHM models, underscoring the critical need for enhanced robustness in real-world applications. By addressing these challenges, HERO demonstrates its potential to advance more resilient PHM systems across a wide range of ICPS domains.
中文: HERO是一种新颖的黑盒对抗测试框架,利用人工兔优化算法生成符合物理约束的对抗样本,用于评估工业信息物理系统中基于深度学习的预测与健康管理系统的鲁棒性。
English: HERO is a novel black-box adversarial testing framework that uses Artificial Rabbit Optimization to evaluate the robustness of deep learning-based Prognostics and Health Management systems in Industrial Cyber-Physical Systems by generating physically constrained adversarial examples.
Authors:Yi Li, Xiaoxiong Wang, Jiawei Wang, Yi Chang, Kai Cao, Luxin Yan
Abstract:
While image dehazing has advanced substantially in the past decade, most efforts have focused on short-range scenarios, leaving long-range haze removal under-explored. As distance increases, intensified scattering leads to severe haze and signal loss, making it impractical to recover distant details solely from visible images. Near-infrared, with superior fog penetration, offers critical complementary cues through multimodal fusion. However, existing methods focus on content integration while often neglecting haze embedded in visible images, leading to results with residual haze. In this work, we argue that the infrared and visible modalities not only provide complementary low-level visual features, but also share high-level semantic consistency. Motivated by this, we propose a Hierarchical Semantic-Visual Fusion (HSVF) framework, comprising a semantic stream to reconstruct haze-free scenes and a visual stream to incorporate structural details from the near-infrared modality. The semantic stream first acquires haze-robust semantic prediction by aligning modality-invariant intrinsic representations. Then the shared semantics act as strong priors to restore clear and high-contrast distant scenes under severe haze degradation. In parallel, the visual stream focuses on recovering lost structural details from near-infrared by fusing complementary cues from both visible and near-infrared images. Through the cooperation of dual streams, HSVF produces results that exhibit both high-contrast scenes and rich texture details. Moreover, we introduce a novel pixel-aligned visible-infrared haze dataset with semantic labels to facilitate benchmarking. Extensive experiments demonstrate the superiority of our method over state-of-the-art approaches in real-world long-range haze removal.
中文: 本文提出了一种分层语义-视觉融合框架,通过结合近红外与可见光图像的语义一致性和结构细节,有效消除远距离场景中的雾霾,性能优于现有方法。
English: This paper introduces a Hierarchical Semantic-Visual Fusion framework that leverages near-infrared and visible images to effectively remove haze from long-range scenes by combining semantic consistency with structural details, outperforming existing methods.
Authors:Kaiyi Zhang, Ang Lv, Jinpeng Li, Yongbo Wang, Feng Wang, Haoyuan Hu, Rui Yan
Abstract:
Reinforcement learning with verifiable rewards (RLVR) is a promising approach for improving the complex reasoning abilities of large language models (LLMs). However, current RLVR methods face two significant challenges: the near-miss reward problem, where a small mistake can invalidate an otherwise correct reasoning process, greatly hindering training efficiency; and exploration stagnation, where models tend to focus on solutions within their ``comfort zone,'' lacking the motivation to explore potentially more effective alternatives. To address these challenges, we propose StepHint, a novel RLVR algorithm that utilizes multi-level stepwise hints to help models explore the solution space more effectively. StepHint generates valid reasoning chains from stronger models and partitions these chains into reasoning steps using our proposed adaptive partitioning method. The initial few steps are used as hints, and simultaneously, multiple-level hints (each comprising a different number of steps) are provided to the model. This approach directs the model's exploration toward a promising solution subspace while preserving its flexibility for independent exploration. By providing hints, StepHint mitigates the near-miss reward problem, thereby improving training efficiency. Additionally, the external reasoning pathways help the model develop better reasoning abilities, enabling it to move beyond its ``comfort zone'' and mitigate exploration stagnation. StepHint outperforms competitive RLVR enhancement methods across six mathematical benchmarks, while also demonstrating superior generalization and excelling over baselines on out-of-domain benchmarks.
中文摘要:StepHint是一种新颖的RLVR算法,通过多层级逐步提示解决大型语言模型中的近似奖励问题和探索停滞,在数学基准测试中表现优异,同时提升了训练效率和泛化能力。
English Summary: StepHint is a novel RLVR algorithm that uses multi-level stepwise hints to address the near-miss reward problem and exploration stagnation in LLMs, demonstrating superior performance across mathematical benchmarks while improving training efficiency and generalization.
Authors:Chenhao Xue, Kezhi Li, Jiaxing Zhang, Yi Ren, Zhengyuan Shi, Chen Zhang, Yibo Lin, Lining Zhang, Qiang Xu, Guangyu Sun
Abstract:
Arithmetic circuits, such as adders and multipliers, are fundamental components of digital systems, directly impacting the performance, power efficiency, and area footprint. However, optimizing these circuits remains challenging due to the vast design space and complex physical constraints. While recent deep learning-based approaches have shown promise, they struggle to consistently explore high-potential design variants, limiting their optimization efficiency. To address this challenge, we propose AC-Refiner, a novel arithmetic circuit optimization framework leveraging conditional diffusion models. Our key insight is to reframe arithmetic circuit synthesis as a conditional image generation task. By carefully conditioning the denoising diffusion process on target quality-of-results (QoRs), AC-Refiner consistently produces high-quality circuit designs. Furthermore, the explored designs are used to fine-tune the diffusion model, which focuses the exploration near the Pareto frontier. Experimental results demonstrate that AC-Refiner generates designs with superior Pareto optimality, outperforming state-of-the-art baselines. The performance gain is further validated by integrating AC-Refiner into practical applications.
中文: AC-Refiner是一种新型算术电路优化框架,通过条件扩散模型将电路合成重构为条件图像生成任务,能持续产生具有优越帕累托最优性的高质量设计。
English: AC-Refiner is a novel arithmetic circuit optimization framework that leverages conditional diffusion models to reframe circuit synthesis as a conditional image generation task, consistently producing high-quality designs with superior Pareto optimality.
Authors:Dongyi Ding, Tiannan Wang, Chenghao Zhu, Meiling Tao, Yuchen Eleanor Jiang, Wangchunshu Zhou
Abstract:
Large language models (LLMs) excel at reasoning tasks requiring long thought sequences for planning, reflection, and refinement. However, their substantial model size and high computational demands are impractical for widespread deployment. Yet, small language models (SLMs) often struggle to learn long-form CoT reasoning due to their limited capacity, a phenomenon we refer to as the "SLMs Learnability Gap". To address this, we introduce \textbf{Mi}d-\textbf{Co}T \textbf{T}eacher \textbf{A}ssistant Distillation (MiCoTAl), a framework for improving long CoT distillation for SLMs. MiCoTA employs intermediate-sized models as teacher assistants and utilizes intermediate-length CoT sequences to bridge both the capacity and reasoning length gaps. Our experiments on downstream tasks demonstrate that although SLMs distilled from large teachers can perform poorly, by applying MiCoTA, they achieve significant improvements in reasoning performance. Specifically, Qwen2.5-7B-Instruct and Qwen2.5-3B-Instruct achieve an improvement of 3.47 and 3.93 respectively on average score on AIME2024, AMC, Olympiad, MATH-500 and GSM8K benchmarks. To better understand the mechanism behind MiCoTA, we perform a quantitative experiment demonstrating that our method produces data more closely aligned with base SLM distributions. Our insights pave the way for future research into long-CoT data distillation for SLMs.
中文:MiCoTA是一种利用中等模型和推理序列弥合能力差距的蒸馏框架,使小型语言模型在复杂任务上实现显著的推理性能提升。
English: MiCoTA is a distillation framework that uses intermediate-sized models and reasoning sequences to bridge the capacity gap, enabling small language models to achieve significant reasoning improvements on complex tasks.
Authors:Nora Gourmelon, Marcel Dreier, Martin Mayr, Thorsten Seehaus, Dakota Pyles, Matthias Braun, Andreas Maier, Vincent Christlein
Abstract:
Glaciers are losing ice mass at unprecedented rates, increasing the need for accurate, year-round monitoring to understand frontal ablation, particularly the factors driving the calving process. Deep learning models can extract calving front positions from Synthetic Aperture Radar imagery to track seasonal ice losses at the calving fronts of marine- and lake-terminating glaciers. The current state-of-the-art model relies on ImageNet-pretrained weights. However, they are suboptimal due to the domain shift between the natural images in ImageNet and the specialized characteristics of remote sensing imagery, in particular for Synthetic Aperture Radar imagery. To address this challenge, we propose two novel self-supervised multimodal pretraining techniques that leverage SSL4SAR, a new unlabeled dataset comprising 9,563 Sentinel-1 and 14 Sentinel-2 images of Arctic glaciers, with one optical image per glacier in the dataset. Additionally, we introduce a novel hybrid model architecture that combines a Swin Transformer encoder with a residual Convolutional Neural Network (CNN) decoder. When pretrained on SSL4SAR, this model achieves a mean distance error of 293 m on the "CAlving Fronts and where to Find thEm" (CaFFe) benchmark dataset, outperforming the prior best model by 67 m. Evaluating an ensemble of the proposed model on a multi-annotator study of the benchmark dataset reveals a mean distance error of 75 m, approaching the human performance of 38 m. This advancement enables precise monitoring of seasonal changes in glacier calving fronts.
中文: 通过自监督预训练和混合Swin Transformer-CNN架构,深度学习模型在冰川崩解前沿监测方面取得重大突破,将平均距离误差降至293米,接近人类标注水平。
English: Deep learning models using self-supervised pretraining on specialized SAR imagery and a hybrid Swin Transformer-CNN architecture significantly improve glacier calving front monitoring, reducing mean distance error to 293 meters and approaching human-level accuracy.
Authors:Zheyu Shen, Yexiao He, Ziyao Wang, Yuning Zhang, Guoheng Sun, Wanghao Ye, Ang Li
Abstract:
Large Language Models (LLMs) have gained significant attention due to their versatility across a wide array of applications. Fine-tuning LLMs with parameter-efficient adapters, such as Low-Rank Adaptation (LoRA), enables these models to efficiently adapt to downstream tasks without extensive retraining. Deploying fine-tuned LLMs on multi-tenant edge devices offers substantial benefits, such as reduced latency, enhanced privacy, and personalized responses. However, serving LLMs efficiently on resource-constrained edge devices presents critical challenges, including the complexity of adapter selection for different tasks and memory overhead from frequent adapter swapping. Moreover, given the multiple requests in multi-tenant settings, processing requests sequentially results in underutilization of computational resources and increased latency. This paper introduces EdgeLoRA, an efficient system for serving LLMs on edge devices in multi-tenant environments. EdgeLoRA incorporates three key innovations: (1) an adaptive adapter selection mechanism to streamline the adapter configuration process; (2) heterogeneous memory management, leveraging intelligent adapter caching and pooling to mitigate memory operation overhead; and (3) batch LoRA inference, enabling efficient batch processing to significantly reduce computational latency. Comprehensive evaluations using the Llama3.1-8B model demonstrate that EdgeLoRA significantly outperforms the status quo (i.e., llama.cpp) in terms of both latency and throughput. The results demonstrate that EdgeLoRA can achieve up to a 4 times boost in throughput. Even more impressively, it can serve several orders of magnitude more adapters simultaneously. These results highlight EdgeLoRA's potential to transform edge deployment of LLMs in multi-tenant scenarios, offering a scalable and efficient solution for resource-constrained environments.
中文: EdgeLoRA通过自适应适配器选择、异构内存管理和批量LoRA推理三大创新,在多租户边缘设备上高效部署大语言模型,实现了吞吐量提升高达4倍,并能同时处理数量级更多的适配器。
English: EdgeLoRA introduces an efficient system for serving LLMs on edge devices by incorporating adaptive adapter selection, heterogeneous memory management, and batch LoRA inference, achieving up to 4 times higher throughput and enabling simultaneous handling of significantly more adapters in multi-tenant environments.
Authors:Enzhi Zhou, Yue Xiao, Ziyue Liu, Sotiris A. Tegos, Panagiotis D. Diamantoulakis, George K. Karagiannidis
Abstract:
With the rapid development of aerial infrastructure, unmanned aerial vehicles (UAVs) that function as aerial base stations (ABSs) extend terrestrial network services into the sky, enabling on-demand connectivity and enhancing emergency communication capabilities in cellular networks by leveraging the flexibility and mobility of UAVs. In such a UAV-assisted network, this paper investigates position-based beamforming between ABSs and ground users (GUs). To mitigate inter-cell interference, we propose a novel fluid aerial network that leverages ABS rotation to increase multi-cell capacity and overall network efficiency. Specifically, considering the line-of-sight channel model, the spatial beamforming weights are determined by the orientation angles of the GUs. In this direction, we examine the beamforming gain of a two-dimensional multiple-input multiple-output (MIMO) array at various ground positions, revealing that ABS rotation significantly affects multi-user channel correlation and inter-cell interference. Based on these findings, we propose an alternative low-complexity algorithm to design the optimal rotation angle for ABSs, aiming to reduce inter-cell interference and thus maximize the sum rate of multi-cell systems. In simulations, exhaustive search serves as a benchmark to validate the optimization performance of the proposed sequential ABS rotation scheme. Moreover, simulation results demonstrate that, in interference-limited regions, the proposed ABS rotation paradigm can significantly reduce inter-cell interference in terrestrial networks and improve the multi-cell sum rate by approximately 10\% compared to fixed-direction ABSs without rotation.
中文摘要:本文提出了一种利用旋转无人机空中基站构建的流体空中网络,通过优化波束成形来降低小区间干扰,在干扰受限场景下将多小区总速率提升约10%,显著提高了网络容量和效率。
English Summary: This paper introduces a fluid aerial network using rotating UAV-based aerial base stations to optimize beamforming and reduce inter-cell interference, significantly enhancing multi-cell capacity and network efficiency by approximately 10% in interference-limited scenarios.
Authors:Qingqiu Li, Runtian Yuan, Junlin Hou, Jilan Xu, Yuejie Zhang, Rui Feng, Hao Chen
Abstract:
To enable more accurate diagnosis of lung disease in chest CT scans, we propose a straightforward yet effective model. Firstly, we analyze the characteristics of 3D CT scans and remove non-lung regions, which helps the model focus on lesion-related areas and reduces computational cost. We adopt ResNeSt50 as a strong feature extractor, and use a weighted cross-entropy loss to mitigate class imbalance, especially for the underrepresented squamous cell carcinoma category. Our model achieves a Macro F1 Score of 0.80 on the validation set of the Fair Disease Diagnosis Challenge, demonstrating its strong performance in distinguishing between different lung conditions.
中文: 我们提出了一种高效的肺部疾病诊断模型,通过去除非肺区域并采用ResNeSt50和加权损失解决类别不平衡问题,在验证集上取得了0.80的宏平均F1分数。
English: We propose an efficient model for lung disease diagnosis in CT scans by removing non-lung regions and using ResNeSt50 with weighted loss to address class imbalance, achieving a 0.80 Macro F1 Score.
Authors:JianChao Zhao, Chenhao Ding, Songlin Dong, Yuhang He, Yihong Gong
Abstract:
Continual Test-Time Adaptation (CTTA) aims to enable models to adapt on-the-fly to a stream of unlabeled data under evolving distribution shifts. However, existing CTTA methods typically rely on shared model parameters across all domains, making them vulnerable to feature entanglement and catastrophic forgetting in the presence of large or non-stationary domain shifts. To address this limitation, we propose ExPaMoE, a novel framework based on an Expandable Parallel Mixture-of-Experts architecture. ExPaMoE decouples domain-general and domain-specific knowledge via a dual-branch expert design with token-guided feature separation, and dynamically expands its expert pool based on a Spectral-Aware Online Domain Discriminator (SODD) that detects distribution changes in real-time using frequency-domain cues. Extensive experiments demonstrate the superiority of ExPaMoE across diverse CTTA scenarios. We evaluate our method on standard benchmarks including CIFAR-10C, CIFAR-100C, ImageNet-C, and Cityscapes-to-ACDC for semantic segmentation. Additionally, we introduce ImageNet++, a large-scale and realistic CTTA benchmark built from multiple ImageNet-derived datasets, to better reflect long-term adaptation under complex domain evolution. ExPaMoE consistently outperforms prior arts, showing strong robustness, scalability, and resistance to forgetting.
中文: ExPaMoE提出了一种可扩展的并行专家混合框架,通过动态分离领域知识和实时检测分布变化,在持续测试时适应的多个基准测试中展现出卓越性能。
English: ExPaMoE introduces an expandable mixture-of-experts framework that dynamically separates domain knowledge and detects real-time distribution shifts, demonstrating superior performance across multiple benchmarks in continual test-time adaptation.
Authors:Jizhou Han, Chenhao Ding, SongLin Dong, Yuhang He, Xinyuan Gao, Yihong Gong
Abstract:
Visual-language models (VLMs) like CLIP exhibit strong generalization but struggle with distribution shifts at test time. Existing training-free test-time adaptation (TTA) methods operate strictly within CLIP's original feature space, relying on high-confidence samples while overlooking the potential of low-confidence ones. We propose MS-TTA, a training-free approach that enhances feature representations beyond CLIP's space using a single-step k-nearest neighbors (kNN) Mean-Shift. By refining all test samples, MS-TTA improves feature compactness and class separability, leading to more stable adaptation. Additionally, a cache of refined embeddings further enhances inference by providing Mean Shift enhanced logits. Extensive evaluations on OOD and cross-dataset benchmarks demonstrate that MS-TTA consistently outperforms state-of-the-art training-free TTA methods, achieving robust adaptation without requiring additional training.
中文: MS-TTA是一种无需训练即可增强CLIP特征表示的测试时适应方法,通过k近邻均值漂移技术提升模型对分布变化的鲁棒性,无需额外训练。
English: MS-TTA is a training-free test-time adaptation method that enhances CLIP's feature representations using kNN Mean-Shift, improving robustness against distribution shifts without additional training.
Authors:Miaosen Zhang, Ziqiang Xu, Jialiang Zhu, Qi Dai, Kai Qiu, Yifan Yang, Chong Luo, Tianyi Chen, Justin Wagle, Tim Franklin, Baining Guo
Abstract:
With the development of multimodal reasoning models, Computer Use Agents (CUAs), akin to Jarvis from \textit{"Iron Man"}, are becoming a reality. GUI grounding is a core component for CUAs to execute actual actions, similar to mechanical control in robotics, and it directly leads to the success or failure of the system. It determines actions such as clicking and typing, as well as related parameters like the coordinates for clicks. Current end-to-end grounding models still achieve less than 65\% accuracy on challenging benchmarks like ScreenSpot-pro and UI-Vision, indicating they are far from being ready for deployment. % , as a single misclick can result in unacceptable consequences. In this work, we conduct an empirical study on the training of grounding models, examining details from data collection to model training. Ultimately, we developed the \textbf{Phi-Ground} model family, which achieves state-of-the-art performance across all five grounding benchmarks for models under $10B$ parameters in agent settings. In the end-to-end model setting, our model still achieves SOTA results with scores of \textit{\textbf{43.2}} on ScreenSpot-pro and \textit{\textbf{27.2}} on UI-Vision. We believe that the various details discussed in this paper, along with our successes and failures, not only clarify the construction of grounding models but also benefit other perception tasks. Project homepage: \href{https://zhangmiaosen2000.github.io/Phi-Ground/}{https://zhangmiaosen2000.github.io/Phi-Ground/}
中文: Phi-Ground模型系列在计算机使用代理的GUI基础任务中取得了最先进的性能,解决了当前模型在准确执行操作方面的关键需求。
English: The Phi-Ground model family achieves state-of-the-art performance in GUI grounding for computer use agents, addressing the critical need for accurate action execution despite current models' limitations.
Authors:Shaofei Cai, Zhancun Mu, Haiwen Xia, Bowei Zhang, Anji Liu, Yitao Liang
Abstract:
While Reinforcement Learning (RL) has achieved remarkable success in language modeling, its triumph hasn't yet fully translated to visuomotor agents. A primary challenge in RL models is their tendency to overfit specific tasks or environments, thereby hindering the acquisition of generalizable behaviors across diverse settings. This paper provides a preliminary answer to this challenge by demonstrating that RL-finetuned visuomotor agents in Minecraft can achieve zero-shot generalization to unseen worlds. Specifically, we explore RL's potential to enhance generalizable spatial reasoning and interaction capabilities in 3D worlds. To address challenges in multi-task RL representation, we analyze and establish cross-view goal specification as a unified multi-task goal space for visuomotor policies. Furthermore, to overcome the significant bottleneck of manual task design, we propose automated task synthesis within the highly customizable Minecraft environment for large-scale multi-task RL training, and we construct an efficient distributed RL framework to support this. Experimental results show RL significantly boosts interaction success rates by $4\times$ and enables zero-shot generalization of spatial reasoning across diverse environments, including real-world settings. Our findings underscore the immense potential of RL training in 3D simulated environments, especially those amenable to large-scale task generation, for significantly advancing visuomotor agents' spatial reasoning.
中文摘要:在《我的世界》中通过强化学习微调的视觉运动智能体实现了对未知环境的零样本泛化,无需人工设计任务即可显著提升交互成功率和空间推理能力。
English Summary: Reinforcement Learning fine-tuned visuomotor agents in Minecraft demonstrate zero-shot generalization to new environments, significantly improving interaction success rates and spatial reasoning capabilities without manual task design.
Authors:Xiaoyu Chen, Hangxing Wei, Pushi Zhang, Chuheng Zhang, Kaixin Wang, Yanjiang Guo, Rushuai Yang, Yucen Wang, Xinquan Xiao, Li Zhao, Jianyu Chen, Jiang Bian
Abstract:
Vision-Language-Action (VLA) models have emerged as a popular paradigm for learning robot manipulation policies that can follow language instructions and generalize to novel scenarios. Recent works have begun to explore the incorporation of latent actions, abstract representations of motion between two frames, into VLA pre-training. In this paper, we introduce villa-X, a novel Vision-Language-Latent-Action (ViLLA) framework that advances latent action modeling for learning generalizable robot manipulation policies. Our approach improves both how latent actions are learned and how they are incorporated into VLA pre-training. We demonstrate that villa-X can generate latent action plans in a zero-shot fashion, even for unseen embodiments and open-vocabulary symbolic understanding. This capability enables villa-X to achieve superior performance across diverse simulation tasks in SIMPLER and on two real-world robotic setups involving both gripper and dexterous hand manipulation. These results establish villa-X as a principled and scalable paradigm for learning generalizable robot manipulation policies. We believe it provides a strong foundation for future research.
中文: 本文提出villa-X,一种视觉-语言-潜在动作框架,通过改进潜在动作建模来学习通用机器人操作策略,在仿真和真实任务中展现出卓越的零样本性能。
English: The paper introduces villa-X, a Vision-Language-Latent-Action framework that enhances latent action modeling to develop generalizable robot manipulation policies, demonstrating superior zero-shot performance in simulations and real-world tasks.
Authors:Yucheng Xie, Fu Feng, Ruixiao Shi, Jing Wang, Yong Rui, Xin Geng
Abstract:
Diffusion models have advanced from text-to-image (T2I) to image-to-image (I2I) generation by incorporating structured inputs such as depth maps, enabling fine-grained spatial control. However, existing methods either train separate models for each condition or rely on unified architectures with entangled representations, resulting in poor generalization and high adaptation costs for novel conditions. To this end, we propose DivControl, a decomposable pretraining framework for unified controllable generation and efficient adaptation. DivControl factorizes ControlNet via SVD into basic components-pairs of singular vectors-which are disentangled into condition-agnostic learngenes and condition-specific tailors through knowledge diversion during multi-condition training. Knowledge diversion is implemented via a dynamic gate that performs soft routing over tailors based on the semantics of condition instructions, enabling zero-shot generalization and parameter-efficient adaptation to novel conditions. To further improve condition fidelity and training efficiency, we introduce a representation alignment loss that aligns condition embeddings with early diffusion features. Extensive experiments demonstrate that DivControl achieves state-of-the-art controllability with 36.4$\times$ less training cost, while simultaneously improving average performance on basic conditions. It also delivers strong zero-shot and few-shot performance on unseen conditions, demonstrating superior scalability, modularity, and transferability.
中文: DivControl提出了一种可分解的预训练框架,通过奇异值分解和知识分流将条件无关与条件特定组件解耦,以极低的训练成本实现了卓越的可控性,并对新条件表现出强大的泛化能力。
English: DivControl introduces a decomposable pretraining framework that disentangles condition-agnostic and condition-specific components via SVD and knowledge diversion, achieving superior controllability with drastically reduced training costs and strong generalization to novel conditions.
Authors:Yung-Hsu Yang, Luigi Piccinelli, Mattia Segu, Siyuan Li, Rui Huang, Yuqian Fu, Marc Pollefeys, Hermann Blum, Zuria Bauer
Abstract:
Monocular 3D object detection is valuable for various applications such as robotics and AR/VR. Existing methods are confined to closed-set settings, where the training and testing sets consist of the same scenes and/or object categories. However, real-world applications often introduce new environments and novel object categories, posing a challenge to these methods. In this paper, we address monocular 3D object detection in an open-set setting and introduce the first end-to-end 3D Monocular Open-set Object Detector (3D-MOOD). We propose to lift the open-set 2D detection into 3D space through our designed 3D bounding box head, enabling end-to-end joint training for both 2D and 3D tasks to yield better overall performance. We condition the object queries with geometry prior and overcome the generalization for 3D estimation across diverse scenes. To further improve performance, we design the canonical image space for more efficient cross-dataset training. We evaluate 3D-MOOD on both closed-set settings (Omni3D) and open-set settings (Omni3D to Argoverse 2, ScanNet), and achieve new state-of-the-art results. Code and models are available at royyang0714.github.io/3D-MOOD.
Chinese: 本文提出首个端到端单目3D开放集目标检测器3D-MOOD,通过将2D检测提升至3D空间实现对新场景和物体类别的泛化,在封闭集和开放集评估中均取得了最优性能。
English: This paper introduces 3D-MOOD, the first end-to-end monocular 3D open-set object detector that generalizes to new environments and object categories by lifting 2D detections into 3D space and achieves state-of-the-art results in both closed-set and open-set evaluations.
Authors:Haowei Lin, Haotian Ye, Wenzheng Feng, Quzhe Huang, Yujun Li, Hubert Lim, Zhengrui Li, Xiangyu Wang, Jianzhu Ma, James Zou, Yitao Liang
Abstract:
Discovering scaling laws for predicting model performance at scale is a fundamental and open-ended challenge, mostly reliant on slow, case specific human experimentation. To investigate the potential for LLMs to automate this process, we collect over 5,000 experiments from existing literature and curate seven diverse scaling law discovery tasks. While existing agents struggle to produce accurate law formulas, this paper introduces SLDAgent, an evolution-based agent that co-optimize the scaling law model and the parameters, enabling it to autonomously explore complex relationships between variables. For the first time, we demonstrates that SLDAgent can automatically discover laws that exhibit consistently more accurate extrapolation than their established, human-derived counterparts across all tasks. Through comprehensive analysis, we elucidate why these discovered laws are superior and verify their practical utility in both pretraining and finetuning applications. This work establishes a new paradigm for agentic scientific discovery, showing that AI systems can understand their own scaling behavior, and can contribute novel and practical knowledge back to the research community.
中文: 本文提出的SLDAgent通过协同优化自主发现扩展定律,首次证明人工智能生成的定律在各项任务中均能超越人工推导的对应定律,在预测精度和实际应用方面展现出显著优势。
English: This paper introduces SLDAgent, an evolution-based agent that autonomously discovers scaling laws through co-optimization, demonstrating for the first time that AI-generated laws consistently outperform human-derived counterparts in accuracy and practical utility across diverse tasks.
Authors:Harsh Purohit, Tomoya Nishida, Kota Dohi, Takashi Endo, Yohei Kawaguchi
Abstract:
This paper proposes a method for generating machine-type-specific anomalies to evaluate the relative performance of unsupervised anomalous sound detection (UASD) systems across different machine types, even in the absence of real anomaly sound data. Conventional keyword-based data augmentation methods often produce unrealistic sounds due to their reliance on manually defined labels, limiting scalability as machine types and anomaly patterns diversify. Advanced audio generative models, such as MIMII-Gen, show promise but typically depend on anomalous training data, making them less effective when diverse anomalous examples are unavailable. To address these limitations, we propose a novel synthesis approach leveraging large language models (LLMs) to interpret textual descriptions of faults and automatically select audio transformation functions, converting normal machine sounds into diverse and plausible anomalous sounds. We validate this approach by evaluating a UASD system trained only on normal sounds from five machine types, using both real and synthetic anomaly data. Experimental results reveal consistent trends in relative detection difficulty across machine types between synthetic and real anomalies. This finding supports our hypothesis and highlights the effectiveness of the proposed LLM-based synthesis approach for relative evaluation of UASD systems.
中文摘要:本文提出了一种利用大型语言模型根据文本故障描述生成逼真机器异常声音的新方法,可在缺乏真实异常数据的情况下,有效评估不同机型无监督异常声音检测系统的相对性能。
English Summary: This paper introduces a novel method using large language models to generate realistic anomalous machine sounds from textual fault descriptions, enabling effective relative performance evaluation of unsupervised anomalous sound detection systems across different machine types without requiring real anomaly data.
Authors:Xun Liang, Xin Guo, Zhongming Jin, Weihang Pan, Penghui Shang, Deng Cai, Binbin Lin, Jieping Ye
Abstract:
The spatial reasoning task aims to reason about the spatial relationships in 2D and 3D space, which is a fundamental capability for Visual Question Answering (VQA) and robotics. Although vision language models (VLMs) have developed rapidly in recent years, they are still struggling with the spatial reasoning task. In this paper, we introduce a method that can enhance Spatial reasoning through Visual and Textual thinking Simultaneously (SpatialVTS). In the spatial visual thinking phase, our model is trained to generate location-related specific tokens of essential targets automatically. Not only are the objects mentioned in the problem addressed, but also the potential objects related to the reasoning are considered. During the spatial textual thinking phase, Our model conducts long-term thinking based on visual cues and dialogues, gradually inferring the answers to spatial reasoning problems. To effectively support the model's training, we perform manual corrections to the existing spatial reasoning dataset, eliminating numerous incorrect labels resulting from automatic annotation, restructuring the data input format to enhance generalization ability, and developing thinking processes with logical reasoning details. Without introducing additional information (such as masks or depth), our model's overall average level in several spatial understanding tasks has significantly improved compared with other models.
Chinese Summary: 本文提出SpatialVTS方法,通过视觉与文本双重思维阶段同步增强视觉语言模型的空间推理能力,在不引入掩码或深度等额外信息的情况下,显著提升了多项空间理解任务的整体表现水平。
English Summary: This paper introduces SpatialVTS, a method that enhances spatial reasoning in vision language models through simultaneous visual and textual thinking phases, significantly improving performance on spatial understanding tasks without requiring additional data like masks or depth.
Authors:Maojun Zhang, Guangxu Zhu, Xiaoming Chen, Kaibin Huang, Zhaoyang Zhang
Abstract:
Inter-user interference remains a critical bottleneck in wireless communication systems, particularly in the emerging paradigm of semantic communication (SemCom). Compared to traditional systems, inter-user interference in SemCom severely degrades key semantic information, often causing worse performance than Gaussian noise under the same power level. To address this challenge, inspired by the recently proposed concept of Orthogonal Model Division Multiple Access (OMDMA) that leverages semantic orthogonality rooted in the personalized joint source and channel (JSCC) models to distinguish users, we propose a novel, scalable framework that eliminates the need for user-specific JSCC models as did in original OMDMA. Our key innovation lies in shuffle-based orthogonalization, where randomly permuting the positions of JSCC feature vectors transforms inter-user interference into Gaussian-like noise. By assigning each user a unique shuffling pattern, the interference is treated as channel noise, enabling effective mitigation using diffusion models (DMs). This approach not only simplifies system design by requiring a single universal JSCC model but also enhances privacy, as shuffling patterns act as implicit private keys. Additionally, we extend the framework to scenarios involving semantically correlated data. By grouping users based on semantic similarity, a cooperative beamforming strategy is introduced to exploit redundancy in correlated data, further improving system performance. Extensive simulations demonstrate that the proposed method outperforms state-of-the-art multi-user SemCom frameworks, achieving superior semantic fidelity, robustness to interference, and scalability-all without requiring additional training overhead.
中文摘要:该框架通过基于混洗的正交化方法,将语义通信中的用户间干扰转化为类高斯噪声,利用扩散模型有效抑制干扰,无需额外训练即可提升系统性能。
English Summary: The proposed framework eliminates inter-user interference in semantic communication by using shuffle-based orthogonalization, which transforms interference into Gaussian-like noise and allows mitigation with diffusion models, enhancing performance without extra training.
Authors:Maya Larbi, Amal Akli, Mike Papadakis, Rihab Bouyousfi, Maxime Cordy, Federica Sarro, Yves Le Traon
Abstract:
Large Language Models (LLMs) have demonstrated impressive performance in code generation tasks under idealized conditions, where task descriptions are clear and precise. However, in practice, task descriptions frequently exhibit ambiguity, incompleteness, or internal contradictions. In this paper, we present the first empirical study examining the robustness of state-of-the-art code generation models when faced with such unclear task descriptions. We extend the HumanEval and MBPP benchmarks by systematically introducing realistic task descriptions flaws through guided mutation strategies, producing a dataset that mirrors the messiness of informal developer instructions. We evaluate multiple LLMs of varying sizes and architectures, analyzing their functional correctness and failure modes across task descriptions categories. Our findings reveal that even minor imperfections in task description phrasing can cause significant performance degradation, with contradictory task descriptions resulting in numerous logical errors. Moreover, while larger models tend to be more resilient than smaller variants, they are not immune to the challenges posed by unclear requirements. We further analyze semantic error patterns and identify correlations between description clarity, model behavior, and error types. Our results underscore the critical need for developing LLMs that are not only powerful but also robust to the imperfections inherent in natural user tasks, highlighting important considerations for improving model training strategies, designing more realistic evaluation benchmarks, and ensuring reliable deployment in practical software development environments.
中文摘要:大型语言模型在面对模糊或不完整的任务描述时性能显著下降,研究表明即使最先进的模型也需要增强对自然语言缺陷的鲁棒性,这对实际软件开发部署至关重要。
English Summary: Large Language Models suffer significant performance drops when faced with ambiguous or contradictory task descriptions, revealing the critical need for developing more robust models despite larger models showing relative resilience.
Authors:Guozhu Meng, Zhixiu Guo, Xiaodong Zhang, Haoyu Wang, Kai Chen, Yang Liu
Abstract:
It is well known that antivirus engines are vulnerable to evasion techniques (e.g., obfuscation) that transform malware into its variants. However, it cannot be necessarily attributed to the effectiveness of these evasions, and the limits of engines may also make this unsatisfactory result. In this study, we propose a data-driven approach to measure the effect of app transformations to malware detection, and further explain why the detection result is produced by these engines. First, we develop an interaction model for antivirus engines, illustrating how they respond with different detection results in terms of varying inputs. Six app transformation techniques are implemented in order to generate a large number of Android apps with traceable changes. Then we undertake a one-month tracking of app detection results from multiple antivirus engines, through which we obtain over 971K detection reports from VirusTotal for 179K apps in total. Last, we conduct a comprehensive analysis of antivirus engines based on these reports from the perspectives of signature-based, static analysis-based, and dynamic analysis-based detection techniques. The results, together with 7 highlighted findings, identify a number of sealed working mechanisms occurring inside antivirus engines and what are the indicators of compromise in apps during malware detection.
Chinese: 本研究提出了一种数据驱动的方法来评估应用转换对恶意软件检测的影响,并通过分析超过97.1万份报告揭示了杀毒引擎的内部工作机制。
English: This study introduces a data-driven method to assess how app transformations impact malware detection, revealing the internal mechanisms of antivirus engines through extensive analysis of over 971,000 reports.
Authors:Jiawen Qi, Chang Gao, Zhaochun Ren, Qinyu Chen
Abstract:
Deploying Large Language Models (LLMs) on edge devices remains challenging due to their quadratically increasing computations with the sequence length. Existing studies for dynamic attention pruning are designed for hardware with massively parallel computation capabilities, such as GPUs or TPUs, and aim at long context lengths (e.g., 64K), making them unsuitable for edge scenarios. We present DeltaLLM, a training-free framework that exploits temporal sparsity in attention patterns to enable efficient LLM inference across both the prefilling and decoding stages, on resource-constrained edge devices. DeltaLLM introduces an accuracy- and memory-aware delta matrix construction strategy that introduces temporal sparsity, and a context-aware hybrid attention mechanism that combines full attention in a local context window with delta approximation outside it to increase accuracy. We evaluate our framework on the edge-device-friendly BitNet-b1.58-2B-4T model and Llama3.2-1B-Instruct model across diverse language tasks. The results show that on BitNet, our framework increases the attention sparsity from 0% to 60% during the prefilling stage with slight accuracy improvement on the WG task, and 0% to 57% across both the prefilling and decoding stages, with even higher F1 score from 29.63 to 30.97 on SQuAD-v2 task. On the Llama model, it can also achieve up to 60% sparsity during the prefilling stage and around 57% across both stages with negligible accuracy drop. These results demonstrate that DeltaLLM offers a promising solution for efficient edge deployment, requiring no fine-tuning and seamlessly integrating with existing inference pipelines.
中文: DeltaLLM是一种无需训练即可利用注意力模式中时间稀疏性的框架,可在资源受限的边缘设备上实现高效的大语言模型推理,达到高达60%的稀疏度且对精度影响极小。
English: DeltaLLM is a training-free framework that leverages temporal sparsity in attention patterns to enable efficient LLM inference on resource-constrained edge devices, achieving up to 60% sparsity with minimal accuracy impact.
Authors:Chandravardhan Singh Raghaw, Jasmer Singh Sanjotra, Mohammad Zia Ur Rehman, Shubhi Bansal, Shahid Shafi Dar, Nagendra Kumar
Abstract:
Precise and automated segmentation of the liver and its tumor within CT scans plays a pivotal role in swift diagnosis and the development of optimal treatment plans for individuals with liver diseases and malignancies. However, automated liver and tumor segmentation faces significant hurdles arising from the inherent heterogeneity of tumors and the diverse visual characteristics of livers across a broad spectrum of patients. Aiming to address these challenges, we present a novel Transformer-aware Multiscale Progressive Encoder-Decoder Network (T-MPEDNet) for automated segmentation of tumor and liver. T-MPEDNet leverages a deep adaptive features backbone through a progressive encoder-decoder structure, enhanced by skip connections for recalibrating channel-wise features while preserving spatial integrity. A Transformer-inspired dynamic attention mechanism captures long-range contextual relationships within the spatial domain, further enhanced by multi-scale feature utilization for refined local details, leading to accurate prediction. Morphological boundary refinement is then employed to address indistinct boundaries with neighboring organs, capturing finer details and yielding precise boundary labels. The efficacy of T-MPEDNet is comprehensively assessed on two widely utilized public benchmark datasets, LiTS and 3DIRCADb. Extensive quantitative and qualitative analyses demonstrate the superiority of T-MPEDNet compared to twelve state-of-the-art methods. On LiTS, T-MPEDNet achieves outstanding Dice Similarity Coefficients (DSC) of 97.6% and 89.1% for liver and tumor segmentation, respectively. Similar performance is observed on 3DIRCADb, with DSCs of 98.3% and 83.3% for liver and tumor segmentation, respectively. Our findings prove that T-MPEDNet is an efficacious and reliable framework for automated segmentation of the liver and its tumor in CT scans.
中文: 本文提出的T-MPEDNet网络通过结合多尺度特征和边界优化,在CT扫描中实现了领先的肝脏与肿瘤自动分割效果,并在公开数据集上验证了其优越性能。
English: The authors propose T-MPEDNet, a novel Transformer-aware network that achieves state-of-the-art automated liver and tumor segmentation in CT scans by integrating multiscale features and boundary refinement, demonstrating superior performance on benchmark datasets.
Authors:Jiyao Wang, Xiao Yang, Qingyong Hu, Jiankai Tang, Can Liu, Dengbo He, Yuntao Wang, Yingcong Chen, Kaishun Wu
Abstract:
Robust and unobtrusive in-vehicle physiological monitoring is crucial for ensuring driving safety and user experience. While remote physiological measurement (RPM) offers a promising non-invasive solution, its translation to real-world driving scenarios is critically constrained by the scarcity of comprehensive datasets. Existing resources are often limited in scale, modality diversity, the breadth of biometric annotations, and the range of captured conditions, thereby omitting inherent real-world challenges in driving. Here, we present PhysDrive, the first large-scale multimodal dataset for contactless in-vehicle physiological sensing with dedicated consideration on various modality settings and driving factors. PhysDrive collects data from 48 drivers, including synchronized RGB, near-infrared camera, and raw mmWave radar data, accompanied with six synchronized ground truths (ECG, BVP, Respiration, HR, RR, and SpO2). It covers a wide spectrum of naturalistic driving conditions, including driver motions, dynamic natural light, vehicle types, and road conditions. We extensively evaluate both signal-processing and deep-learning methods on PhysDrive, establishing a comprehensive benchmark across all modalities, and release full open-source code with compatibility for mainstream public toolboxes. We envision PhysDrive will serve as a foundational resource and accelerate research on multimodal driver monitoring and smart-cockpit systems.
中文: PhysDrive作为首个大规模多模态车内无接触生理监测数据集,通过整合48名驾驶员在多样化驾驶条件下的同步生物特征数据,并建立全面基准测试与开源工具,解决了现有资源不足的问题,旨在推动驾驶员监测与智能座舱系统的研究发展。
English: PhysDrive introduces the first large-scale multimodal dataset for contactless in-vehicle physiological monitoring, addressing the limitations of existing resources by incorporating diverse driving conditions and synchronized biometric data from 48 drivers, alongside comprehensive benchmarks and open-source tools to advance driver monitoring research.
Authors:Jessica Quaye, Charvi Rastogi, Alicia Parrish, Oana Inel, Minsuk Kahng, Lora Aroyo, Vijay Janapa Reddi
Abstract:
Text-to-image (T2I) models have become prevalent across numerous applications, making their robust evaluation against adversarial attacks a critical priority. Continuous access to new and challenging adversarial prompts across diverse domains is essential for stress-testing these models for resilience against novel attacks from multiple vectors. Current techniques for generating such prompts are either entirely authored by humans or synthetically generated. On the one hand, datasets of human-crafted adversarial prompts are often too small in size and imbalanced in their cultural and contextual representation. On the other hand, datasets of synthetically-generated prompts achieve scale, but typically lack the realistic nuances and creative adversarial strategies found in human-crafted prompts. To combine the strengths of both human and machine approaches, we propose Seed2Harvest, a hybrid red-teaming method for guided expansion of culturally diverse, human-crafted adversarial prompt seeds. The resulting prompts preserve the characteristics and attack patterns of human prompts while maintaining comparable average attack success rates (0.31 NudeNet, 0.36 SD NSFW, 0.12 Q16). Our expanded dataset achieves substantially higher diversity with 535 unique geographic locations and a Shannon entropy of 7.48, compared to 58 locations and 5.28 entropy in the original dataset. Our work demonstrates the importance of human-machine collaboration in leveraging human creativity and machine computational capacity to achieve comprehensive, scalable red-teaming for continuous T2I model safety evaluation.
中文: Seed2Harvest是一种混合红队方法,通过扩展文化多样性的人工制作对抗性提示来增强文本到图像模型的韧性测试,借助人机协作实现了更高的多样性和相当的攻击成功率。
English: Seed2Harvest is a hybrid red-teaming method that expands culturally diverse human-crafted adversarial prompts to enhance the resilience testing of text-to-image models, achieving higher diversity and comparable attack success rates through human-machine collaboration.
Authors:Yimeng Zhang, Tian Wang, Jiri Gesi, Ziyi Wang, Yuxuan Lu, Jiacheng Lin, Sinong Zhan, Vianne Gao, Ruochen Jiao, Junze Liu, Kun Qian, Yuxin Tang, Ran Xue, Houyu Zhang, Qingjun Cui, Yufan Guo, Dakuo Wang
Abstract:
Large Language Models (LLMs) have recently demonstrated strong potential in generating 'believable human-like' behavior in web environments. Prior work has explored augmenting training data with LLM-synthesized rationales and applying supervised fine-tuning (SFT) to enhance reasoning ability, which in turn can improve downstream action prediction. However, the performance of such approaches remains inherently bounded by the reasoning capabilities of the model used to generate the rationales. In this paper, we introduce Shop-R1, a novel reinforcement learning (RL) framework aimed at enhancing the reasoning ability of LLMs for simulation of real human behavior in online shopping environments Specifically, Shop-R1 decomposes the human behavior simulation task into two stages: rationale generation and action prediction, each guided by distinct reward signals. For rationale generation, we leverage internal model signals (e.g., logit distributions) to guide the reasoning process in a self-supervised manner. For action prediction, we propose a hierarchical reward structure with difficulty-aware scaling to prevent reward hacking and enable fine-grained reward assignment. This design evaluates both high-level action types and the correctness of fine-grained sub-action details (attributes and values), rewarding outputs proportionally to their difficulty. Experimental results show that our method achieves a relative improvement of over 65% compared to the baseline.
中文摘要:大型语言模型在模拟人类在线行为方面潜力巨大,而新型Shop-R1强化学习框架通过自监督推理生成和分层奖励机制提升其推理能力,相比基线实现了超过65%的性能提升。
English Summary: Large Language Models show promise in simulating human behavior online, and the new Shop-R1 reinforcement learning framework enhances their reasoning by using self-supervised rationale generation and a hierarchical reward system, achieving over 65% improvement over baselines.
Authors:Olaf Dünkel, Artur Jesslen, Jiahao Xie, Christian Theobalt, Christian Rupprecht, Adam Kortylewski
Abstract:
An important challenge when using computer vision models in the real world is to evaluate their performance in potential out-of-distribution (OOD) scenarios. While simple synthetic corruptions are commonly applied to test OOD robustness, they often fail to capture nuisance shifts that occur in the real world. Recently, diffusion models have been applied to generate realistic images for benchmarking, but they are restricted to binary nuisance shifts. In this work, we introduce CNS-Bench, a Continuous Nuisance Shift Benchmark to quantify OOD robustness of image classifiers for continuous and realistic generative nuisance shifts. CNS-Bench allows generating a wide range of individual nuisance shifts in continuous severities by applying LoRA adapters to diffusion models. To address failure cases, we propose a filtering mechanism that outperforms previous methods, thereby enabling reliable benchmarking with generative models. With the proposed benchmark, we perform a large-scale study to evaluate the robustness of more than 40 classifiers under various nuisance shifts. Through carefully designed comparisons and analyses, we find that model rankings can change for varying shifts and shift scales, which cannot be captured when applying common binary shifts. Additionally, we show that evaluating the model performance on a continuous scale allows the identification of model failure points, providing a more nuanced understanding of model robustness. Project page including code and data: https://genintel.github.io/CNS.
中文: CNS-Bench通过扩散模型与LoRA适配器构建连续干扰变化基准,评估图像分类器在真实多变干扰下的分布外鲁棒性,发现模型排名随干扰类型和程度改变,突破了传统二元基准的局限性。
English: CNS-Bench introduces a continuous nuisance shift benchmark using diffusion models with LoRA adapters to evaluate image classifiers' out-of-distribution robustness across realistic, varying severities, revealing that model rankings shift with different nuisance types and scales, unlike binary benchmarks.
Authors:Chang Nie, Guangming Wang, Zhe Lie, Hesheng Wang
Abstract:
Robot imitation learning relies on 4D multi-view sequential images. However, the high cost of data collection and the scarcity of high-quality data severely constrain the generalization and application of embodied intelligence policies like Vision-Language-Action (VLA) models. Data augmentation is a powerful strategy to overcome data scarcity, but methods for editing 4D multi-view sequential images for manipulation tasks are currently lacking. Thus, we propose ERMV (Editing Robotic Multi-View 4D data), a novel data augmentation framework that efficiently edits an entire multi-view sequence based on single-frame editing and robot state conditions. This task presents three core challenges: (1) maintaining geometric and appearance consistency across dynamic views and long time horizons; (2) expanding the working window with low computational costs; and (3) ensuring the semantic integrity of critical objects like the robot arm. ERMV addresses these challenges through a series of innovations. First, to ensure spatio-temporal consistency in motion blur, we introduce a novel Epipolar Motion-Aware Attention (EMA-Attn) mechanism that learns pixel shift caused by movement before applying geometric constraints. Second, to maximize the editing working window, ERMV pioneers a Sparse Spatio-Temporal (STT) module, which decouples the temporal and spatial views and remodels a single-frame multi-view problem through sparse sampling of the views to reduce computational demands. Third, to alleviate error accumulation, we incorporate a feedback intervention Mechanism, which uses a Multimodal Large Language Model (MLLM) to check editing inconsistencies and request targeted expert guidance only when necessary. Extensive experiments demonstrate that ERMV-augmented data significantly boosts the robustness and generalization of VLA models in both simulated and real-world environments.
中文: ERMV是一种创新的数据增强框架,通过引入运动感知注意力、稀疏时空模块和反馈干预机制,有效编辑4D多视角机器人序列数据,显著提升了视觉语言动作模型在真实场景中的泛化能力。
English: ERMV is a novel data augmentation framework that efficiently edits 4D multi-view robotic sequences by addressing geometric consistency, computational efficiency, and semantic integrity through innovative mechanisms, significantly enhancing VLA model performance in real-world applications.
Authors:Sania Waheed, Na Min An, Michael Milford, Sarvapali D. Ramchurn, Shoaib Ehsan
Abstract:
Geo-localization from a single image at planet scale (essentially an advanced or extreme version of the kidnapped robot problem) is a fundamental and challenging task in applications such as navigation, autonomous driving and disaster response due to the vast diversity of locations, environmental conditions, and scene variations. Traditional retrieval-based methods for geo-localization struggle with scalability and perceptual aliasing, while classification-based approaches lack generalization and require extensive training data. Recent advances in vision-language models (VLMs) offer a promising alternative by leveraging contextual understanding and reasoning. However, while VLMs achieve high accuracy, they are often prone to hallucinations and lack interpretability, making them unreliable as standalone solutions. In this work, we propose a novel hybrid geo-localization framework that combines the strengths of VLMs with retrieval-based visual place recognition (VPR) methods. Our approach first leverages a VLM to generate a prior, effectively guiding and constraining the retrieval search space. We then employ a retrieval step, followed by a re-ranking mechanism that selects the most geographically plausible matches based on feature similarity and proximity to the initially estimated coordinates. We evaluate our approach on multiple geo-localization benchmarks and show that it consistently outperforms prior state-of-the-art methods, particularly at street (up to 4.51%) and city level (up to 13.52%). Our results demonstrate that VLM-generated geographic priors in combination with VPR lead to scalable, robust, and accurate geo-localization systems.
Chinese: 本文提出一种结合视觉语言模型与检索式视觉位置识别的混合地理定位框架,通过VLM生成先验信息指导检索和重排序,在多个基准测试中显著提升了街道和城市级别的定位精度与可扩展性。
English: This paper introduces a hybrid geo-localization framework that integrates vision-language models with retrieval-based visual place recognition, using VLM-generated priors to guide retrieval and re-ranking for improved accuracy and scalability across benchmarks.
Authors:Jiansong Wan, Chengming Zhou, Jinkua Liu, Xiangge Huang, Xiaoyu Chen, Xiaohan Yi, Qisen Yang, Baiting Zhu, Xin-Qiang Cai, Lixing Liu, Rushuai Yang, Chuheng Zhang, Sherif Abdelfattah, Hayong Shin, Pushi Zhang, Li Zhao, Jiang Bian
Abstract:
Recent studies have explored pretrained (foundation) models for vision-based robotic navigation, aiming to achieve generalizable navigation and positive transfer across diverse environments while enhancing zero-shot performance in unseen settings. In this work, we introduce PIG-Nav (Pretrained Image-Goal Navigation), a new approach that further investigates pretraining strategies for vision-based navigation models and contributes in two key areas. Model-wise, we identify two critical design choices that consistently improve the performance of pretrained navigation models: (1) integrating an early-fusion network structure to combine visual observations and goal images via appropriately pretrained Vision Transformer (ViT) image encoder, and (2) introducing suitable auxiliary tasks to enhance global navigation representation learning, thus further improving navigation performance. Dataset-wise, we propose a novel data preprocessing pipeline for efficiently labeling large-scale game video datasets for navigation model training. We demonstrate that augmenting existing open navigation datasets with diverse gameplay videos improves model performance. Our model achieves an average improvement of 22.6% in zero-shot settings and a 37.5% improvement in fine-tuning settings over existing visual navigation foundation models in two complex simulated environments and one real-world environment. These results advance the state-of-the-art in pretrained image-goal navigation models. Notably, our model maintains competitive performance while requiring significantly less fine-tuning data, highlighting its potential for real-world deployment with minimal labeled supervision.
中文: PIG-Nav方法通过融合预训练视觉Transformer的早期网络结构和辅助任务,提升了视觉导航模型的性能,在零样本和微调场景中显著进步,并降低了对标注数据的依赖。
English: The PIG-Nav approach enhances vision-based robotic navigation by integrating an early-fusion network with a pretrained Vision Transformer and auxiliary tasks, achieving significant performance gains in zero-shot and fine-tuning settings with reduced data needs.
Authors:Zhenliang Zhang, Xinyu Hu, Huixuan Zhang, Junzhe Zhang, Xiaojun Wan
Abstract:
Large language models (LLMs) excel at various natural language processing tasks, but their tendency to generate hallucinations undermines their reliability. Existing hallucination detection methods leveraging hidden states predominantly focus on static and isolated representations, overlooking their dynamic evolution across layers, which limits efficacy. To address this limitation, we shift the focus to the hidden state update process and introduce a novel metric, the ICR Score (Information Contribution to Residual Stream), which quantifies the contribution of modules to the hidden states' update. We empirically validate that the ICR Score is effective and reliable in distinguishing hallucinations. Building on these insights, we propose a hallucination detection method, the ICR Probe, which captures the cross-layer evolution of hidden states. Experimental results show that the ICR Probe achieves superior performance with significantly fewer parameters. Furthermore, ablation studies and case analyses offer deeper insights into the underlying mechanism of this method, improving its interpretability.
中文: 本研究提出ICR评分量化模块对隐藏状态更新的贡献,并开发ICR探针检测幻觉,该方法通过捕捉跨层演化以更少参数实现更优性能。
English: This study introduces the ICR Score to quantify module contributions in hidden state updates and proposes the ICR Probe for hallucination detection, which achieves superior performance with fewer parameters by capturing cross-layer evolution.
Authors:Ding Wang, Mark DÃaz, Charvi Rastogi, Aida Davani, Vinodkumar Prabhakaran, Pushkar Mishra, Roma Patel, Alicia Parrish, Zoe Ashwood, Michela Paganini, Tian Huey Teh, Verena Rieser, Lora Aroyo
Abstract:
Understanding what constitutes safety in AI-generated content is complex. While developers often rely on predefined taxonomies, real-world safety judgments also involve personal, social, and cultural perceptions of harm. This paper examines how annotators evaluate the safety of AI-generated images, focusing on the qualitative reasoning behind their judgments. Analyzing 5,372 open-ended comments, we find that annotators consistently invoke moral, emotional, and contextual reasoning that extends beyond structured safety categories. Many reflect on potential harm to others more than to themselves, grounding their judgments in lived experience, collective risk, and sociocultural awareness. Beyond individual perceptions, we also find that the structure of the task itself -- including annotation guidelines -- shapes how annotators interpret and express harm. Guidelines influence not only which images are flagged, but also the moral judgment behind the justifications. Annotators frequently cite factors such as image quality, visual distortion, and mismatches between prompt and output as contributing to perceived harm dimensions, which are often overlooked in standard evaluation frameworks. Our findings reveal that existing safety pipelines miss critical forms of reasoning that annotators bring to the task. We argue for evaluation designs that scaffold moral reflection, differentiate types of harm, and make space for subjective, context-sensitive interpretations of AI-generated content.
中文: 研究发现,标注者通过道德、情感和情境推理评估AI生成图像的安全性,这些判断常基于生活经验和社会文化认知,而现有安全评估框架却忽视了这些关键维度,任务结构和指导方针也显著影响伤害认定的方式。
English: This study reveals that human annotators assess the safety of AI-generated images through moral, emotional, and contextual reasoning beyond predefined categories, highlighting how task structures and guidelines shape harm interpretations while exposing gaps in current evaluation frameworks.
Authors:Yuanhe Tian, Junjie Liu, Zhizhou Kou, Yuxiang Li, Yan Song
Abstract:
Building high-quality data resources is crucial for advancing artificial intelligence research and applications in specific domains, particularly in the Chinese medical domain. Existing Chinese medical datasets are limited in size and narrow in domain coverage, falling short of the diverse corpora required for effective pre-training. Moreover, most datasets are designed solely for LLM fine-tuning and do not support pre-training and reinforcement learning from human feedback (RLHF). In this paper, we propose a Chinese medical dataset named ChiMed 2.0, which extends our previous work ChiMed, and covers data collected from Chinese medical online platforms and generated by LLMs. ChiMed 2.0 contains 204.4M Chinese characters covering both traditional Chinese medicine classics and modern general medical data, where there are 164.8K documents for pre-training, 351.6K question-answering pairs for supervised fine-tuning (SFT), and 41.7K preference data tuples for RLHF. To validate the effectiveness of our approach for training a Chinese medical LLM, we conduct further pre-training, SFT, and RLHF experiments on representative general domain LLMs and evaluate their performance on medical benchmark datasets. The results show performance gains across different model scales, validating the dataset's effectiveness and applicability.
中文摘要:ChiMed 2.0提出了一个全面的中文医疗数据集,通过提供大规模预训练、微调和人类反馈强化学习数据解决了现有资源不足的问题,实验结果表明该数据集能有效提升不同规模模型的医疗领域性能。
English Summary: ChiMed 2.0 introduces a comprehensive Chinese medical dataset addressing limitations in existing resources by providing large-scale data for pre-training, fine-tuning, and RLHF, with experimental results demonstrating significant performance improvements across various model scales.
Authors:Chenlei Gong, Yuanhe Tian, Lei Mao, Yan Song
Abstract:
Currently, many studies view DNA sequences as a special type of language and utilize Transformers to model them. These studies use fixed-length k-mer segmentation and BPE subword tokenization but lack a systematic evaluation to determine which is superior. We compare k-mer segmentation with k=1,3,4,5,6, a 4,096-token BPE vocabulary, and three positional encoding methods-sinusoidal, AliBi, and RoPE. Each configuration is trained from scratch in 3, 6, 12, and 24-layer Transformer encoders and evaluated on GUE benchmark dataset. In general, BPE delivers higher and more stable performance across tasks by compressing frequent motifs into variable-length tokens, reducing sequence length, and improving model generalization. RoPE excels at capturing periodic motifs and extrapolating to long sequences, while AliBi also performs well on tasks driven by local dependencies. In terms of depth, we observe significant gains when increasing layers from 3 to 12, with only marginal improvements or slight overfitting at 24 layers. This study provides practical guidance for designing tokenization and positional encoding in DNA Transformer models.
中文: 本研究系统评估了DNA序列建模中的k-mer分割与BPE分词及不同位置编码方法,发现BPE通过压缩频繁模因和缩短序列能获得更优性能,而RoPE在捕捉周期性模因方面表现突出,为模型设计提供了实用指导。
English: This study systematically evaluates k-mer segmentation and BPE tokenization with various positional encodings in DNA sequence modeling, finding BPE generally superior for task performance and stability while RoPE excels at capturing periodic motifs, with practical guidance provided for model design.
Authors:Chen Su, Zhengzhou Cai, Yuanhe Tian, Zhuochao Chang, Zihong Zheng, Yan Song
Abstract:
Diffusion models, initially developed for image synthesis, demonstrate remarkable generative capabilities. Recently, their application has expanded to time series forecasting (TSF), yielding promising results. Existing surveys on time series primarily focus on the application of diffusion models to time series tasks or merely provide model-by-model introductions of diffusion-based TSF models, without establishing a systematic taxonomy for existing diffusion-based TSF models. In this survey, we firstly introduce several standard diffusion models and their prevalent variants, explaining their adaptation to TSF tasks. Then, we provide a comprehensive review of diffusion models for TSF, paying special attention to the sources of conditional information and the mechanisms for integrating this conditioning within the models. In analyzing existing approaches using diffusion models for TSF, we provide a systematic categorization and a comprehensive summary of them in this survey. Furthermore, we examine several foundational diffusion models applied to TSF, alongside commonly used datasets and evaluation metrics. Finally, we discuss the progress and limitations of these approaches, as well as potential future research directions for diffusion-based TSF. Overall, this survey offers a comprehensive overview of recent progress and future prospects for diffusion models in TSF, serving as a valuable reference for researchers in the field.
中文: 本综述系统梳理了扩散模型在时间序列预测中的应用,涵盖模型分类、条件机制、数据集及未来研究方向,为相关研究提供全面参考。
English: This survey systematically categorizes and reviews diffusion models for time series forecasting, detailing their adaptations, conditional mechanisms, datasets, and future research directions.
Authors:Xiaojie Li, Zhijie Cai, Nan Qi, Chao Dong, Guangxu Zhu, Haixia Ma, Qihui Wu, Shi Jin
Abstract:
The expansion of the low-altitude economy has underscored the significance of Low-Altitude Network Coverage (LANC) prediction for designing aerial corridors. While accurate LANC forecasting hinges on the antenna beam patterns of Base Stations (BSs), these patterns are typically proprietary and not readily accessible. Operational parameters of BSs, which inherently contain beam information, offer an opportunity for data-driven low-altitude coverage prediction. However, collecting extensive low-altitude road test data is cost-prohibitive, often yielding only sparse samples per BS. This scarcity results in two primary challenges: imbalanced feature sampling due to limited variability in high-dimensional operational parameters against the backdrop of substantial changes in low-dimensional sampling locations, and diminished generalizability stemming from insufficient data samples. To overcome these obstacles, we introduce a dual strategy comprising expert knowledge-based feature compression and disentangled representation learning. The former reduces feature space complexity by leveraging communications expertise, while the latter enhances model generalizability through the integration of propagation models and distinct subnetworks that capture and aggregate the semantic representations of latent features. Experimental evaluation confirms the efficacy of our framework, yielding a 7% reduction in error compared to the best baseline algorithm. Real-network validations further attest to its reliability, achieving practical prediction accuracy with MAE errors at the 5dB level.
Chinese Summary: 该研究提出了一种结合专家知识特征压缩和解耦表示学习的双重策略,以解决低空网络覆盖预测中的数据稀缺和特征不平衡问题,在真实网络验证中实现了7%的误差降低和5dB级别的实用预测精度。
English Summary: The study introduces a dual strategy combining expert knowledge-based feature compression and disentangled representation learning to overcome data scarcity and feature imbalance challenges in low-altitude network coverage prediction, achieving a 7% error reduction and practical 5dB-level accuracy in real-network validations.
Authors:Zhixin Wang, Tianyi Zhou, Liming Liu, Ao Li, Jiarui Hu, Dian Yang, Yinhui Lu, Jinlong Hou, Siyuan Feng, Yuan Cheng, Yuan Qi
Abstract:
Reinforcement learning (RL) has become the pivotal post-training technique for large language model (LLM). Effectively scaling reinforcement learning is now the key to unlocking advanced reasoning capabilities and ensuring safe, goal-aligned behavior in the most powerful LLMs. Mainstream frameworks usually employ a hybrid-controller architecture where a single-controller dispatches the overall execution logic and manages overall data transfer and the multi-controller executes distributed computation. For large-scale reinforcement learning, minor load imbalances can introduce significant bottlenecks, ultimately constraining the scalability of the system. To address this limitation, we introduce DistFlow, a novel, fully distributed RL framework designed to break scaling barrier. We adopt a multi-controller paradigm that dispatches data transfer and execution tasks to all workers, which eliminates the centralized node. This allows each worker to operate independently, leading to near-linear scalability up to 1024 GPUs and dramatic efficiency gains. Furthermore, our architecture decouples resource configuration from execution logic, allowing each worker to have a unique execution flow, offering significant flexibility for rapid and cost-effective algorithmic experimentation. Extensive experiments show that DistFlow achieves excellent linear scalability and up to a 7x end-to-end throughput improvement in specific scenarios over state-of-the-art (SOTA) frameworks.
中文: DistFlow是一种全分布式强化学习框架,通过消除中心化控制实现了近线性的扩展性,在1024个GPU上运行,并在特定场景下比现有最优框架的端到端吞吐量提升高达7倍。
English: DistFlow is a fully distributed reinforcement learning framework that eliminates centralized control, enabling near-linear scalability up to 1024 GPUs and achieving up to 7x throughput improvement over existing systems.
Authors:Qian Wang, Yubo Fan, Zhenheng Tang, Nuo Chen, Wenxuan Wang, Bingsheng He
Abstract:
Large Reasoning Models (LRMs) like DeepSeek-R1 and o1 are increasingly used as automated evaluators, raising critical questions about their vulnerability to the aesthetics of reasoning in LLM-as-a-judge settings. We introduce THEATER, a comprehensive benchmark to systematically evaluate this vulnerability-termed Reasoning Theater Bias (RTB)-by comparing LLMs and LRMs across subjective preference and objective factual datasets. Through investigation of six bias types including Simple Cues and Fake Chain-of-Thought, we uncover three key findings: (1) in a critical paradox, reasoning-specialized LRMs are consistently more susceptible to RTB than general-purpose LLMs, particularly in subjective tasks; (2) this creates a task-dependent trade-off, where LRMs show more robustness on factual tasks but less on subjective ones; and (3) we identify 'shallow reasoning'-plausible but flawed arguments-as the most potent form of RTB. To address this, we design and evaluate two prompting strategies: a targeted system prompt that improves accuracy by up to 12% on factual tasks but only 1-3% on subjective tasks, and a self-reflection mechanism that shows similarly limited effectiveness in the more vulnerable subjective domains. Our work reveals that RTB is a deep-seated challenge for LRM-based evaluation and provides a systematic framework for developing more genuinely robust and trustworthy LRMs.
中文: 大型推理模型和标准大语言模型易受伪推理偏见影响,倾向于接受表面推理结构,THEATER基准测试显示该问题在主观任务中尤为突出,且现有缓解措施效果有限。
English: Large Reasoning Models and standard LLMs are susceptible to Fake Reasoning Bias, where they favor flawed reasoning structures, as demonstrated by the THEATER benchmark, revealing vulnerabilities especially in subjective tasks and prompting limited mitigation.
Authors:Qian Wang, Zhenheng Tang, Zhanzhi Lou, Nuo Chen, Wenxuan Wang, Bingsheng He
Abstract:
Large Reasoning Models (LRMs), evolved from standard Large Language Models (LLMs), are increasingly utilized as automated judges because of their explicit reasoning processes. Yet we show that both LRMs and standard LLMs are vulnerable to Fake Reasoning Bias (FRB), where models favor the surface structure of reasoning even when the logic is flawed. To study this problem, we introduce THEATER, a comprehensive benchmark that systematically investigates FRB by manipulating reasoning structures to test whether language models are misled by superficial or fabricated cues. It covers two FRB types: (1) Simple Cues, minimal cues that resemble reasoning processes, and (2) Fake CoT, fabricated chains of thought that simulate multi-step reasoning. We evaluate 17 advanced LLMs and LRMs on both subjective DPO and factual datasets. Our results reveal four key findings: (1) Both LLMs and LRMs are vulnerable to FRB, but LLMs are generally more robust than LRMs. (2) Simple Cues are especially harmful, reducing accuracy by up to 15% on the most vulnerable datasets. (3) Subjective DPO tasks are the most vulnerable, with LRMs suffering sharper drops than LLMs. (4) Analysis of LRMs' thinking traces shows that Simple Cues hijack metacognitive confidence, while Fake CoT is absorbed as internal thought, creating a "more thinking, less robust" paradox in LRMs. Finally, prompt-based mitigation improves accuracy on factual tasks by up to 10%, but has little effect on subjective tasks, where self-reflection sometimes lowers LRM performance by 8%. These results highlight FRB as a persistent and unresolved challenge for language models.
中文: 大型推理模型和标准大语言模型易受伪推理偏见影响,倾向于接受表面推理结构,THEATER基准测试显示该问题在主观任务中尤为突出,且现有缓解措施效果有限。
English: Large Reasoning Models and standard LLMs are susceptible to Fake Reasoning Bias, where they favor flawed reasoning structures, as demonstrated by the THEATER benchmark, revealing vulnerabilities especially in subjective tasks and prompting limited mitigation.
Authors:Shreya Kadambi, Risheek Garrepalli, Shubhankar Borse, Munawar Hyatt, Fatih Porikli
Abstract:
Despite the remarkable success of diffusion models in text-to-image generation, their effectiveness in grounded visual editing and compositional control remains challenging. Motivated by advances in self-supervised learning and in-context generative modeling, we propose a series of simple yet powerful design choices that significantly enhance diffusion model capacity for structured, controllable generation and editing. We introduce Masking-Augmented Diffusion with Inference-Time Scaling (MADI), a framework that improves the editability, compositionality and controllability of diffusion models through two core innovations. First, we introduce Masking-Augmented gaussian Diffusion (MAgD), a novel training strategy with dual corruption process which combines standard denoising score matching and masked reconstruction by masking noisy input from forward process. MAgD encourages the model to learn discriminative and compositional visual representations, thus enabling localized and structure-aware editing. Second, we introduce an inference-time capacity scaling mechanism based on Pause Tokens, which act as special placeholders inserted into the prompt for increasing computational capacity at inference time. Our findings show that adopting expressive and dense prompts during training further enhances performance, particularly for MAgD. Together, these contributions in MADI substantially enhance the editability of diffusion models, paving the way toward their integration into more general-purpose, in-context generative diffusion architectures.
中文摘要:提出的MADI框架通过双重训练策略和推理时扩展机制,显著提升了扩散模型的可编辑性与组合性,实现了更结构化、可控的图像生成与编辑。
English Summary: The proposed MADI framework enhances diffusion models' editability and compositionality through a dual-training strategy and inference-time scaling, enabling more structured and controllable image generation and editing.
Authors:Charvi Rastogi, Tian Huey Teh, Pushkar Mishra, Roma Patel, Ding Wang, Mark DÃaz, Alicia Parrish, Aida Mostafazadeh Davani, Zoe Ashwood, Michela Paganini, Vinodkumar Prabhakaran, Verena Rieser, Lora Aroyo
Abstract:
Current text-to-image (T2I) models often fail to account for diverse human experiences, leading to misaligned systems. We advocate for pluralistic alignment, where an AI understands and is steerable towards diverse, and often conflicting, human values. Our work provides three core contributions to achieve this in T2I models. First, we introduce a novel dataset for Diverse Intersectional Visual Evaluation (DIVE) -- the first multimodal dataset for pluralistic alignment. It enable deep alignment to diverse safety perspectives through a large pool of demographically intersectional human raters who provided extensive feedback across 1000 prompts, with high replication, capturing nuanced safety perceptions. Second, we empirically confirm demographics as a crucial proxy for diverse viewpoints in this domain, revealing significant, context-dependent differences in harm perception that diverge from conventional evaluations. Finally, we discuss implications for building aligned T2I models, including efficient data collection strategies, LLM judgment capabilities, and model steerability towards diverse perspectives. This research offers foundational tools for more equitable and aligned T2I systems. Content Warning: The paper includes sensitive content that may be harmful.
中文: 本研究通过新型多模态数据集(DIVE)提出文本到图像模型的多元对齐框架,证实人口统计多样性对捕捉细微安全认知的关键作用,并为构建更公平的AI系统提供了基础工具。
English: This research introduces a pluralistic alignment framework for text-to-image models through a novel multimodal dataset (DIVE), demonstrating demographic diversity as crucial for capturing nuanced safety perceptions and providing tools for equitable AI systems.
Authors:Minghao Chen, Jianyuan Wang, Roman Shapovalov, Tom Monnier, Hyunyoung Jung, Dilin Wang, Rakesh Ranjan, Iro Laina, Andrea Vedaldi
Abstract:
We introduce AutoPartGen, a model that generates objects composed of 3D parts in an autoregressive manner. This model can take as input an image of an object, 2D masks of the object's parts, or an existing 3D object, and generate a corresponding compositional 3D reconstruction. Our approach builds upon 3DShape2VecSet, a recent latent 3D representation with powerful geometric expressiveness. We observe that this latent space exhibits strong compositional properties, making it particularly well-suited for part-based generation tasks. Specifically, AutoPartGen generates object parts autoregressively, predicting one part at a time while conditioning on previously generated parts and additional inputs, such as 2D images, masks, or 3D objects. This process continues until the model decides that all parts have been generated, thus determining automatically the type and number of parts. The resulting parts can be seamlessly assembled into coherent objects or scenes without requiring additional optimization. We evaluate both the overall 3D generation capabilities and the part-level generation quality of AutoPartGen, demonstrating that it achieves state-of-the-art performance in 3D part generation.
AutoPartGen是一种自回归模型,能够根据图像、部件掩码或现有3D对象逐部件生成组合式三维重建,在部件级生成任务中达到最优性能。
AutoPartGen is an autoregressive model that generates 3D objects part by part from inputs like images, masks, or existing 3D shapes, achieving state-of-the-art performance in compositional reconstruction.
Authors:Zhou Feng, Jiahao Chen, Chunyi Zhou, Yuwen Pu, Qingming Li, Tianyu Du, Shouling Ji
Abstract:
The rapid advancement of voice deepfake technologies has raised serious concerns about user audio privacy, as attackers increasingly exploit publicly available voice data to generate convincing fake audio for malicious purposes such as identity theft, financial fraud, and misinformation campaigns. While existing defense methods offer partial protection, they face critical limitations, including weak adaptability to unseen user data, poor scalability to long audio, rigid reliance on white-box knowledge, and high computational and temporal costs during the encryption process. To address these challenges and defend against personalized voice deepfake threats, we propose Enkidu, a novel user-oriented privacy-preserving framework that leverages universal frequential perturbations generated through black-box knowledge and few-shot training on a small amount of user data. These highly malleable frequency-domain noise patches enable real-time, lightweight protection with strong generalization across variable-length audio and robust resistance to voice deepfake attacks, all while preserving perceptual quality and speech intelligibility. Notably, Enkidu achieves over 50 to 200 times processing memory efficiency (as low as 0.004 gigabytes) and 3 to 7000 times runtime efficiency (real-time coefficient as low as 0.004) compared to six state-of-the-art countermeasures. Extensive experiments across six mainstream text-to-speech models and five cutting-edge automated speaker verification models demonstrate the effectiveness, transferability, and practicality of Enkidu in defending against both vanilla and adaptive voice deepfake attacks. Our code is currently available.
中文: 提出的Enkidu框架通过应用通用频域扰动,以实时轻量级的方式防御语音深度伪造攻击,在保持音频质量的同时实现了卓越的效率和鲁棒性。
English: The proposed Enkidu framework provides real-time, lightweight protection against voice deepfake attacks by applying universal frequential perturbations, achieving exceptional efficiency and robustness while preserving audio quality.
Authors:Penglei Sun, Yaoxian Song, Xiangru Zhu, Xiang Liu, Qiang Wang, Yue Liu, Changqun Xia, Tiefeng Li, Yang Yang, Xiaowen Chu
Abstract:
Scene understanding enables intelligent agents to interpret and comprehend their environment. While existing large vision-language models (LVLMs) for scene understanding have primarily focused on indoor household tasks, they face two significant limitations when applied to outdoor large-scale scene understanding. First, outdoor scenarios typically encompass larger-scale environments observed through various sensors from multiple viewpoints (e.g., bird view and terrestrial view), while existing indoor LVLMs mainly analyze single visual modalities within building-scale contexts from humanoid viewpoints. Second, existing LVLMs suffer from missing multidomain perception outdoor data and struggle to effectively integrate 2D and 3D visual information. To address the aforementioned limitations, we build the first multidomain perception outdoor scene understanding dataset, named \textbf{\underline{SVM-City}}, deriving from multi\textbf{\underline{S}}cale scenarios with multi\textbf{\underline{V}}iew and multi\textbf{\underline{M}}odal instruction tuning data. It contains $420$k images and $4, 811$M point clouds with $567$k question-answering pairs from vehicles, low-altitude drones, high-altitude aerial planes, and satellite. To effectively fuse the multimodal data in the absence of one modality, we introduce incomplete multimodal learning to model outdoor scene understanding and design the LVLM named \textbf{\underline{City-VLM}}. Multimodal fusion is realized by constructing a joint probabilistic distribution space rather than implementing directly explicit fusion operations (e.g., concatenation). Experimental results on three typical outdoor scene understanding tasks show City-VLM achieves $18.14 \%$ performance surpassing existing LVLMs in question-answering tasks averagely. Our method demonstrates pragmatic and generalization performance across multiple outdoor scenes.
中文: 现有大型视觉语言模型在室外场景理解中存在两大局限:局限于室内单视角数据且难以融合二维三维信息,为此我们构建了SVM-City多域感知数据集并开发了City-VLM模型,通过不完全多模态学习方法在多项室外任务中实现18.14%的性能提升。
English: Current large vision-language models for scene understanding are limited in outdoor applications due to their focus on indoor single-view data and inability to integrate 2D/3D information, prompting the development of the SVM-City dataset and City-VLM model that uses incomplete multimodal learning to achieve superior performance across various outdoor tasks.
Authors:Trong-Thang Pham, Akash Awasthi, Saba Khan, Esteban Duran Marti, Tien-Phat Nguyen, Khoa Vo, Minh Tran, Ngoc Son Nguyen, Cuong Tran Van, Yuki Ikebe, Anh Totti Nguyen, Anh Nguyen, Zhigang Deng, Carol C. Wu, Hien Van Nguyen, Ngan Le
Abstract:
Understanding radiologists' eye movement during Computed Tomography (CT) reading is crucial for developing effective interpretable computer-aided diagnosis systems. However, CT research in this area has been limited by the lack of publicly available eye-tracking datasets and the three-dimensional complexity of CT volumes. To address these challenges, we present the first publicly available eye gaze dataset on CT, called CT-ScanGaze. Then, we introduce CT-Searcher, a novel 3D scanpath predictor designed specifically to process CT volumes and generate radiologist-like 3D fixation sequences, overcoming the limitations of current scanpath predictors that only handle 2D inputs. Since deep learning models benefit from a pretraining step, we develop a pipeline that converts existing 2D gaze datasets into 3D gaze data to pretrain CT-Searcher. Through both qualitative and quantitative evaluations on CT-ScanGaze, we demonstrate the effectiveness of our approach and provide a comprehensive assessment framework for 3D scanpath prediction in medical imaging.
中文摘要:本研究提出了首个公开的CT眼动数据集CT-ScanGaze和新型三维扫描路径预测器CT-Searcher,通过将二维眼动数据转换为三维数据进行预训练,能生成类似放射科医生的注视序列,并经过综合评估验证了其有效性。
English Summary: This study introduces CT-ScanGaze, the first public eye-tracking dataset for CT scans, and CT-Searcher, a novel 3D scanpath predictor that generates radiologist-like gaze sequences through pretraining with converted 2D gaze data and demonstrates effectiveness via comprehensive evaluation.
Authors:Yuncong Yang, Jiageng Liu, Zheyuan Zhang, Siyuan Zhou, Reuben Tan, Jianwei Yang, Yilun Du, Chuang Gan
Abstract:
Spatial reasoning in 3D space is central to human cognition and indispensable for embodied tasks such as navigation and manipulation. However, state-of-the-art vision-language models (VLMs) struggle frequently with tasks as simple as anticipating how a scene will look after an egocentric motion: they perceive 2D images but lack an internal model of 3D dynamics. We therefore propose MindJourney, a test-time scaling framework that grants a VLM with this missing capability by coupling it to a controllable world model based on video diffusion. The VLM iteratively sketches a concise camera trajectory, while the world model synthesizes the corresponding view at each step. The VLM then reasons over this multi-view evidence gathered during the interactive exploration. Without any fine-tuning, our MindJourney achieves over an average 8% performance boost on the representative spatial reasoning benchmark SAT, showing that pairing VLMs with world models for test-time scaling offers a simple, plug-and-play route to robust 3D reasoning. Meanwhile, our method also improves upon the test-time inference VLMs trained through reinforcement learning, which demonstrates the potential of our method that utilizes world models for test-time scaling.
中文摘要:MindJourney通过将视觉语言模型与视频扩散世界模型相结合,无需微调即可显著提升三维空间推理能力,在基准测试中平均性能提高超过8%。
English Summary: MindJourney enhances vision-language models' 3D spatial reasoning by integrating them with video diffusion world models, achieving over 8% performance improvement on benchmarks without requiring fine-tuning.
Authors:Trong-Thang Pham, Anh Nguyen, Zhigang Deng, Carol C. Wu, Hien Van Nguyen, Ngan Le
Abstract:
Radiologists rely on eye movements to navigate and interpret medical images. A trained radiologist possesses knowledge about the potential diseases that may be present in the images and, when searching, follows a mental checklist to locate them using their gaze. This is a key observation, yet existing models fail to capture the underlying intent behind each fixation. In this paper, we introduce a deep learning-based approach, RadGazeIntent, designed to model this behavior: having an intention to find something and actively searching for it. Our transformer-based architecture processes both the temporal and spatial dimensions of gaze data, transforming fine-grained fixation features into coarse, meaningful representations of diagnostic intent to interpret radiologists' goals. To capture the nuances of radiologists' varied intention-driven behaviors, we process existing medical eye-tracking datasets to create three intention-labeled subsets: RadSeq (Systematic Sequential Search), RadExplore (Uncertainty-driven Exploration), and RadHybrid (Hybrid Pattern). Experimental results demonstrate RadGazeIntent's ability to predict which findings radiologists are examining at specific moments, outperforming baseline methods across all intention-labeled datasets.
中文: RadGazeIntent深度学习模型通过分析放射科医师的眼动模式来解读其诊断意图,在预测医学图像阅片时的关注点方面优于现有方法。
English: RadGazeIntent is a deep learning model that interprets radiologists' diagnostic intentions by analyzing their gaze patterns, outperforming existing methods in predicting their focus during medical image review.
Authors:Gaofeng Li, Ruize Wang, Peisen Xu, Qi Ye, Jiming Chen
Abstract:
Achieving human-like dexterous robotic manipulation remains a central goal and a pivotal challenge in robotics. The development of Artificial Intelligence (AI) has allowed rapid progress in robotic manipulation. This survey summarizes the evolution of robotic manipulation from mechanical programming to embodied intelligence, alongside the transition from simple grippers to multi-fingered dexterous hands, outlining key characteristics and main challenges. Focusing on the current stage of embodied dexterous manipulation, we highlight recent advances in two critical areas: dexterous manipulation data collection (via simulation, human demonstrations, and teleoperation) and skill-learning frameworks (imitation and reinforcement learning). Then, based on the overview of the existing data collection paradigm and learning framework, three key challenges restricting the development of dexterous robotic manipulation are summarized and discussed.
中文摘要:本综述追溯了机器人操作从机械编程到具身智能的演变历程,重点阐述了灵巧操作数据采集与技能学习框架的最新进展,并指出了制约该领域发展的三大关键挑战。
English Summary: This survey traces the evolution of robotic manipulation from mechanical programming to embodied intelligence, highlighting recent advances in dexterous manipulation data collection and skill-learning frameworks while identifying three key challenges hindering further development.
Authors:Yuehao Huang, Liang Liu, Shuangming Lei, Yukai Ma, Hao Su, Jianbiao Mei, Pengxiang Zhao, Yaqing Gu, Yong Liu, Jiajun Lv
Abstract:
Mobile robots are increasingly required to navigate and interact within unknown and unstructured environments to meet human demands. Demand-driven navigation (DDN) enables robots to identify and locate objects based on implicit human intent, even when object locations are unknown. However, traditional data-driven DDN methods rely on pre-collected data for model training and decision-making, limiting their generalization capability in unseen scenarios. In this paper, we propose CogDDN, a VLM-based framework that emulates the human cognitive and learning mechanisms by integrating fast and slow thinking systems and selectively identifying key objects essential to fulfilling user demands. CogDDN identifies appropriate target objects by semantically aligning detected objects with the given instructions. Furthermore, it incorporates a dual-process decision-making module, comprising a Heuristic Process for rapid, efficient decisions and an Analytic Process that analyzes past errors, accumulates them in a knowledge base, and continuously improves performance. Chain of Thought (CoT) reasoning strengthens the decision-making process. Extensive closed-loop evaluations on the AI2Thor simulator with the ProcThor dataset show that CogDDN outperforms single-view camera-only methods by 15\%, demonstrating significant improvements in navigation accuracy and adaptability. The project page is available at https://yuehaohuang.github.io/CogDDN/.
中文:CogDDN是一种基于视觉语言模型的认知驱动导航框架,通过模拟人类快慢思维决策机制,显著提升了机器人在未知环境中的导航准确性和适应性。
English: CogDDN is a cognitive-driven navigation framework that uses visual language models to emulate human thinking, integrating fast and slow decision processes to improve robot adaptability and accuracy in unknown environments.
Authors:Zewen Bai, Liang Yang, Shengdi Yin, Yuanyuan Sun, Hongfei Lin
Abstract:
The proliferation of hate speech has inflicted significant societal harm, with its intensity and directionality closely tied to specific targets and arguments. In recent years, numerous machine learning-based methods have been developed to detect hateful comments on online platforms automatically. However, research on Chinese hate speech detection lags behind, and interpretability studies face two major challenges: first, the scarcity of span-level fine-grained annotated datasets limits models' deep semantic understanding of hate speech; second, insufficient research on identifying and interpreting coded hate speech restricts model explainability in complex real-world scenarios. To address these, we make the following contributions: (1) We introduce the Span-level Target-Aware Toxicity Extraction dataset (STATE ToxiCN), the first span-level Chinese hate speech dataset, and evaluate the hate semantic understanding of existing models using it. (2) We conduct the first comprehensive study on Chinese coded hate terms, LLMs' ability to interpret hate semantics. (3) We propose a method to integrate an annotated lexicon into models, significantly enhancing hate speech detection performance. Our work provides valuable resources and insights to advance the interpretability of Chinese hate speech detection research.
中文:本研究通过推出首个中文细粒度仇恨言论标注数据集,系统分析隐晦仇恨用语,并提出词典融合方法,显著提升了中文仇恨言论检测的可解释性与检测效能。
English: This study addresses gaps in Chinese hate speech detection by introducing the first span-level annotated dataset, analyzing coded hate terms, and proposing a lexicon-integrated method to enhance model interpretability and performance.
Authors:Luca Beber, Edoardo Lamon, Giacomo Moretti, Matteo Saveriano, Luca Fambri, Luigi Palopoli, Daniele Fontanelli
Abstract:
Diagnostic activities, such as ultrasound scans and palpation, are relatively low-cost. They play a crucial role in the early detection of health problems and in assessing their progression. However, they are also error-prone activities, which require highly skilled medical staff. The use of robotic solutions can be key to decreasing the inherent subjectivity of the results and reducing the waiting list. For a robot to perform palpation or ultrasound scans, it must effectively manage physical interactions with the human body, which greatly benefits from precise estimation of the patient's tissue biomechanical properties. This paper assesses the accuracy and precision of a robotic system in estimating the viscoelastic parameters of various materials, including some tests on ex vivo tissues as a preliminary proof-of-concept demonstration of the method's applicability to biological samples. The measurements are compared against a ground truth derived from silicone specimens with different viscoelastic properties, characterised using a high-precision instrument. Experimental results show that the robotic system's accuracy closely matches the ground truth, increasing confidence in the potential use of robots for such clinical applications.
中文: 机器人系统通过精确估计组织生物力学特性,有望减少触诊和超声等诊断过程中的错误和主观性,实验结果显示其在测量粘弹性参数方面与基准值高度吻合,增强了其在临床应用中的潜力。
English: Robotic systems offer a promising solution to reduce errors and subjectivity in diagnostic procedures like palpation and ultrasound by accurately estimating tissue biomechanical properties, with experimental results demonstrating high accuracy in viscoelastic parameter measurements compared to ground truth.
Authors:Shuchang Ye, Usman Naseem, Mingyuan Meng, Jinman Kim
Abstract:
Medical language-guided segmentation, integrating textual clinical reports as auxiliary guidance to enhance image segmentation, has demonstrated significant improvements over unimodal approaches. However, its inherent reliance on paired image-text input, which we refer to as ``textual reliance", presents two fundamental limitations: 1) many medical segmentation datasets lack paired reports, leaving a substantial portion of image-only data underutilized for training; and 2) inference is limited to retrospective analysis of cases with paired reports, limiting its applicability in most clinical scenarios where segmentation typically precedes reporting. To address these limitations, we propose ProLearn, the first Prototype-driven Learning framework for language-guided segmentation that fundamentally alleviates textual reliance. At its core, we introduce a novel Prototype-driven Semantic Approximation (PSA) module to enable approximation of semantic guidance from textual input. PSA initializes a discrete and compact prototype space by distilling segmentation-relevant semantics from textual reports. Once initialized, it supports a query-and-respond mechanism which approximates semantic guidance for images without textual input, thereby alleviating textual reliance. Extensive experiments on QaTa-COV19, MosMedData+ and Kvasir-SEG demonstrate that ProLearn outperforms state-of-the-art language-guided methods when limited text is available.
中文摘要:ProLearn是一种原型驱动学习框架,通过从无配对报告的图像中近似语义指导,克服了医学图像分割对文本依赖的限制,在文本数据稀缺时仍能提升分割性能。
English Summary: ProLearn is a prototype-driven learning framework that overcomes the limitations of text-dependent medical image segmentation by approximating semantic guidance from images without paired reports, enhancing performance even with scarce textual data.
Authors:Yushun Fang, Lu Liu, Xiang Gao, Qiang Hu, Ning Cao, Jianghe Cui, Gang Chen, Xiaoyun Zhang
Abstract:
The latest developments in Face Restoration have yielded significant advancements in visual quality through the utilization of diverse diffusion priors. Nevertheless, the uncertainty of face identity introduced by identity-obscure inputs and stochastic generative processes remains unresolved. To address this challenge, we present Robust ID-Specific Face Restoration (RIDFR), a novel ID-specific face restoration framework based on diffusion models. Specifically, RIDFR leverages a pre-trained diffusion model in conjunction with two parallel conditioning modules. The Content Injection Module inputs the severely degraded image, while the Identity Injection Module integrates the specific identity from a given image. Subsequently, RIDFR incorporates Alignment Learning, which aligns the restoration results from multiple references with the same identity in order to suppress the interference of ID-irrelevant face semantics (e.g. pose, expression, make-up, hair style). Experiments demonstrate that our framework outperforms the state-of-the-art methods, reconstructing high-quality ID-specific results with high identity fidelity and demonstrating strong robustness.
Chinese: 提出的鲁棒身份特定人脸恢复(RIDFR)框架采用扩散模型,结合双条件模块和对齐学习,有效解决了身份模糊输入问题,实现了高保真身份还原和强鲁棒性的先进人脸恢复效果。
English: The proposed Robust ID-Specific Face Restoration (RIDFR) framework utilizes diffusion models with dual conditioning modules and alignment learning to achieve superior face restoration with high identity fidelity and robustness against identity-obscure inputs.
Authors:Chen Su, Yuanhe Tian, Qinyu Liu, Jun Zhang, Yan Song
Abstract:
Recently, large language models (LLMs) have demonstrated powerful capabilities in performing various tasks and thus are applied by recent studies to time series forecasting (TSF) tasks, which predict future values with the given historical time series. Existing LLM-based approaches transfer knowledge learned from text data to time series prediction using prompting or fine-tuning strategies. However, LLMs are proficient at reasoning over discrete tokens and semantic patterns but are not initially designed to model continuous numerical time series data. The gaps between text and time series data lead LLMs to achieve inferior performance to a vanilla Transformer model that is directly trained on TSF data. However, the vanilla Transformers often struggle to learn high-level semantic patterns. In this paper, we design a novel Transformer-based architecture that complementarily leverages LLMs and vanilla Transformers, so as to integrate the high-level semantic representations learned by LLMs into the temporal information encoded by time series Transformers, where a hybrid representation is obtained by fusing the representations from the LLM and the Transformer. The resulting fused representation contains both historical temporal dynamics and semantic variation patterns, allowing our model to predict more accurate future values. Experiments on benchmark datasets demonstrate the effectiveness of the proposed approach.
中文: 本文提出了一种新颖的基于Transformer的架构,通过协同结合大语言模型和普通Transformer,将高层次语义模式与时间动态信息相融合以改进时间序列预测,在基准数据集上实现了更优性能。
English: This paper introduces a novel Transformer-based architecture that synergistically combines large language models (LLMs) and vanilla Transformers to enhance time series forecasting by integrating high-level semantic patterns with temporal dynamics, achieving superior performance on benchmark datasets.
Authors:Chenxi Huang, Shaotian Yan, Liang Xie, Binbin Lin, Sinan Fan, Yue Xin, Deng Cai, Chen Shen, Jieping Ye
Abstract:
Representation Fine-tuning (ReFT), a recently proposed Parameter-Efficient Fine-Tuning (PEFT) method, has attracted widespread attention for significantly improving parameter efficiency by editing representation space alone. In this work, we investigate applying ReFT to complex reasoning tasks. However, directly using the native ReFT method, which modifies fixed representations at the beginning and end of each layer, yields suboptimal performance, as these fixed-position representations have uncertain impact on the outputs. We observe that, in complex reasoning tasks, there often exist certain critical representations. These representations either integrate significant information from preceding layers or regulate subsequent layer representations. Through layer-by-layer propagation, they exert a substantial influence on the final output. Naturally, fine-tuning these critical representations has the potential to greatly enhance reasoning performance. Building upon these insights, we propose Critical Representation Fine-Tuning (CRFT), a novel method that identifies and optimizes these critical representations through information flow analysis. CRFT operates within a supervised learning framework, dynamically optimizing critical representations in a low-rank linear subspace while freezing the base model. The effectiveness and efficiency of our method are validated across eight benchmarks for arithmetic and commonsense reasoning, using LLaMA and Mistral model families. Furthermore, our method also adapts effectively to few-shot settings, boosting one-shot accuracy by 16.4%. Our work highlights the untapped potential of representation-level optimization for CoT reasoning, offering a lightweight yet powerful alternative to traditional PEFT methods.
中文: 本文提出关键表征微调(CRFT)方法,通过信息流分析识别并动态优化关键表征,在多个推理基准测试中展现出卓越的效能与效率,同时能有效适应少样本学习场景。
English: This paper introduces Critical Representation Fine-Tuning (CRFT), a novel method that enhances reasoning performance by dynamically optimizing critical representations identified through information flow analysis, achieving superior efficiency and effectiveness across multiple benchmarks while adapting well to few-shot settings.
Authors:Weiyu Chen, Chengjie Liu, Wenhao Huang, Jinyang Lyu, Mingqian Yang, Yuan Du, Li Du, Jun Yang
Abstract:
Recent advancements have demonstrated the significant potential of large language models (LLMs) in analog circuit design. Nevertheless, testbench construction for analog circuits remains manual, creating a critical bottleneck in achieving fully automated design processes. Particularly when replicating circuit designs from academic papers, manual Testbench construction demands time-intensive implementation and frequent adjustments, which fails to address the dynamic diversity and flexibility requirements for automation. AnalogTester tackles automated analog design challenges through an LLM-powered pipeline: a) domain-knowledge integration, b) paper information extraction, c) simulation scheme synthesis, and d) testbench code generation with Tsinghua Electronic Design (TED). AnalogTester has demonstrated automated Testbench generation capabilities for three fundamental analog circuit types: operational amplifiers (op-amps), bandgap references (BGRs), and low-dropout regulators (LDOs), while maintaining a scalable framework for adaptation to broader circuit topologies. Furthermore, AnalogTester can generate circuit knowledge data and TED code corpus, establishing fundamental training datasets for LLM specialization in analog circuit design automation.
中文摘要:AnalogTester通过大语言模型驱动的流程实现了模拟电路测试平台的自动化生成,突破了手动瓶颈,支持三种基本电路类型并创建了专用训练数据集。
English Summary: AnalogTester utilizes a large language model pipeline to automate analog circuit testbench generation, overcoming manual bottlenecks and supporting three fundamental circuit types while creating specialized training datasets.
Authors:Junjie Liu, Yuanhe Tian, Yan Song
Abstract:
Aspect-based sentiment analysis (ABSA) is a crucial fine-grained task in social media scenarios to identify the sentiment polarity of specific aspect terms in a sentence. Although many existing studies leverage large language models (LLMs) to perform ABSA due to their strong context understanding capabilities, they still face challenges to learn the context information in the running text because of the short text, as well as the small and unbalanced labeled training data, where most data are labeled with positive sentiment. Data augmentation (DA) is a feasible strategy for providing richer contextual information, especially when using LLMs to create synthetic training data, but faces challenges in ensuring a high quality of the augmented data.In this paper, we propose an LLM-based ABSA approach with training data augmentation.Specifically, an LLM is prompted to generate augmented training data based on the original training data, so as to construct a new training data with larger size and balanced label distributions to better train an ABSA model. Meanwhile, in order to improve the quality of the augmented data, we propose a reinforcement learning approach to optimize the data augmentation. LLM.Experiment results and further analyses on English benchmark datasets for ABSA demonstrate the effectiveness of our approach, where superior performance is observed over strong baselines and most existing studies.
中文摘要:本研究提出了一种基于大语言模型的数据增强方法,通过生成平衡且高质量的训练数据来改进方面级情感分析,并采用强化学习优化增强质量,实验证明其性能优于现有方法。
English Summary: This study introduces a data augmentation method using large language models to generate balanced, high-quality training data for aspect-based sentiment analysis, enhanced by reinforcement learning to optimize augmentation quality and improve model performance over existing approaches.
Authors:Gustavo A. Oliva, Gopi Krishnan Rajbahadur, Aaditya Bhatia, Haoxiang Zhang, Yihao Chen, Zhilong Chen, Arthur Leung, Dayi Lin, Boyuan Chen, Ahmed E. Hassan
Abstract:
High-quality labeled datasets are crucial for training and evaluating foundation models in software engineering, but creating them is often prohibitively expensive and labor-intensive. We introduce SPICE, a scalable, automated pipeline for labeling SWE-bench-style datasets with annotations for issue clarity, test coverage, and effort estimation. SPICE combines context-aware code navigation, rationale-driven prompting, and multi-pass consensus to produce labels that closely approximate expert annotations. SPICE's design was informed by our own experience and frustration in labeling more than 800 instances from SWE-Gym. SPICE achieves strong agreement with human-labeled SWE-bench Verified data while reducing the cost of labeling 1,000 instances from around \$100,000 (manual annotation) to just \$5.10. These results demonstrate SPICE's potential to enable cost-effective, large-scale dataset creation for SE-focused FMs. To support the community, we release both SPICE tool and SPICE Bench, a new dataset of 6,802 SPICE-labeled instances curated from 291 open-source projects in SWE-Gym (over 13x larger than SWE-bench Verified).
中文: SPICE是一种自动化标注流水线,能以接近专家水准的精度高效标注软件工程数据集,将每千条数据的标注成本从10万美元降至5.1美元,同时支持大规模数据集构建。
English: SPICE is an automated pipeline that efficiently labels software engineering datasets with expert-level accuracy, reducing annotation costs from $100,000 to just $5.10 per 1,000 instances while enabling large-scale dataset creation.
Authors:Karishma Battina, Prathamesh Dinesh Joshi, Raj Abhijit Dandekar, Rajat Dandekar, Sreedath Panat
Abstract:
Traditional numerical methods often struggle with the complexity and scale of modeling pollutant transport across vast and dynamic oceanic domains. This paper introduces a Physics-Informed Neural Network (PINN) framework to simulate the dispersion of pollutants governed by the 2D advection-diffusion equation. The model achieves physically consistent predictions by embedding physical laws and fitting to noisy synthetic data, generated via a finite difference method (FDM), directly into the neural network training process. This approach addresses challenges such as non-linear dynamics and the enforcement of boundary and initial conditions. Synthetic data sets, augmented with varying noise levels, are used to capture real-world variability. The training incorporates a hybrid loss function including PDE residuals, boundary/initial condition conformity, and a weighted data fit term. The approach takes advantage of the Julia language scientific computing ecosystem for high-performance simulations, offering a scalable and flexible alternative to traditional solvers
中文: 本文提出了一种物理信息神经网络框架,通过嵌入物理定律并处理含噪声数据,有效模拟海洋污染物扩散,为传统数值方法提供了可扩展的替代方案。
English: This paper presents a Physics-Informed Neural Network (PINN) framework that effectively simulates pollutant dispersion in oceans by embedding physical laws and handling noisy data, providing a scalable alternative to traditional numerical methods.
Authors:Perry Dong, Qiyang Li, Dorsa Sadigh, Chelsea Finn
Abstract:
We study the problem of training and fine-tuning expressive policies with online reinforcement learning (RL) given an offline dataset. Training expressive policy classes with online RL present a unique challenge of stable value maximization. Unlike simpler Gaussian policies commonly used in online RL, expressive policies like diffusion and flow-matching policies are parameterized by a long denoising chain, which hinders stable gradient propagation from actions to policy parameters when optimizing against some value function. Our key insight is that we can address stable value maximization by avoiding direct optimization over value with the expressive policy and instead construct an on-the-fly RL policy to maximize Q-value. We propose Expressive Policy Optimization (EXPO), a sample-efficient online RL algorithm that utilizes an on-the-fly policy to maximize value with two parameterized policies -- a larger expressive base policy trained with a stable imitation learning objective and a light-weight Gaussian edit policy that edits the actions sampled from the base policy toward a higher value distribution. The on-the-fly policy optimizes the actions from the base policy with the learned edit policy and chooses the value maximizing action from the base and edited actions for both sampling and temporal-difference (TD) backup. Our approach yields up to 2-3x improvement in sample efficiency on average over prior methods both in the setting of fine-tuning a pretrained policy given offline data and in leveraging offline data to train online.
中文: 本研究提出表达性策略优化(EXPO),一种在线强化学习算法,通过结合稳定的表达性基础策略和轻量级高斯编辑策略来最大化价值,无需直接优化,样本效率较现有方法提升高达2-3倍。
English: The study introduces Expressive Policy Optimization (EXPO), an online RL algorithm that enhances sample efficiency by combining a stable expressive base policy with a lightweight Gaussian edit policy to maximize value without direct optimization, achieving up to 2-3x improvements over prior methods.
Authors:Xinyu Li, Tongguang Li, Lixiang Yan, Yuheng Li, Linxuan Zhao, Mladen RakoviÄ, Inge Molenaar, Dragan GaÅ¡eviÄ, Yizhou Fan
Abstract:
SRL, defined as learners' ability to systematically plan, monitor, and regulate their learning activities, is crucial for sustained academic achievement and lifelong learning competencies. Emerging Artificial Intelligence (AI) developments profoundly influence SRL interactions by potentially either diminishing or strengthening learners' opportunities to exercise their own regulatory skills. Recent literature emphasizes a balanced approach termed Hybrid Human-AI Regulated Learning (HHAIRL), in which AI provides targeted, timely scaffolding while preserving the learners' role as active decision-makers and reflective monitors of their learning process. Nevertheless, existing digital tools frequently fall short, lacking adaptability, focusing narrowly on isolated SRL phases, and insufficiently support meaningful human-AI interactions. In response, this paper introduces the enhanced FLoRA Engine, which incorporates advanced Generative Artificial Intelligence (GenAI) features and state-of-the-art learning analytics, explicitly grounded in SRL and HHAIRL theories. The FLoRA Engine offers instrumentation tools such as collaborative writing, multi-agents chatbot, and detailed learning trace logging to support dynamic, adaptive scaffolding tailored to individual needs in real time. We further present a summary of several research studies that provide the validations for and illustrate how these instrumentation tools can be utilized in real-world educational and experimental contexts. These studies demonstrate the effectiveness of FLoRA Engine in fostering SRL and HHAIRL, providing both theoretical insights and practical solutions for the future of AI-enhanced learning context.
中文:增强版FLoRA引擎融合生成式人工智能与学习分析技术,通过动态自适应支架支持自主调节学习与人机协同教育,多项研究已验证其有效性。
English: The enhanced FLoRA Engine integrates Generative AI and learning analytics to provide adaptive, real-time scaffolding that supports self-regulated learning and human-AI collaboration in education, as validated by multiple studies.
Authors:Karthik Pappu, Prathamesh Dinesh Joshi, Raj Abhijit Dandekar, Rajat Dandekar, Sreedath Panat
Abstract:
Accurately modeling malware propagation is essential for designing effective cybersecurity defenses, particularly against adaptive threats that evolve in real time. While traditional epidemiological models and recent neural approaches offer useful foundations, they often fail to fully capture the nonlinear feedback mechanisms present in real-world networks. In this work, we apply scientific machine learning to malware modeling by evaluating three approaches: classical Ordinary Differential Equations (ODEs), Universal Differential Equations (UDEs), and Neural ODEs. Using data from the Code Red worm outbreak, we show that the UDE approach substantially reduces prediction error compared to both traditional and neural baselines by 44%, while preserving interpretability. We introduce a symbolic recovery method that transforms the learned neural feedback into explicit mathematical expressions, revealing suppression mechanisms such as network saturation, security response, and malware variant evolution. Our results demonstrate that hybrid physics-informed models can outperform both purely analytical and purely neural approaches, offering improved predictive accuracy and deeper insight into the dynamics of malware spread. These findings support the development of early warning systems, efficient outbreak response strategies, and targeted cyber defense interventions.
中文摘要:本研究表明,融合物理信息的混合模型(特别是通用微分方程方法)能通过符号化还原网络动态机制,在保持模型可解释性的同时将恶意软件传播预测准确率显著提升44%。
English Summary: This study demonstrates that hybrid physics-informed models, particularly Universal Differential Equations, significantly enhance malware propagation prediction accuracy by 44% while maintaining interpretability through symbolic recovery of network dynamics.
Authors:Sunwoo Kim, Haneul Yoo, Alice Oh
Abstract:
Understanding how large language models (LLMs) internally represent and process their predictions is central to detecting uncertainty and preventing hallucinations. While several studies have shown that models encode uncertainty in their hidden states, it is underexplored how this affects the way they process such hidden states. In this work, we demonstrate that the dynamics of output token probabilities across layers for certain and uncertain outputs are largely aligned, revealing that uncertainty does not seem to affect inference dynamics. Specifically, we use the Tuned Lens, a variant of the Logit Lens, to analyze the layer-wise probability trajectories of final prediction tokens across 11 datasets and 5 models. Using incorrect predictions as those with higher epistemic uncertainty, our results show aligned trajectories for certain and uncertain predictions that both observe abrupt increases in confidence at similar layers. We balance this finding by showing evidence that more competent models may learn to process uncertainty differently. Our findings challenge the feasibility of leveraging simplistic methods for detecting uncertainty at inference. More broadly, our work demonstrates how interpretability methods may be used to investigate the way uncertainty affects inference.
中文: 本研究表明,大型语言模型中的不确定性不会显著改变推理动态,因为确定和不确定的预测在跨层时都表现出相似的置信度轨迹,这对简单的确定性检测方法提出了挑战。
English: This study reveals that uncertainty in large language models does not significantly alter inference dynamics, as both certain and uncertain predictions show similar confidence trajectories across layers, challenging simplistic uncertainty detection methods.
Authors:Wenxiang Guo, Yu Zhang, Changhao Pan, Zhiyuan Zhu, Ruiqi Li, Zhetao Chen, Wenhao Xu, Fei Wu, Zhou Zhao
Abstract:
Recent breakthroughs in singing voice synthesis (SVS) have heightened the demand for high-quality annotated datasets, yet manual annotation remains prohibitively labor-intensive and resource-intensive. Existing automatic singing annotation (ASA) methods, however, primarily tackle isolated aspects of the annotation pipeline. To address this fundamental challenge, we present STARS, which is, to our knowledge, the first unified framework that simultaneously addresses singing transcription, alignment, and refined style annotation. Our framework delivers comprehensive multi-level annotations encompassing: (1) precise phoneme-audio alignment, (2) robust note transcription and temporal localization, (3) expressive vocal technique identification, and (4) global stylistic characterization including emotion and pace. The proposed architecture employs hierarchical acoustic feature processing across frame, word, phoneme, note, and sentence levels. The novel non-autoregressive local acoustic encoders enable structured hierarchical representation learning. Experimental validation confirms the framework's superior performance across multiple evaluation dimensions compared to existing annotation approaches. Furthermore, applications in SVS training demonstrate that models utilizing STARS-annotated data achieve significantly enhanced perceptual naturalness and precise style control. This work not only overcomes critical scalability challenges in the creation of singing datasets but also pioneers new methodologies for controllable singing voice synthesis. Audio samples are available at https://gwx314.github.io/stars-demo/.
中文摘要:STARS是首个统一处理歌唱转录、对齐和风格标注的框架,解决了歌唱数据集创建的可扩展性难题,显著提升了合成歌声的自然度与风格控制能力。
English Summary: STARS is the first unified framework for simultaneous singing transcription, alignment, and style annotation, overcoming scalability challenges in dataset creation and enabling enhanced naturalness and style control in singing voice synthesis.
Authors:Hanyang Peng, Shuang Qin, Yue Yu, Fangqing Jiang, Hui Wang, Wen Gao
Abstract:
Adam has proven remarkable successful in training deep neural networks, but the mechanisms underlying its empirical successes and limitations remain underexplored. In this study, we demonstrate that the effectiveness of Adam stems largely from its similarity to SignSGD in robustly handling large gradient fluctuations, yet it is also vulnerable to destabilizing loss spikes due to its uncontrolled update scaling. To enhance the advantage of Adam and mitigate its limitation, we propose SignSoftSGD (S3), a novel optimizer with three key innovations. \emph{First}, S3 generalizes the sign-like update by employing a flexible $p$-th order momentum ($p \geq 1$) in the denominator, departing from the conventional second-order momentum (variance) preconditioning. This design enables enhanced performance while achieving stable training even with aggressive learning rates. \emph{Second}, S3 minimizes the occurrences of loss spikes through unified exponential moving average coefficients for numerator and denominator momenta, which inherently bound updates to $[-1, 1]$ and simplify hyperparameter tuning. \emph{Third}, S3 incorporates an equivalent Nesterov's accelerated gradient(NAG) module, accelerating convergence without memory overhead. Theoretically, we prove that S3 achieves the optimal convergence rate of $O\left(\frac{1}{T^{\sfrac{1}{4}}}\right)$ for general nonconvex stochastic optimization under weak assumptions. Extensive experiments across a range of vision and language tasks show that \textsf{\small S3} not only converges more rapidly and improves performance but also rarely experiences loss spikes, even with a \textbf{$\bm{10 \times}$} larger learning rate. In fact, S3 delivers performance comparable to or better than AdamW with \textbf{$2 \times$} the training steps, establishing its efficacy in both efficiency and final task performance.
中文: 研究表明Adam的成功源于其类似符号更新的方式处理大梯度,但存在不稳定性,因此提出了SignSoftSGD(S3),该优化器通过创新设计提升了性能,确保了训练稳定性,并在高学习率下避免损失尖峰,加速收敛。
English: The study reveals that Adam's success is due to its sign-like update handling large gradients, but it suffers from instability, leading to the development of SignSoftSGD (S3), which enhances performance, ensures stable training, and accelerates convergence without loss spikes even at high learning rates.
Authors:Changchun Yang, Haoyang Li, Yushuai Wu, Yilan Zhang, Yifeng Jiao, Yu Zhang, Rihan Huang, Yuan Cheng, Yuan Qi, Xin Guo, Xin Gao
Abstract:
While pathology foundation models have transformed cancer image analysis, they often lack integration with molecular data at single-cell resolution, limiting their utility for precision oncology. Here, we present PAST, a pan-cancer single-cell foundation model trained on 20 million paired histopathology images and single-cell transcriptomes spanning multiple tumor types and tissue contexts. By jointly encoding cellular morphology and gene expression, PAST learns unified cross-modal representations that capture both spatial and molecular heterogeneity at the cellular level. This approach enables accurate prediction of single-cell gene expression, virtual molecular staining, and multimodal survival analysis directly from routine pathology slides. Across diverse cancers and downstream tasks, PAST consistently exceeds the performance of existing approaches, demonstrating robust generalizability and scalability. Our work establishes a new paradigm for pathology foundation models, providing a versatile tool for high-resolution spatial omics, mechanistic discovery, and precision cancer research.
中文: PAST是一种泛癌单细胞基础模型,通过整合组织病理学图像与单细胞转录组数据,学习统一的跨模态表征,能够实现单细胞基因表达预测、虚拟分子染色和多模态生存分析,在多种癌症任务中均优于现有方法。
English: PAST is a pan-cancer single-cell foundation model that integrates histopathology images with single-cell transcriptomes to create unified cross-modal representations, enabling accurate gene expression prediction, virtual molecular staining, and survival analysis while outperforming existing methods across diverse cancer tasks.
Authors:Shuning Zhang, Hui Wang, Xin Yi
Abstract:
As Artificial Intelligence (AI) increasingly becomes an active collaborator in co-creation, understanding the distribution and dynamic of agency is paramount. The Human-Computer Interaction (HCI) perspective is crucial for this analysis, as it uniquely reveals the interaction dynamics and specific control mechanisms that dictate how agency manifests in practice. Despite this importance, a systematic synthesis mapping agency configurations and control mechanisms within the HCI/CSCW literature is lacking. Addressing this gap, we reviewed 134 papers from top-tier HCI/CSCW venues (e.g., CHI, UIST, CSCW) over the past 20 years. This review yields four primary contributions: (1) an integrated theoretical framework structuring agency patterns, control mechanisms, and interaction contexts, (2) a comprehensive operational catalog of control mechanisms detailing how agency is implemented; (3) an actionable cross-context map linking agency configurations to diverse co-creative practices; and (4) grounded implications and guidance for future CSCW research and the design of co-creative systems, addressing aspects like trust and ethics.
中文摘要:本研究通过分析134篇人机交互文献,构建了人类与AI协同创作中的代理权配置与控制机制框架,为未来协同系统的设计与伦理考量提供了实践指导。
English Summary: This study synthesizes 134 HCI/CSCW papers to develop a framework mapping agency configurations and control mechanisms in human-AI co-creation, offering practical guidance for future system design.
Authors:Shuning Zhang, Hui Wang, Xin Yi
Abstract:
As Artificial Intelligence (AI) increasingly becomes an active collaborator in co-creation, understanding the distribution and dynamic of agency is paramount. The Human-Computer Interaction (HCI) perspective is crucial for this analysis, as it uniquely reveals the interaction dynamics and specific control mechanisms that dictate how agency manifests in practice. Despite this importance, a systematic synthesis mapping agency configurations and control mechanisms within the HCI/CSCW literature is lacking. Addressing this gap, we reviewed 134 papers from top-tier HCI/CSCW venues (e.g., CHI, UIST, CSCW) over the past 20 years. This review yields four primary contributions: (1) an integrated theoretical framework structuring agency patterns, control mechanisms, and interaction contexts, (2) a comprehensive operational catalog of control mechanisms detailing how agency is implemented; (3) an actionable cross-context map linking agency configurations to diverse co-creative practices; and (4) grounded implications and guidance for future CSCW research and the design of co-creative systems, addressing aspects like trust and ethics.
中文摘要:本研究通过分析134篇人机交互文献,构建了人类与AI协同创作中的代理权配置与控制机制框架,为未来协同系统的设计与伦理考量提供了实践指导。
English Summary: This study synthesizes 134 HCI/CSCW papers to develop a framework mapping agency configurations and control mechanisms in human-AI co-creation, offering practical guidance for future system design.
Authors:Hanyang Peng, Shuang Qin, Yue Yu, Fangqing Jiang, Hui Wang, Zhouchen Lin
Abstract:
Adam is widely recognized as one of the most effective optimizers for training deep neural networks (DNNs). Despite its remarkable empirical success, its theoretical convergence analysis remains unsatisfactory. Existing works predominantly interpret Adam as a preconditioned stochastic gradient descent with momentum (SGDM), formulated as $\bm{x}_{t+1} = \bm{x}_t - \frac{γ_t}{{\sqrt{\bm{v}_t}+ε}} \circ \bm{m}_t$. This perspective necessitates strong assumptions and intricate techniques, resulting in lengthy and opaque convergence proofs that are difficult to verify and extend. In contrast, we propose a novel interpretation by treating Adam as a sign-like optimizer, expressed as $\bm{x}_{t+1} = \bm{x}_t - γ_t \frac{|\bm{m}_t|}{{\sqrt{\bm{v}_t}+ε}} \circ {\rm Sign}(\bm{m}_t)$. This reformulation significantly simplifies the convergence analysis. For the first time, with some mild conditions, we prove that Adam achieves the optimal rate of ${\cal O}(\frac{1}{T^{\sfrac{1}{4}}})$ rather than the previous ${\cal O} \left(\frac{\ln T}{T^{\sfrac{1}{4}}}\right)$ under weak assumptions of the generalized $p$-affine variance and $(L_0, L_1, q)$-smoothness, without dependence on the model dimensionality or the numerical stability parameter $ε$. Additionally, our theoretical analysis provides new insights into the role of momentum as a key factor ensuring convergence and offers practical guidelines for tuning learning rates in Adam, further bridging the gap between theory and practice.
中文: 该摘要提出将Adam视为符号类优化器的新解释,简化了收敛性分析并证明其在温和条件下达到最优收敛速率,同时揭示了动量的关键作用并提供了实际调参指导。
English: The abstract proposes a novel interpretation of Adam as a sign-like optimizer, which simplifies convergence analysis and proves it achieves an optimal convergence rate under mild conditions, offering insights into momentum's role and practical tuning guidelines.
Authors:Yujie Hu, Xuanyu Zhang, Weiqi Li, Jian Zhang
Abstract:
Virtual try-on has made significant progress in recent years. This paper addresses how to achieve multifunctional virtual try-on guided solely by text instructions, including full outfit change and local editing. Previous methods primarily relied on end-to-end networks to perform single try-on tasks, lacking versatility and flexibility. We propose TalkFashion, an intelligent try-on assistant that leverages the powerful comprehension capabilities of large language models to analyze user instructions and determine which task to execute, thereby activating different processing pipelines accordingly. Additionally, we introduce an instruction-based local repainting model that eliminates the need for users to manually provide masks. With the help of multi-modal models, this approach achieves fully automated local editings, enhancing the flexibility of editing tasks. The experimental results demonstrate better semantic consistency and visual quality compared to the current methods.
中文: 本文提出TalkFashion智能试衣助手,利用大语言模型解析文本指令实现多功能虚拟试衣和自动化局部编辑,相比现有方法在语义一致性和视觉质量上表现更优。
English: This paper introduces TalkFashion, an intelligent virtual try-on assistant that uses large language models to interpret text instructions for versatile outfit changes and automated local editing, achieving superior semantic consistency and visual quality over existing methods.
Authors:Mohammad Zia Ur Rehman, Aditya Shah, Nagendra Kumar
Abstract:
The global reach of social media has amplified the spread of hateful content, including implicit sexism, which is often overlooked by conventional detection methods. In this work, we introduce an Adaptive Supervised Contrastive lEarning framework for implicit sexism detectioN (ASCEND). A key innovation of our method is the incorporation of threshold-based contrastive learning: by computing cosine similarities between embeddings, we selectively treat only those sample pairs as positive if their similarity exceeds a learnable threshold. This mechanism refines the embedding space by robustly pulling together representations of semantically similar texts while pushing apart dissimilar ones, thus reducing false positives and negatives. The final classification is achieved by jointly optimizing a contrastive loss with a cross-entropy loss. Textual features are enhanced through a word-level attention module. Additionally, we employ sentiment, emotion, and toxicity features. Evaluations on the EXIST2021 and MLSC datasets demonstrate that ASCEND significantly outperforms existing methods, with average Macro F1 improvements of 9.86%, 29.63%, and 32.51% across multiple tasks, highlighting its efficacy in capturing the subtle cues of implicit sexist language.
Chinese: 本文提出ASCEND自适应监督对比学习框架,通过阈值对比学习和综合文本特征增强隐性性别歧视检测,相比现有方法实现了显著性能提升。
English: This paper introduces ASCEND, an adaptive supervised contrastive learning framework that enhances implicit sexism detection by using threshold-based contrastive learning and integrated textual features, achieving significant performance improvements over existing methods.
Authors:Yixiao Ge, Giulio Delama, Martin Scheiber, Alessandro Fornasier, Pieter van Goor, Stephan Weiss, Robert Mahony
Abstract:
The extended Kalman filter (EKF) has been the industry standard for state estimation problems over the past sixty years. The Invariant Extended Kalman Filter (IEKF) is a recent development of the EKF for the class of group-affine systems on Lie groups that has shown superior performance for inertial navigation problems. The IEKF comes in two versions, left- and right- handed respectively, and there is a perception in the robotics community that these filters are different and one should choose the handedness of the IEKF to match handedness of the measurement model for a given filtering problem. In this paper, we revisit these algorithms and demonstrate that the left- and right- IEKF algorithms (with reset step) are identical, that is, the choice of the handedness does not affect the IEKF's performance when the reset step is properly implemented. The reset step was not originally proposed as part of the IEKF, however, we provide simulations to show that the reset step improves asymptotic performance of all versions of the the filter, and should be included in all high performance algorithms. The GNSS-aided inertial navigation system (INS) is used as a motivating example to demonstrate the equivalence of the two filters.
Chinese: 不变扩展卡尔曼滤波器(IEKF)作为EKF在群仿射系统上的发展,证明了在正确实施重置步骤时,左右手版本是等效的,且重置步骤能提升所有版本滤波器的渐近性能。
English: The Invariant Extended Kalman Filter (IEKF), an advanced version of the EKF for group-affine systems, demonstrates that its left- and right-handed variants are identical when the reset step is implemented, which also enhances the asymptotic performance of all filter versions.
Authors:Zheyuan Liu, Munan Ning, Qihui Zhang, Shuo Yang, Zhongrui Wang, Yiwei Yang, Xianzhe Xu, Yibing Song, Weihua Chen, Fan Wang, Li Yuan
Abstract:
Current text-to-image (T2I) generation models struggle to align spatial composition with the input text, especially in complex scenes. Even layout-based approaches yield suboptimal spatial control, as their generation process is decoupled from layout planning, making it difficult to refine the layout during synthesis. We present CoT-Diff, a framework that brings step-by-step CoT-style reasoning into T2I generation by tightly integrating Multimodal Large Language Model (MLLM)-driven 3D layout planning with the diffusion process. CoT-Diff enables layout-aware reasoning inline within a single diffusion round: at each denoising step, the MLLM evaluates intermediate predictions, dynamically updates the 3D scene layout, and continuously guides the generation process. The updated layout is converted into semantic conditions and depth maps, which are fused into the diffusion model via a condition-aware attention mechanism, enabling precise spatial control and semantic injection. Experiments on 3D Scene benchmarks show that CoT-Diff significantly improves spatial alignment and compositional fidelity, and outperforms the state-of-the-art method by 34.7% in complex scene spatial accuracy, thereby validating the effectiveness of this entangled generation paradigm.
中文: CoT-Diff提出了一种将多模态大语言模型驱动的3D布局规划与扩散过程紧密结合的新框架,通过在生成过程中动态更新布局并实现精确空间控制,显著提升了图像的空间对齐效果,在复杂场景中的空间准确率比现有最佳方法高出34.7%。
English: CoT-Diff introduces a novel framework that integrates MLLM-driven 3D layout planning with the diffusion process, enabling dynamic layout updates and precise spatial control during image generation, which significantly enhances spatial alignment and outperforms existing methods by 34.7% in complex scenes.
Authors:Tim Beyer, Yan Scholten, Leo Schwinn, Stephan Günnemann
Abstract:
To guarantee safe and robust deployment of large language models (LLMs) at scale, it is critical to accurately assess their adversarial robustness. Existing adversarial attacks typically target harmful responses in single-point, greedy generations, overlooking the inherently stochastic nature of LLMs. In this paper, we propose a novel framework for adversarial robustness evaluation that explicitly models the entire output distribution, including tail-risks, providing better estimates for model robustness at scale. By casting the attack process as a resource allocation problem between optimization and sampling, we determine compute-optimal tradeoffs and show that integrating sampling into existing attacks boosts ASR by up to 48% and improves efficiency by up to two orders of magnitude. Our framework also enables us to analyze how different attack algorithms affect output harm distributions. Surprisingly, we find that most optimization strategies have little effect on output harmfulness. Finally, we introduce a data-free proof-of-concept objective based on entropy-maximization to demonstrate how our tail-aware perspective enables new optimization targets. Overall, our findings highlight the importance of tail-aware attacks and evaluation protocols to accurately assess and strengthen LLM safety.
中文: 该框架通过建模完整输出分布和优化资源分配,显著提升大语言模型对抗鲁棒性评估的攻击成功率与效率,同时为尾部风险感知的新型优化目标提供实现路径。
English: The proposed framework enhances adversarial robustness evaluation of large language models by modeling the entire output distribution and optimizing resource allocation, significantly boosting attack success rates and efficiency while enabling new optimization targets.
Authors:Debodeep Banerjee, Burcu Sayin, Stefano Teso, Andrea Passerini
Abstract:
Medical decision-making is a critical task, where errors can result in serious, potentially life-threatening consequences. While full automation remains challenging, hybrid frameworks that combine machine intelligence with human oversight offer a practical alternative. In this paper, we present MedGellan, a lightweight, annotation-free framework that uses a Large Language Model (LLM) to generate clinical guidance from raw medical records, which is then used by a physician to predict diagnoses. MedGellan uses a Bayesian-inspired prompting strategy that respects the temporal order of clinical data. Preliminary experiments show that the guidance generated by the LLM with MedGellan improves diagnostic performance, particularly in recall and $F_1$ score.
中文: MedGellan是一种轻量级、无需标注的框架,通过大语言模型结合贝叶斯启发式提示策略,从原始医疗记录生成临床指导,在医生参与下显著提升了诊断召回率和F1分数。
English: MedGellan is a lightweight, annotation-free framework that employs a Large Language Model with Bayesian-inspired prompting to generate clinical guidance from raw medical records, enhancing diagnostic performance in recall and F1 score under physician oversight.
Authors:Yueqiao Jin, Kaixun Yang, Roberto Martinez-Maldonado, Dragan GaÅ¡eviÄ, Lixiang Yan
Abstract:
Academic writing increasingly involves multimodal tasks requiring students to integrate visual information and textual arguments. While generative AI (GenAI) tools, like ChatGPT, offer new pathways for supporting academic writing, little is known about how students' GenAI literacy influences their independent multimodal writing skills or how chatbot interaction strategies (passive reactive vs. proactive scaffolding) impact learning. This study examined 79 higher education students' multimodal academic writing performance using a comparative research design. Students completed writing tasks integrating visual data under two chatbot-assisted conditions (passive vs. proactive) and subsequently without AI assistance. Their writing performance was rigorously evaluated across five dimensions, including insightfulness, visual data integration, organisation, linguistic quality, and critical thinking. Ordinal logistic regression and correlation analyses revealed that higher levels of GenAI literacy significantly predicted stronger independent multimodal writing performance immediately after AI assistance removal, particularly for students using passive chatbots requiring active prompting. These results highlight the critical role of GenAI literacy and specific chatbot interaction strategies in shaping students' capacities for independent multimodal academic writing. Our findings emphasise the need for purposeful integration of GenAI literacy training into curricula and balancing external scaffolding support with autonomous learning opportunities. This research offers valuable recommendations for educators leveraging AI-enhanced pedagogies to optimise student writing outcomes and technological engagement strategies.
中文摘要:较高的生成式AI素养显著提升学生独立多模态写作能力,尤其是在使用被动式聊天机器人时,这凸显了将AI素养培训与平衡性教学支架融入课程的必要性。
English Summary: Higher GenAI literacy significantly enhances students' independent multimodal writing performance, especially when using passive chatbots, underscoring the need for integrating GenAI literacy training and balanced scaffolding in curricula.
Authors:Yuanhe Tian, Chen Su, Junwen Duan, Yan Song
Abstract:
Visual question answering (VQA) in medical imaging aims to support clinical diagnosis by automatically interpreting complex imaging data in response to natural language queries. Existing studies typically rely on distinct visual and textual encoders to independently extract features from medical images and clinical questions, which are subsequently combined to generate answers. Specifically, in computed tomography (CT), such approaches are similar to the conventional practices in medical image analysis. However, these approaches pay less attention to the spatial continuity and inter-slice correlations in the volumetric CT data, leading to fragmented and imprecise responses. In this paper, we propose a novel large language model (LLM)-based framework enhanced by a graph representation of salient features. Different from conventional multimodal encoding strategies, our approach constructs a cross-modal graph integrating both visual and textual features, treating individual CT slices and question tokens as nodes within the graph. We further leverage an attentive graph convolutional network to dynamically fuse information within this structure. The resulting aggregated graph features then serve as a soft prompt to guide a large language model in generating accurate answers. Extensive experiments on the M3D-VQA benchmark demonstrate that our approach consistently outperforms baselines across multiple evaluation metrics, offering more robust reasoning capabilities.
中文摘要:本文提出了一种基于大语言模型的新型医学视觉问答框架,通过构建融合CT图像和问题特征的跨模态图,利用注意力图卷积网络动态整合信息,从而在M3D-VQA基准测试中显著优于现有方法,实现了更精准的临床应答。
English Summary: This paper introduces a novel LLM-based framework for medical VQA that uses a cross-modal graph to integrate visual and textual features from CT scans and questions, achieving superior performance by capturing spatial continuity and inter-slice correlations through attentive graph convolutions.
Authors:Runcong Zhao, Artem Bobrov, Jiazheng Li, Yulan He
Abstract:
Effective feedback is essential for student learning but is time-intensive for teachers. We present LearnLens, a modular, LLM-based system that generates personalised, curriculum-aligned feedback in science education. LearnLens comprises three components: (1) an error-aware assessment module that captures nuanced reasoning errors; (2) a curriculum-grounded generation module that uses a structured, topic-linked memory chain rather than traditional similarity-based retrieval, improving relevance and reducing noise; and (3) an educator-in-the-loop interface for customisation and oversight. LearnLens addresses key challenges in existing systems, offering scalable, high-quality feedback that empowers both teachers and students.
中文: LearnLens是一个基于大型语言模型的模块化系统,通过错误感知评估、课程基础生成和教师监督,为科学教育提供个性化、与课程一致的反馈,解决可扩展性和质量难题。
English: LearnLens is a modular, LLM-based system that generates personalized, curriculum-aligned feedback in science education through error-aware assessment, curriculum-grounded generation, and educator oversight, addressing scalability and quality challenges.
Authors:Yan Scholten, Sophie Xhonneux, Leo Schwinn, Stephan Günnemann
Abstract:
Current unlearning methods for LLMs optimize on the private information they seek to remove by incorporating it into their training objectives. We argue this not only risks reinforcing exposure to sensitive data, it also fundamentally contradicts the principle of minimizing its use. As a remedy, we propose a novel unlearning method - Partial Model Collapse (PMC), which does not require unlearning targets in the unlearning objective. Our approach is inspired by recent observations that training generative models on their own generations leads to distribution collapse, effectively removing information from the model. Our core idea is to leverage this collapse for unlearning by triggering collapse partially on the sensitive data. We theoretically analyze that our approach converges to the desired outcome, i.e. the LLM unlearns the information in the forget set. We empirically demonstrate that PMC overcomes two key limitations of existing unlearning approaches that explicitly optimize on unlearning targets, and more effectively removes private information from model outputs. Overall, our contributions represent an important step toward more comprehensive unlearning that aligns with real-world privacy constraints. Code available at https://www.cs.cit.tum.de/daml/partial-model-collapse/.
中文摘要:现有大语言模型遗忘方法因在训练中包含敏感数据而存在强化暴露风险,而我们提出的部分模型崩溃方法无需遗忘目标即可有效移除私有信息。
English Summary: Current LLM unlearning methods risk reinforcing sensitive data by including it in training, but our proposed Partial Model Collapse method effectively removes private information without requiring unlearning targets in the objective.
Authors:Yufan Chen, Daoyuan Wu, Juantao Zhong, Zicheng Zhang, Debin Gao, Shuai Wang, Yingjiu Li, Ning Liu
Abstract:
Malware Family Classification (MFC) aims to identify the fine-grained family (e.g., GuLoader or BitRAT) to which a potential malware sample belongs, in contrast to malware detection or sample classification that predicts only an Yes/No. Accurate family identification can greatly facilitate automated sample labeling and understanding on crowdsourced malware analysis platforms such as VirusTotal and MalwareBazaar, which generate vast amounts of data daily. In this paper, we explore and assess the feasibility of using traditional binary string features for MFC in the new era of large language models (LLMs) and Retrieval-Augmented Generation (RAG). Specifically, we investigate how Family-Specific String (FSS) features could be utilized in a manner similar to RAG to facilitate MFC. To this end, we develop a curated evaluation framework covering 4,347 samples from 67 malware families, extract and analyze over 25 million strings, and conduct detailed ablation studies to assess the impact of different design choices in four major modules.
中文: 本研究探讨了在新一代大语言模型和检索增强生成技术背景下,利用传统二进制字符串特征进行恶意软件家族分类的可行性,并通过针对67个家族的详细消融实验评估了其效果。
English: This study investigates the feasibility of using traditional binary string features for malware family classification by adapting them through a Retrieval-Augmented Generation approach, evaluating their effectiveness across 67 families with detailed ablation studies.
Authors:Leyan Xue, Zongbo Han, Guangyu Wang, Qinghua Hu, Mingyue Cheng, Changqing Zhang
Abstract:
Vision-Language Models (VLMs) like CLIP achieve cross-modal semantic alignment through contrastive learning, exhibiting robust zero-shot generalization. Traditional prompt engineering, however, predominantly relies on coarse-grained category labels, neglecting fine-grained local semantics. Existing approaches assume that VLMs inherently recognize localized visual details and attempt to enhance classification by augmenting text prompts with attribute descriptors generated by large language models. However, our systematic experiments reveal critical limitations: CLIP's strong bias toward global image patterns hinders its ability to process localized visual descriptors. To address this fundamental constraint, we propose a simple, effective, and plug-and-play solution that enables CLIP to ``See Both the Forest and the Trees." Specifically, we employ stochastic multi-crop augmentation to activate CLIP's latent capacity for localized feature analysis. By cropping only partial regions, the approach effectively constrains the model's receptive field and recalibrates its attention mechanism, thereby mitigating its inherent bias. We evaluate the proposed method under zero-shot, few-shot, and test-time adaptation settings, and extensive experiments demonstrate that D&D achieves promising performance.
中文摘要:本研究揭示了CLIP对全局图像模式的偏好限制了其利用局部视觉特征的能力,并提出一种随机多裁剪方法激活其细粒度分析潜力,在多种测试场景中均实现了性能提升。
English Summary: This study reveals that CLIP's bias towards global image patterns limits its use of local visual descriptors and proposes a stochastic multi-crop method to activate its latent capacity for fine-grained analysis, achieving improved performance across multiple settings.
Authors:Zizhao Wang, Kaixin Wang, Li Zhao, Peter Stone, Jiang Bian
Abstract:
World models aim to capture the dynamics of the environment, enabling agents to predict and plan for future states. In most scenarios of interest, the dynamics are highly centered on interactions among objects within the environment. This motivates the development of world models that operate on object-centric rather than monolithic representations, with the goal of more effectively capturing environment dynamics and enhancing compositional generalization. However, the development of object-centric world models has largely been explored in environments with limited visual complexity (such as basic geometries). It remains underexplored whether such models can generalize to more complex settings with diverse textures and cluttered scenes. In this paper, we fill this gap by introducing Dyn-O, an enhanced structured world model built upon object-centric representations. Compared to prior work in object-centric representations, Dyn-O improves in both learning representations and modeling dynamics. On the challenging Procgen games, we find that our method can learn object-centric world models directly from pixel observations, outperforming DreamerV3 in rollout prediction accuracy. Furthermore, by decoupling object-centric features into dynamics-agnostic and dynamics-aware components, we enable finer-grained manipulation of these features and generate more diverse imagined trajectories.
中文:世界模型旨在捕捉环境动态,而像Dyn-O这样的以对象为中心的方法通过改进表示学习和动态建模,在复杂视觉场景中提升了预测准确性和泛化能力。
English: World models capture environmental dynamics, and object-centric approaches like Dyn-O enhance prediction accuracy and generalization in complex visual settings by improving representation learning and dynamics modeling.
Authors:Xinyang Li, Gen Li, Zhihui Lin, Yichen Qian, GongXin Yao, Weinan Jia, Aowen Wang, Weihua Chen, Fan Wang
Abstract:
Talking head generation with arbitrary identities and speech audio remains a crucial problem in the realm of the virtual metaverse. Recently, diffusion models have become a popular generative technique in this field with their strong generation capabilities. However, several challenges remain for diffusion-based methods: 1) inefficient inference and visual artifacts caused by the implicit latent space of Variational Auto-Encoders (VAE), which complicates the diffusion process; 2) a lack of authentic facial expressions and head movements due to inadequate multi-modal information fusion. In this paper, MoDA handles these challenges by: 1) defining a joint parameter space that bridges motion generation and neural rendering, and leveraging flow matching to simplify diffusion learning; 2) introducing a multi-modal diffusion architecture to model the interaction among noisy motion, audio, and auxiliary conditions, enhancing overall facial expressiveness. In addition, a coarse-to-fine fusion strategy is employed to progressively integrate different modalities, ensuring effective feature fusion. Experimental results demonstrate that MoDA improves video diversity, realism, and efficiency, making it suitable for real-world applications. Project Page: https://lixinyyang.github.io/MoDA.github.io/
中文:MoDA通过构建联合参数空间与流匹配技术及多模态扩散架构,解决了基于扩散模型的说话头生成中的效率低下与表情失真问题,显著提升了真实感与实用性。
English: MoDA addresses inefficiencies and expression limitations in diffusion-based talking head generation by introducing a joint parameter space with flow matching and a multi-modal diffusion architecture, enhancing realism and efficiency.
Authors:Md Mahade Hasan, Muhammad Waseem, Kai-Kristian Kemell, Jussi Rasku, Juha Ala-Rantala, Pekka Abrahamsson
Abstract:
The recent advancements of Small Language Models (SLMs) have opened new possibilities for efficient code generation. SLMs offer lightweight and cost-effective alternatives to Large Language Models (LLMs), making them attractive for use in resource-constrained environments. However, empirical understanding of SLMs, particularly their capabilities, limitations, and performance trade-offs in code generation remains limited. This study presents a comprehensive empirical evaluation of 20 open-source SLMs ranging from 0.4B to 10B parameters on five diverse code-related benchmarks (HumanEval, MBPP, Mercury, HumanEvalPack, and CodeXGLUE). The models are assessed along three dimensions: i) functional correctness of generated code, ii) computational efficiency and iii) performance across multiple programming languages. The findings of this study reveal that several compact SLMs achieve competitive results while maintaining a balance between performance and efficiency, making them viable for deployment in resource-constrained environments. However, achieving further improvements in accuracy requires switching to larger models. These models generally outperform their smaller counterparts, but they require much more computational power. We observe that for 10% performance improvements, models can require nearly a 4x increase in VRAM consumption, highlighting a trade-off between effectiveness and scalability. Besides, the multilingual performance analysis reveals that SLMs tend to perform better in languages such as Python, Java, and PHP, while exhibiting relatively weaker performance in Go, C++, and Ruby. However, statistical analysis suggests these differences are not significant, indicating a generalizability of SLMs across programming languages. Based on the findings, this work provides insights into the design and selection of SLMs for real-world code generation tasks.
中文: 小语言模型在资源受限环境中能以较低成本实现高效的代码生成,尽管更大模型能提升准确性但需更高算力,且各编程语言间性能差异不显著,展现出良好的通用性。
English: Small Language Models (SLMs) offer efficient code generation with competitive performance in resource-constrained settings, though larger models yield higher accuracy at significant computational cost, while maintaining generalizability across programming languages.
Authors:Ya-Ting Yang, Quanyan Zhu
Abstract:
Cyber warfare has become a critical dimension of modern conflict, driven by society's increasing dependence on interconnected digital and physical infrastructure. Effective cyber defense often requires decision-making at different echelons, where the tactical layer focuses on detailed actions such as techniques, tactics, and procedures, while the strategic layer addresses long-term objectives and coordinated planning. Modeling these interactions at different echelons remains challenging due to the dynamic, large-scale, and interdependent nature of cyber environments. To address this, we propose a multi-resolution dynamic game framework in which the tactical layer captures fine-grained interactions using high-resolution extensive-form game trees, while the strategic layer is modeled as a Markov game defined over lower-resolution states abstracted from those game trees. This framework supports scalable reasoning and planning across different levels of abstraction through zoom-in and zoom-out operations that adjust the granularity of the modeling based on operational needs. A case study demonstrates how the framework works and its effectiveness in improving the defender's strategic advantage.
中文: 该多分辨率动态博弈框架通过精细化博弈树建模战术交互,并利用抽象马尔可夫博弈实现战略协调,从而提升网络防御的可扩展性和决策优势。
English: The proposed multi-resolution dynamic game framework enables scalable cyber defense planning by modeling tactical interactions with detailed game trees and strategic coordination through abstracted Markov games, enhancing defenders' operational adaptability.
Authors:Hui Xie, Yuhe Liu, Shaoqi Yang, Jinyang Guo, Yufei Guo, Yuqing Ma, Jiaxin Chen, Jiaheng Liu, Xianglong Liu
Abstract:
While deep spiking neural networks (SNNs) demonstrate superior performance, their deployment on resource-constrained neuromorphic hardware still remains challenging. Network pruning offers a viable solution by reducing both parameters and synaptic operations (SynOps) to facilitate the edge deployment of SNNs, among which search-based pruning methods search for the SNNs structure after pruning. However, existing search-based methods fail to directly use SynOps as the constraint because it will dynamically change in the searching process, resulting in the final searched network violating the expected SynOps target. In this paper, we introduce a novel SNN pruning framework called SPEAR, which leverages reinforcement learning (RL) technique to directly use SynOps as the searching constraint. To avoid the violation of SynOps requirements, we first propose a SynOps prediction mechanism called LRE to accurately predict the final SynOps after search. Observing SynOps cannot be explicitly calculated and added to constrain the action in RL, we propose a novel reward called TAR to stabilize the searching. Extensive experiments show that our SPEAR framework can effectively compress SNN under specific SynOps constraint.
中文摘要:SPEAR框架通过强化学习技术直接以突触操作为约束进行SNN剪枝,采用精确预测机制和新型奖励方法,在满足资源限制的同时有效压缩网络规模。
English Summary: The SPEAR framework uses reinforcement learning to directly constrain synaptic operations during SNN pruning, introducing accurate prediction and reward mechanisms to ensure compliance with resource targets while effectively compressing networks.
Authors:Yuxuan Wang, Tianwei Cao, Huayu Zhang, Zhongjiang He, Kongming Liang, Zhanyu Ma
Abstract:
Image generation has achieved remarkable progress with the development of large-scale text-to-image models, especially diffusion-based models. However, generating human images with plausible details, such as faces or hands, remains challenging due to insufficient supervision of local regions during training. To address this issue, we propose FairHuman, a multi-objective fine-tuning approach designed to enhance both global and local generation quality fairly. Specifically, we first construct three learning objectives: a global objective derived from the default diffusion objective function and two local objectives for hands and faces based on pre-annotated positional priors. Subsequently, we derive the optimal parameter updating strategy under the guidance of the Minimum Potential Delay (MPD) criterion, thereby attaining fairness-ware optimization for this multi-objective problem. Based on this, our proposed method can achieve significant improvements in generating challenging local details while maintaining overall quality. Extensive experiments showcase the effectiveness of our method in improving the performance of human image generation under different scenarios.
中文摘要:FairHuman方法通过多目标微调策略,在保持图像整体质量的同时显著提升了人像生成中面部和手部等局部细节的真实感,实现了全局与局部优化的公平协调。
English Summary: The proposed FairHuman method enhances human image generation by employing a multi-objective fine-tuning approach that balances global and local detail optimization, significantly improving facial and hand realism while preserving overall image quality.
Authors:Tan Pan, Zhaorui Tan, Kaiyu Guo, Dongli Xu, Weidi Xu, Chen Jiang, Xin Guo, Yuan Qi, Yuan Cheng
Abstract:
3D medical image self-supervised learning (mSSL) holds great promise for medical analysis. Effectively supporting broader applications requires considering anatomical structure variations in location, scale, and morphology, which are crucial for capturing meaningful distinctions. However, previous mSSL methods partition images with fixed-size patches, often ignoring the structure variations. In this work, we introduce a novel perspective on 3D medical images with the goal of learning structure-aware representations. We assume that patches within the same structure share the same semantics (semantic consistency) while those from different structures exhibit distinct semantics (semantic discrepancy). Based on this assumption, we propose an mSSL framework named $S^2DC$, achieving Structure-aware Semantic Discrepancy and Consistency in two steps. First, $S^2DC$ enforces distinct representations for different patches to increase semantic discrepancy by leveraging an optimal transport strategy. Second, $S^2DC$ advances semantic consistency at the structural level based on neighborhood similarity distribution. By bridging patch-level and structure-level representations, $S^2DC$ achieves structure-aware representations. Thoroughly evaluated across 10 datasets, 4 tasks, and 3 modalities, our proposed method consistently outperforms the state-of-the-art methods in mSSL.
中文摘要:本研究提出的$S^2DC$框架通过增强不同解剖结构间的语义差异性和保持相同结构内的语义一致性,实现了三维医学图像的结构感知自监督学习,在多项医学影像任务和数据集中均展现出超越现有方法的优异性能。
English Summary: The proposed $S^2DC$ framework introduces structure-aware self-supervised learning for 3D medical images by enforcing semantic discrepancy between different anatomical structures and semantic consistency within the same structures, achieving superior performance across multiple medical imaging tasks and datasets.
Authors:Xiao Li, Liangji Zhu, Anand Rangarajan, Sanjay Ranka
Abstract:
Generative models have demonstrated strong performance in conditional settings and can be viewed as a form of data compression, where the condition serves as a compact representation. However, their limited controllability and reconstruction accuracy restrict their practical application to data compression. In this work, we propose an efficient latent diffusion framework that bridges this gap by combining a variational autoencoder with a conditional diffusion model. Our method compresses only a small number of keyframes into latent space and uses them as conditioning inputs to reconstruct the remaining frames via generative interpolation, eliminating the need to store latent representations for every frame. This approach enables accurate spatiotemporal reconstruction while significantly reducing storage costs. Experimental results across multiple datasets show that our method achieves up to 10 times higher compression ratios than rule-based state-of-the-art compressors such as SZ3, and up to 63 percent better performance than leading learning-based methods under the same reconstruction error.
Chinese: 本研究提出了一种高效的潜在扩散框架,通过关键帧进行生成插值来增强数据压缩,在相同重建误差下实现了比现有最优方法显著更高的压缩比和性能提升。
English: This study introduces an efficient latent diffusion framework that enhances data compression by using keyframes for generative interpolation, achieving significantly higher compression ratios and better performance than state-of-the-art methods.
Authors:Lu Yan, Zhuo Zhang, Xiangzhe Xu, Shengwei An, Guangyu Shen, Zhou Xuan, Xuan Chen, Xiangyu Zhang
Abstract:
Large language models (LLMs) have democratized software development, reducing the expertise barrier for programming complex applications. This accessibility extends to malicious software development, raising significant security concerns. While LLM providers have implemented alignment mechanisms to prevent direct generation of overtly malicious code, these safeguards predominantly evaluate individual prompts in isolation, overlooking a critical vulnerability: malicious operations can be systematically decomposed into benign-appearing sub-tasks. In this paper, we introduce the Malware Generation Compiler (MGC), a novel framework that leverages this vulnerability through modular decomposition and alignment-evasive generation. MGC employs a specialized Malware Description Intermediate Representation (MDIR) to bridge high-level malicious intents and benign-appearing code snippets. Extensive evaluation demonstrates that our attack reliably generates functional malware across diverse task specifications and categories, outperforming jailbreaking methods by +365.79% and underground services by +78.07% in correctness on three benchmark datasets. Case studies further show that MGC can reproduce and even enhance 16 real-world malware samples. This work provides critical insights for security researchers by exposing the risks of compositional attacks against aligned AI systems. Demonstrations are available at https://sites.google.com/view/malware-generation-compiler.
中文摘要:恶意软件生成编译器(MGC)框架通过将恶意操作分解为看似良性的子任务,利用了大语言模型安全防护的关键漏洞,能够可靠生成功能性恶意软件,其正确性显著超越现有越狱方法和地下服务。
English Summary: The Malware Generation Compiler (MGC) framework exploits a critical vulnerability in LLM safeguards by systematically breaking down malicious operations into benign-looking sub-tasks, successfully generating functional malware that significantly outperforms existing methods.
Authors:Yang Li, Youssef Emad, Karthik Padthe, Jack Lanchantin, Weizhe Yuan, Thao Nguyen, Jason Weston, Shang-Wen Li, Dong Wang, Ilia Kulikov, Xian Li
Abstract:
Recent work has shown that distilling reasoning traces from a larger teacher model via supervised finetuning outperforms reinforcement learning with the smaller student model alone (Guo et al. 2025). However, there has not been a systematic study of what kind of reasoning demonstrations from the teacher are most effective in improving the student model's reasoning capabilities. In this work we curate high-quality "NaturalThoughts" by selecting reasoning traces from a strong teacher model based on a large pool of questions from NaturalReasoning (Yuan et al. 2025). We first conduct a systematic analysis of factors that affect distilling reasoning capabilities, in terms of sample efficiency and scalability for general reasoning tasks. We observe that simply scaling up data size with random sampling is a strong baseline with steady performance gains. Further, we find that selecting difficult examples that require more diverse reasoning strategies is more sample-efficient to transfer the teacher model's reasoning skills. Evaluated on both Llama and Qwen models, training with NaturalThoughts outperforms existing reasoning datasets such as OpenThoughts, LIMO, etc. on general STEM reasoning benchmarks including GPQA-Diamond, MMLU-Pro and SuperGPQA.
中文: 近期研究表明,通过监督微调从大型教师模型中提取推理轨迹优于仅使用小型学生模型的强化学习,系统分析发现选择需要多样化推理策略的难题能更高效地迁移推理能力。
English: Recent research demonstrates that distilling reasoning traces from a larger teacher model through supervised fine-tuning surpasses reinforcement learning with smaller student models, and systematic analysis reveals that selecting challenging examples requiring diverse reasoning strategies enhances sample efficiency in transferring reasoning skills.
Authors:Juan Chen, Baolong Bi, Wei Zhang, Jingyan Sui, Xiaofei Zhu, Yuanzhuo Wang, Lingrui Mei, Shenghua Liu
Abstract:
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating their parametric knowledge with external retrieved content. However, knowledge conflicts caused by internal inconsistencies or noisy retrieved content can severely undermine the generation reliability of RAG systems.In this work, we argue that LLMs should rethink all evidence, including both retrieved content and internal knowledge, before generating responses.We propose CARE-RAG (Conflict-Aware and Reliable Evidence for RAG), a novel framework that improves trustworthiness through Conflict-Driven Summarization of all available evidence.CARE-RAG first derives parameter-aware evidence by comparing parameter records to identify diverse internal perspectives. It then refines retrieved evidences to produce context-aware evidence, removing irrelevant or misleading content. To detect and summarize conflicts, we distill a 3B LLaMA3.2 model to perform conflict-driven summarization, enabling reliable synthesis across multiple sources.To further ensure evaluation integrity, we introduce a QA Repair step to correct outdated or ambiguous benchmark answers.Experiments on revised QA datasets with retrieval data show that CARE-RAG consistently outperforms strong RAG baselines, especially in scenarios with noisy or conflicting evidence.
中文: CARE-RAG框架通过冲突驱动的证据整合,在噪声或冲突场景下显著提升检索增强生成系统的可靠性,实验证明其优于现有基线方法。
English: CARE-RAG is a novel framework that enhances RAG system reliability by performing conflict-driven summarization of both internal and retrieved evidence, outperforming baselines in handling noisy or conflicting information.
Authors:Can Cui, ZIye Jia, Chao Dong, Qihui Wu
Abstract:
Unmanned aerial vehicles (UAVs) are recognized as a promising candidate for the multi-access edge computing (MEC) in the future sixth generation communication networks. However, the aerial eavesdropping UAVs (EUAVs) pose a significant security threat to the data offloading. In this paper, we investigate a robust MEC scenario with multiple service UAVs (SUAVs) towards the potential eavesdropping from the EUAV, in which the random parameters such as task complexities are considered in the practical applications. In detail, the problem is formulated to optimize the deployment positions of SUAVs, the connection relationships between GUs and SUAVs, and the offloading ratios. With the uncertain task complexities, the corresponding chance constraints are constructed under the uncertainty set, which is tricky to deal with. Therefore, we first optimize the pre-deployment of SUAVs by the K-means algorithm. Then, the distributionally robust optimization method is employed, and the conditional value at risk is utilized to transform the chance constraints into convex forms, which can be solved via convex toolkits. Finally, the simulation results show that with the consideration of uncertainties, just 5% more energy is consumed compared with the ideal circumstance, which verifies the robustness of the proposed algorithms.
中文摘要:本文针对无人机边缘计算中的空中窃听威胁,提出了一种结合K-means算法和分布式鲁棒优化的解决方案,在考虑任务复杂度不确定性的情况下,仅比理想场景多消耗5%能量,验证了算法的鲁棒性。
English Summary: This paper proposes a robust optimization framework for multi-access edge computing using UAVs to counter aerial eavesdropping threats, employing K-means and distributionally robust optimization methods that demonstrate only 5% additional energy consumption despite uncertainties.
Authors:Xingyu Su, Xiner Li, Masatoshi Uehara, Sunwoo Kim, Yulai Zhao, Gabriele Scalia, Ehsan Hajiramezanali, Tommaso Biancalani, Degui Zhi, Shuiwang Ji
Abstract:
We address the problem of fine-tuning diffusion models for reward-guided generation in biomolecular design. While diffusion models have proven highly effective in modeling complex, high-dimensional data distributions, real-world applications often demand more than high-fidelity generation, requiring optimization with respect to potentially non-differentiable reward functions such as physics-based simulation or rewards based on scientific knowledge. Although RL methods have been explored to fine-tune diffusion models for such objectives, they often suffer from instability, low sample efficiency, and mode collapse due to their on-policy nature. In this work, we propose an iterative distillation-based fine-tuning framework that enables diffusion models to optimize for arbitrary reward functions. Our method casts the problem as policy distillation: it collects off-policy data during the roll-in phase, simulates reward-based soft-optimal policies during roll-out, and updates the model by minimizing the KL divergence between the simulated soft-optimal policy and the current model policy. Our off-policy formulation, combined with KL divergence minimization, enhances training stability and sample efficiency compared to existing RL-based methods. Empirical results demonstrate the effectiveness and superior reward optimization of our approach across diverse tasks in protein, small molecule, and regulatory DNA design.
中文摘要:本研究提出了一种基于迭代蒸馏的微调框架,使扩散模型能够在生物分子设计中优化任意奖励函数,相比基于强化学习的方法,显著提升了训练稳定性和样本效率。
English Summary: This study introduces an iterative distillation-based fine-tuning framework that enables diffusion models to optimize for arbitrary reward functions in biomolecular design, enhancing stability and sample efficiency compared to RL-based methods.
Authors:Tanmay Vilas Samak, Chinmay Vilas Samak, Bing Li, Venkat Krovi
Abstract:
Simulation frameworks have been key enablers for the development and validation of autonomous driving systems. However, existing methods struggle to comprehensively address the autonomy-oriented requirements of balancing: (i) dynamical fidelity, (ii) photorealistic rendering, (iii) context-relevant scenario orchestration, and (iv) real-time performance. To address these limitations, we present a unified framework for creating and curating high-fidelity digital twins to accelerate advancements in autonomous driving research. Our framework leverages a mix of physics-based and data-driven techniques for developing and simulating digital twins of autonomous vehicles and their operating environments. It is capable of reconstructing real-world scenes and assets (real2sim) with geometric and photorealistic accuracy and infusing them with various physical properties to enable real-time dynamical simulation of the ensuing driving scenarios. Additionally, it also incorporates a large language model (LLM) interface to flexibly edit the driving scenarios online via natural language prompts. We analyze the presented framework in terms of its fidelity, performance, and serviceability. Results indicate that our framework can reconstruct 3D scenes and assets with up to 97% structural similarity, while maintaining frame rates above 60 Hz. We also demonstrate that it can handle natural language prompts to generate diverse driving scenarios with up to 95% repeatability and 85% generalizability.
中文: 该统一框架通过融合物理建模与数据驱动技术,构建了用于自动驾驶的高保真数字孪生系统,在保持60Hz实时仿真的同时,实现了97%的场景重建精度和基于大语言模型的自然语言交互编辑功能。
English: The proposed unified framework creates high-fidelity digital twins for autonomous driving by combining physics-based and data-driven techniques, achieving real-time simulation with 97% structural accuracy and natural language scenario editing via LLMs.
Authors:Chinmay Vilas Samak, Tanmay Vilas Samak, Bing Li, Venkat Krovi
Abstract:
Simulation-based design, optimization, and validation of autonomous driving algorithms have proven to be crucial for their improvement over the years. Nevertheless, the ultimate measure of effectiveness is their successful transition from simulation to reality (sim2real). However, existing sim2real transfer methods struggle to address the autonomy-oriented requirements of balancing: (i) conditioned domain adaptation, (ii) robust performance with limited examples, (iii) modularity in handling multiple domain representations, and (iv) real-time performance. To alleviate these pain points, we present a unified framework for learning cross-domain adaptive representations through conditional latent diffusion for sim2real transferable autonomous driving algorithms. Our framework offers options to leverage: (i) alternate foundation models, (ii) a few-shot fine-tuning pipeline, and (iii) textual as well as image prompts for mapping across given source and target domains. It is also capable of generating diverse high-quality samples when diffusing across parameter spaces such as times of day, weather conditions, seasons, and operational design domains. We systematically analyze the presented framework and report our findings in terms of performance benchmarks and ablation studies, with critical quantitative metrics as well as insightful qualitative examples and remarks. Additionally, we demonstrate the serviceability of sim2real diffusion for autonomous driving using a behavioral cloning case study. Our experiments indicate that the proposed framework is capable of bridging the perceptual sim2real gap by over 40%, which highlights the potential of diffusion models in sim2real transfer.
Chinese: 本文提出了一种基于条件潜在扩散的统一框架,用于改进自动驾驶的仿真到现实迁移,通过解决领域适应、小样本学习、模块化和实时性需求,将感知层面的仿真与现实差距降低了超过40%。
English: This paper introduces a unified framework using conditional latent diffusion to enhance sim2real transfer for autonomous driving by addressing domain adaptation, few-shot learning, modularity, and real-time needs, achieving over 40% reduction in the perceptual sim2real gap.
Authors:Liu Li, Qiang Ma, Cheng Ouyang, Johannes C. Paetzold, Daniel Rueckert, Bernhard Kainz
Abstract:
Deep learning-based medical image segmentation techniques have shown promising results when evaluated based on conventional metrics such as the Dice score or Intersection-over-Union. However, these fully automatic methods often fail to meet clinically acceptable accuracy, especially when topological constraints should be observed, e.g., continuous boundaries or closed surfaces. In medical image segmentation, the correctness of a segmentation in terms of the required topological genus sometimes is even more important than the pixel-wise accuracy. Existing topology-aware approaches commonly estimate and constrain the topological structure via the concept of persistent homology (PH). However, these methods are difficult to implement for high dimensional data due to their polynomial computational complexity. To overcome this problem, we propose a novel and fast approach for topology-aware segmentation based on the Euler Characteristic ($Ï$). First, we propose a fast formulation for $Ï$ computation in both 2D and 3D. The scalar $Ï$ error between the prediction and ground-truth serves as the topological evaluation metric. Then we estimate the spatial topology correctness of any segmentation network via a so-called topological violation map, i.e., a detailed map that highlights regions with $Ï$ errors. Finally, the segmentation results from the arbitrary network are refined based on the topological violation maps by a topology-aware correction network. Our experiments are conducted on both 2D and 3D datasets and show that our method can significantly improve topological correctness while preserving pixel-wise segmentation accuracy.
中文: 基于深度学习的医学图像分割虽在传统指标上表现良好,但常无法满足临床拓扑精度要求,而我们基于欧拉特性的新方法能高效提升拓扑正确性,同时保持分割精度。
English: Deep learning-based medical image segmentation often fails to meet clinical topological accuracy despite strong conventional metric performance, but our novel Euler Characteristic-based method efficiently improves topological correctness while maintaining segmentation accuracy.
Authors:Yi Zhang, Erik Leo HaÃ, Kuo-Yi Chao, Nenad Petrovic, Yinglei Song, Chengdong Wu, Alois Knoll
Abstract:
Autonomous driving systems face significant challenges in achieving human-like adaptability, robustness, and interpretability in complex, open-world environments. These challenges stem from fragmented architectures, limited generalization to novel scenarios, and insufficient semantic extraction from perception. To address these limitations, we propose a unified Perception-Language-Action (PLA) framework that integrates multi-sensor fusion (cameras, LiDAR, radar) with a large language model (LLM)-augmented Vision-Language-Action (VLA) architecture, specifically a GPT-4.1-powered reasoning core. This framework unifies low-level sensory processing with high-level contextual reasoning, tightly coupling perception with natural language-based semantic understanding and decision-making to enable context-aware, explainable, and safety-bounded autonomous driving. Evaluations on an urban intersection scenario with a construction zone demonstrate superior performance in trajectory tracking, speed prediction, and adaptive planning. The results highlight the potential of language-augmented cognitive frameworks for advancing the safety, interpretability, and scalability of autonomous driving systems.
中文: 提出的感知-语言-行动框架通过多传感器融合与GPT-4.1驱动的推理核心相结合,实现了具备环境感知能力的自动驾驶,在城市场景中通过改进的轨迹跟踪和自适应规划展现出优越性能。
English: The proposed Perception-Language-Action framework integrates multi-sensor fusion with a GPT-4.1-powered reasoning core to enable context-aware autonomous driving, demonstrating superior performance in urban scenarios through enhanced trajectory tracking and adaptive planning.
Authors:Sara Cavallero, Fabio Saggese, Junya Shiraishi, Israel Leyva-Mayorga, Shashi Raj Pandey, Chiara Buratti, Petar Popovski
Abstract:
This paper introduces a dual-mode communication framework for wireless devices that integrates query-driven (pull) and event-driven (push) transmissions within a unified time-frame structure. Devices typically respond to information requests in pull mode, but if an anomaly is detected, they preempt the regular response to report the critical condition. Additionally, push-based communication is used to proactively send critical data without waiting for a request. This adaptive approach ensures timely, context-aware, and efficient data delivery across different network conditions. To achieve high energy efficiency, we incorporate a wake-up radio mechanism and we design a tailored medium access control (MAC) protocol that supports data traffic belonging to the different communication classes. A comprehensive system-level analysis is conducted, accounting for the wake-up control operation and evaluating three key performance metrics: the success probability of anomaly reports (push traffic), the success probability of query responses (pull traffic) and the total energy consumption. Numerical results characterize the system's behavior and highlight the inherent trade-off in success probabilities between push- and pull-based traffic as a function of allocated communication resources. Our analysis demonstrates that the proposed approach reduces energy consumption by up to 30% compared to a traditional approach, while maintaining reliable support for both communication paradigms.
本文提出了一种双模无线通信框架,在统一结构中整合了查询驱动和事件驱动的传输模式,通过自适应MAC协议和唤醒无线电机制实现了高达30%的能耗降低,同时保证了两种通信模式的可靠性能。
This paper presents a dual-mode wireless communication framework that combines pull and push transmissions within a unified structure, achieving up to 30% energy savings through an adaptive MAC protocol and wake-up radio while maintaining reliable performance for both traffic types.
Authors:Murong Xu, Tamaz Amiranashvili, Fernando Navarro, Maksym Fritsak, Ibrahim Ethem Hamamci, Suprosanna Shit, Bastian Wittmann, Sezgin Er, Sebastian M. Christ, Ezequiel de la Rosa, Julian Deseoe, Robert Graf, Hendrik Möller, Anjany Sekuboyina, Jan C. Peeken, Sven Becker, Giulia Baldini, Johannes Haubold, Felix Nensa, René Hosch, Nikhil Mirajkar, Saad Khalid, Stefan Zachow, Marc-André Weber, Georg Langs, Jakob Wasserthal, Mehmet Kemal Ozdemir, Andrey Fedorov, Ron Kikinis, Stephanie Tanadini-Lang, Jan S. Kirschke, Stephanie E. Combs, Bjoern Menze
Abstract:
Accurate delineation of anatomical structures in volumetric CT scans is crucial for diagnosis and treatment planning. While AI has advanced automated segmentation, current approaches typically target individual structures, creating a fragmented landscape of incompatible models with varying performance and disparate evaluation protocols. Foundational segmentation models address these limitations by providing a holistic anatomical view through a single model. Yet, robust clinical deployment demands comprehensive training data, which is lacking in existing whole-body approaches, both in terms of data heterogeneity and, more importantly, anatomical coverage. In this work, rather than pursuing incremental optimizations in model architecture, we present CADS, an open-source framework that prioritizes the systematic integration, standardization, and labeling of heterogeneous data sources for whole-body CT segmentation. At its core is a large-scale dataset of 22,022 CT volumes with complete annotations for 167 anatomical structures, representing a significant advancement in both scale and coverage, with 18 times more scans than existing collections and 60% more distinct anatomical targets. Building on this diverse dataset, we develop the CADS-model using established architectures for accessible and automated full-body CT segmentation. Through comprehensive evaluation across 18 public datasets and an independent real-world hospital cohort, we demonstrate advantages over SoTA approaches. Notably, thorough testing of the model's performance in segmentation tasks from radiation oncology validates its direct utility for clinical interventions. By making our large-scale dataset, our segmentation models, and our clinical software tool publicly available, we aim to advance robust AI solutions in radiology and make comprehensive anatomical analysis accessible to clinicians and researchers alike.
中文: CADS框架通过整合大规模数据集和标准化标注,开发了全身CT分割模型,在广泛验证中展现出优于现有技术的性能,并公开资源以推动临床人工智能应用。
English: The CADS framework introduces a large-scale dataset and model for whole-body CT segmentation, offering comprehensive anatomical coverage and demonstrating superior performance over state-of-the-art methods through extensive validation, with open access to advance clinical AI applications.
Authors:Ruifeng Yuan, Chenghao Xiao, Sicong Leng, Jianyu Wang, Long Li, Weiwen Xu, Hou Pong Chan, Deli Zhao, Tingyang Xu, Zhongyu Wei, Hao Zhang, Yu Rong
Abstract:
Reinforcement learning has proven its effectiveness in enhancing the reasoning capabilities of large language models. Recent research efforts have progressively extended this paradigm to multimodal reasoning tasks. Due to the inherent complexity and diversity of multimodal tasks, especially in semantic content and problem formulations, existing models often exhibit unstable performance across various domains and difficulty levels. To address these limitations, we propose VL-Cogito, an advanced multimodal reasoning model trained via a novel multi-stage Progressive Curriculum Reinforcement Learning (PCuRL) framework. PCuRL systematically guides the model through tasks of gradually increasing difficulty, substantially improving its reasoning abilities across diverse multimodal contexts. The framework introduces two key innovations: (1) an online difficulty soft weighting mechanism, dynamically adjusting training difficulty across successive RL training stages; and (2) a dynamic length reward mechanism, which encourages the model to adaptively regulate its reasoning path length according to task complexity, thus balancing reasoning efficiency with correctness. Experimental evaluations demonstrate that VL-Cogito consistently matches or surpasses existing reasoning-oriented models across mainstream multimodal benchmarks spanning mathematics, science, logic, and general understanding, validating the effectiveness of our approach.
中文摘要:VL-Cogito通过创新的渐进课程强化学习框架,动态调整任务难度与推理路径长度,在数学、科学、逻辑等多模态推理基准测试中实现了最优性能。
English Summary: VL-Cogito is a multimodal reasoning model trained with a novel Progressive Curriculum Reinforcement Learning framework that dynamically adjusts task difficulty and reasoning path length, achieving state-of-the-art performance across diverse benchmarks.
Authors:Yaolun Zhang, Xiaogeng Liu, Chaowei Xiao
Abstract:
Large Language Models (LLMs) have demonstrated the ability to solve a wide range of practical tasks within multi-agent systems. However, existing human-designed multi-agent frameworks are typically limited to a small set of pre-defined scenarios, while current automated design methods suffer from several limitations, such as the lack of tool integration, dependence on external training data, and rigid communication structures. In this paper, we propose MetaAgent, a finite state machine based framework that can automatically generate a multi-agent system. Given a task description, MetaAgent will design a multi-agent system and polish it through an optimization algorithm. When the multi-agent system is deployed, the finite state machine will control the agent's actions and the state transitions. To evaluate our framework, we conduct experiments on both text-based tasks and practical tasks. The results indicate that the generated multi-agent system surpasses other auto-designed methods and can achieve a comparable performance with the human-designed multi-agent system, which is optimized for those specific tasks.
Chinese: MetaAgent是一个基于有限状态机的自动化框架,能够设计和优化多智能体系统,在克服通信僵化和工具整合不足等局限的同时,实现了与人工设计系统相媲美的性能。
English: MetaAgent is an automated framework that designs and optimizes multi-agent systems using finite state machines, achieving performance comparable to human-designed systems while overcoming limitations like rigid communication and lack of tool integration.
Authors:Rawan Al-Matham, Kareem Darwish, Raghad Al-Rasheed, Waad Alshammari, Muneera Alhoshan, Amal Almazrua, Asma Al Wazrah, Mais Alheraki, Firoj Alam, Preslav Nakov, Norah Alzahrani, Eman alBilali, Nizar Habash, Abdelrahman El-Sheikh, Muhammad Elmallah, Haonan Li, Hamdy Mubarak, Mohamed Anwar, Zaid Alyafeai, Ahmed Abdelali, Nora Altwairesh, Maram Hasanain, Abdulmohsen Al Thubaity, Shady Shehata, Bashar Alhafni, Injy Hamed, Go Inoue, Khalid Elmadani, Ossama Obeid, Fatima Haouari, Tamer Elsayed, Emad Alghamdi, Khalid Almubarak, Saied Alshahrani, Ola Aljarrah, Safa Alajlan, Areej Alshaqarawi, Maryam Alshihri, Sultana Alghurabi, Atikah Alzeghayer, Afrah Altamimi, Abdullah Alfaifi, Abdulrahman AlOsaimy
Abstract:
The impressive advancement of Large Language Models (LLMs) in English has not been matched across all languages. In particular, LLM performance in Arabic lags behind, due to data scarcity, linguistic diversity of Arabic and its dialects, morphological complexity, etc. Progress is further hindered by the quality of Arabic benchmarks, which typically rely on static, publicly available data, lack comprehensive task coverage, or do not provide dedicated platforms with blind test sets. This makes it challenging to measure actual progress and to mitigate data contamination. Here, we aim to bridge these gaps. In particular, we introduce BALSAM, a comprehensive, community-driven benchmark aimed at advancing Arabic LLM development and evaluation. It includes 78 NLP tasks from 14 broad categories, with 52K examples divided into 37K test and 15K development, and a centralized, transparent platform for blind evaluation. We envision BALSAM as a unifying platform that sets standards and promotes collaborative research to advance Arabic LLM capabilities.
中文: 由于数据稀缺和语言复杂性,大语言模型在阿拉伯语中表现欠佳,为此推出了BALSAM这一社区驱动的基准和平台,旨在标准化评估并促进合作研究,以提升阿拉伯语大语言模型的能力。
English: Large Language Models underperform in Arabic due to data scarcity and linguistic complexity, prompting the introduction of BALSAM, a community-driven benchmark and platform to standardize evaluation and foster collaborative research for advancing Arabic LLM capabilities.
Authors:Zengyang Li, Yimeng Li, Binbin Huang, Peng Liang, Ran Mo, Hui Liu, Yutao Ma
Abstract:
Multilingual programming, which involves using multiple programming languages (PLs) in a single project, is increasingly common due to its benefits. However, it introduces cross-language bugs (CLBs), which arise from interactions between different PLs and are difficult to detect by single-language bug detection tools. This paper investigates the potential of pre-trained code language models (CodeLMs) in CLB detection. We developed CLCFinder, a cross-language code identification tool, and constructed a CLB dataset involving three PL combinations (Python-C/C++, Java-C/C++, and Python-Java) with nine interaction types. We fine-tuned 13 CodeLMs on this dataset and evaluated their performance, analyzing the effects of dataset size, token sequence length, and code comments. Results show that all CodeLMs performed poorly before fine-tuning, but exhibited varying degrees of performance improvement after fine-tuning, with UniXcoder-base achieving the best F1 score (0.7407). Notably, small fine-tuned CodeLMs tended to performe better than large ones. CodeLMs fine-tuned on single-language bug datasets performed poorly on CLB detection, demonstrating the distinction between CLBs and single-language bugs. Additionally, increasing the fine-tuning dataset size significantly improved performance, while longer token sequences did not necessarily improve the model performance. The impact of code comments varied across models. Some fine-tuned CodeLMs' performance was improved, while others showed degraded performance.
中文: 本文研究了使用预训练代码语言模型检测多语言编程中的跨语言错误,发现微调能显著提升模型性能,其中UniXcoder-base表现最佳,并揭示了跨语言错误与单语言错误的本质区别。
English: This paper explores using pre-trained code language models (CodeLMs) to detect cross-language bugs (CLBs) in multilingual programming, finding that fine-tuning significantly improves their performance, with UniXcoder-base achieving the best results, while highlighting the distinct nature of CLBs compared to single-language bugs.
Authors:Yuying Zhang, Kevin Sebastian Luck, Francesco Verdoja, Ville Kyrki, Joni Pajarinen
Abstract:
Mobile manipulation is a critical capability for robots operating in diverse, real-world environments. However, manipulating deformable objects and materials remains a major challenge for existing robot learning algorithms. While various benchmarks have been proposed to evaluate manipulation strategies with rigid objects, there is still a notable lack of standardized benchmarks that address mobile manipulation tasks involving deformable objects.
To address this gap, we introduce MoDeSuite, the first Mobile Manipulation Deformable Object task suite, designed specifically for robot learning. MoDeSuite consists of eight distinct mobile manipulation tasks covering both elastic objects and deformable objects, each presenting a unique challenge inspired by real-world robot applications. Success in these tasks requires effective collaboration between the robot's base and manipulator, as well as the ability to exploit the deformability of the objects. To evaluate and demonstrate the use of the proposed benchmark, we train two state-of-the-art reinforcement learning algorithms and two imitation learning algorithms, highlighting the difficulties encountered and showing their performance in simulation. Furthermore, we demonstrate the practical relevance of the suite by deploying the trained policies directly into the real world with the Spot robot, showcasing the potential for sim-to-real transfer. We expect that MoDeSuite will open a novel research domain in mobile manipulation involving deformable objects. Find more details, code, and videos at https://sites.google.com/view/modesuite/home.
中文: 针对机器人移动操作中可变形物体操控的挑战,我们推出了MoDeSuite这一全新任务套件,包含八项现实应用启发的任务,通过仿真和Spot机器人的实际部署验证了算法性能,旨在推动该领域的研究发展。
English: Mobile manipulation of deformable objects remains a challenging area in robotics, prompting the introduction of MoDeSuite, a novel task suite designed to evaluate robot learning algorithms through eight real-world inspired tasks and demonstrate their performance via simulation and real-world deployment with Spot robots.
Authors:Xinye Cao, Hongcan Guo, Guoshun Nan, Jiaoyang Cui, Haoting Qian, Yihan Lin, Yilin Peng, Diyang Zhang, Yanzhao Hou, Huici Wu, Xiaofeng Tao, Tony Q. S. Quek
Abstract:
Interactive multimodal applications (IMAs), such as route planning in the Internet of Vehicles, enrich users' personalized experiences by integrating various forms of data over wireless networks. Recent advances in large language models (LLMs) utilize mixture-of-experts (MoE) mechanisms to empower multiple IMAs, with each LLM trained individually for a specific task that presents different business workflows. In contrast to existing approaches that rely on multiple LLMs for IMAs, this paper presents a novel paradigm that accomplishes various IMAs using a single compositional LLM over wireless networks. The two primary challenges include 1) guiding a single LLM to adapt to diverse IMA objectives and 2) ensuring the flexibility and efficiency of the LLM in resource-constrained mobile environments. To tackle the first challenge, we propose ContextLoRA, a novel method that guides an LLM to learn the rich structured context among IMAs by constructing a task dependency graph. We partition the learnable parameter matrix of neural layers for each IMA to facilitate LLM composition. Then, we develop a step-by-step fine-tuning procedure guided by task relations, including training, freezing, and masking phases. This allows the LLM to learn to reason among tasks for better adaptation, capturing the latent dependencies between tasks. For the second challenge, we introduce ContextGear, a scheduling strategy to optimize the training procedure of ContextLoRA, aiming to minimize computational and communication costs through a strategic grouping mechanism. Experiments on three benchmarks show the superiority of the proposed ContextLoRA and ContextGear. Furthermore, we prototype our proposed paradigm on a real-world wireless testbed, demonstrating its practical applicability for various IMAs. We will release our code to the community.
中文: 本文提出了一种新颖的范式,利用单一组合式大语言模型在无线网络中处理多种交互式多模态应用,通过ContextLoRA实现任务适应和ContextGear优化调度,经实验和实际原型验证了其有效性。
English: This paper introduces a novel paradigm using a single compositional large language model (LLM) to handle multiple interactive multimodal applications (IMAs) over wireless networks, addressing challenges through ContextLoRA for task adaptation and ContextGear for efficient scheduling, validated by experiments and a real-world prototype.
Authors:Weicheng Zheng, Xiaofei Mao, Nanfei Ye, Pengxiang Li, Kun Zhan, Xianpeng Lang, Hang Zhao
Abstract:
Vision-Language Models (VLMs) are advancing autonomous driving, yet their potential is constrained by myopic decision-making and passive perception, limiting reliability in complex environments. We introduce DriveAgent-R1 to tackle these challenges in long-horizon, high-level behavioral decision-making. DriveAgent-R1 features two core innovations: a Hybrid-Thinking framework that adaptively switches between efficient text-based and in-depth tool-based reasoning, and an Active Perception mechanism with a vision toolkit to proactively resolve uncertainties, thereby balancing decision-making efficiency and reliability. The agent is trained using a novel, three-stage progressive reinforcement learning strategy designed to master these hybrid capabilities. Extensive experiments demonstrate that DriveAgent-R1 achieves state-of-the-art performance, outperforming even leading proprietary large multimodal models, such as Claude Sonnet 4. Ablation studies validate our approach and confirm that the agent's decisions are robustly grounded in actively perceived visual evidence, paving a path toward safer and more intelligent autonomous systems.
中文: DriveAgent-R1是首个具备主动感知能力的自动驾驶代理,能主动调用工具进行视觉推理,并根据场景复杂度自适应切换推理模式,以少量参数实现与顶尖模型和人类驾驶相媲美的性能。
English: DriveAgent-R1 introduces the first autonomous driving agent with active perception, enabling it to proactively use tools for visual reasoning and adaptively switch between text-only and visual reasoning, achieving competitive performance with minimal parameters.
Authors:Weicheng Zheng, Xiaofei Mao, Nanfei Ye, Pengxiang Li, Kun Zhan, Xianpeng Lang, Hang Zhao
Abstract:
The advent of Vision-Language Models (VLMs) has significantly advanced end-to-end autonomous driving, demonstrating powerful reasoning abilities for high-level behavior planning tasks. However, existing methods are often constrained by a passive perception paradigm, relying solely on text-based reasoning. This passivity restricts the model's capacity to actively seek crucial visual evidence when faced with uncertainty. To address this, we introduce DriveAgent-R1, the first autonomous driving agent capable of active perception for planning. In complex scenarios, DriveAgent-R1 proactively invokes tools to perform visual reasoning, firmly grounding its decisions in visual evidence, thereby enhancing both interpretability and reliability. Furthermore, we propose a hybrid thinking framework, inspired by human driver cognitive patterns, allowing the agent to adaptively switch between efficient text-only reasoning and robust tool-augmented visual reasoning based on scene complexity. This capability is cultivated through a three-stage progressive training strategy, featuring a core Cascaded Reinforcement Learning (Cascaded RL) phase. Extensive experiments on the Drive-Internal dataset, which is rich in long-tail scenarios, and the public nuScenes dataset show that, with only 3B parameters, DriveAgent-R1 achieves competitive performance comparable to top closed model systems such as GPT-5 and to human driving proficiency while remaining deployment-friendly, offering a proven path toward building more intelligent autonomous driving systems.
中文: DriveAgent-R1是首个具备主动感知能力的自动驾驶代理,能主动调用工具进行视觉推理,并根据场景复杂度自适应切换推理模式,以少量参数实现与顶尖模型和人类驾驶相媲美的性能。
English: DriveAgent-R1 introduces the first autonomous driving agent with active perception, enabling it to proactively use tools for visual reasoning and adaptively switch between text-only and visual reasoning, achieving competitive performance with minimal parameters.
Authors:Keyan Ding, Jing Yu, Junjie Huang, Yuchen Yang, Qiang Zhang, Huajun Chen
Abstract:
Scientific research increasingly relies on specialized computational tools, yet effectively utilizing these tools demands substantial domain expertise. While Large Language Models (LLMs) show promise in tool automation, they struggle to seamlessly integrate and orchestrate multiple tools for complex scientific workflows. Here, we present SciToolAgent, an LLM-powered agent that automates hundreds of scientific tools across biology, chemistry, and materials science. At its core, SciToolAgent leverages a scientific tool knowledge graph that enables intelligent tool selection and execution through graph-based retrieval-augmented generation. The agent also incorporates a comprehensive safety-checking module to ensure responsible and ethical tool usage. Extensive evaluations on a curated benchmark demonstrate that SciToolAgent significantly outperforms existing approaches. Case studies in protein engineering, chemical reactivity prediction, chemical synthesis, and metal-organic framework screening further demonstrate SciToolAgent's capability to automate complex scientific workflows, making advanced research tools accessible to both experts and non-experts.
中文:SciToolAgent是一种基于大语言模型的智能代理,通过知识图谱和安全模块自动化数百种科学工具,显著超越现有方法,使专家和非专家都能便捷处理复杂科研流程。
English: SciToolAgent is an LLM-powered agent that automates hundreds of scientific tools through a knowledge graph and safety module, significantly outperforming existing methods and making complex workflows accessible to experts and non-experts alike.
Authors:Minju Kim, Dongje Yoo, Yeonjun Hwang, Minseok Kang, Namyoung Kim, Minju Gwak, Beong-woo Kwak, Hyungjoo Chae, Harim Kim, Yunjoong Lee, Min Hee Kim, Dayi Jung, Kyong-Mee Chung, Jinyoung Yeo
Abstract:
Understanding clients' thoughts and beliefs is fundamental in counseling, yet current evaluations of LLM therapists often fail to assess this ability. Existing evaluation methods rely on client simulators that clearly disclose internal states to the therapist, making it difficult to determine whether an LLM therapist can uncover unexpressed perspectives. To address this limitation, we introduce MindVoyager, a novel evaluation framework featuring a controllable and realistic client simulator which dynamically adapts itself based on the ongoing counseling session, offering a more realistic and challenging evaluation environment. We further introduce evaluation metrics that assess the exploration ability of LLM therapists by measuring their thorough understanding of client's beliefs and thoughts.
中文摘要:现有大语言模型心理咨询师的评估因客户模拟器直接暴露内心状态而存在局限,为此我们提出MindVoyager框架,通过动态自适应的客户模拟器和专门评估指标,更真实地衡量咨询师探索客户未表达信念与想法的能力。
English Summary: Current LLM therapist evaluations are limited by unrealistic client simulators that explicitly reveal internal states, prompting the development of MindVoyager—a dynamic framework with adaptive client simulation and specialized metrics to assess therapists' ability to uncover unexpressed client perspectives.
Authors:Yiting Wang, Wanghao Ye, Yexiao He, Yiran Chen, Gang Qu, Ang Li
Abstract:
This paper presents MCP4EDA, the first Model Context Protocol server that enables Large Language Models (LLMs) to control and optimize the complete open-source RTL-to-GDSII design flow through natural language interaction. The system integrates Yosys synthesis, Icarus Verilog simulation, OpenLane place-and-route, GTKWave analysis, and KLayout visualization into a unified LLM-accessible interface, enabling designers to execute complex multi-tool EDA workflows conversationally via AI assistants such as Claude Desktop and Cursor IDE. The principal contribution is a backend-aware synthesis optimization methodology wherein LLMs analyze actual post-layout timing, power, and area metrics from OpenLane results to iteratively refine synthesis TCL scripts, establishing a closed-loop optimization system that bridges the traditional gap between synthesis estimates and physical implementation reality. In contrast to conventional flows that rely on wire-load models, this methodology leverages real backend performance data to guide synthesis parameter tuning, optimization sequence selection, and constraint refinement, with the LLM functioning as an intelligent design space exploration agent. Experimental evaluation on representative digital designs demonstrates 15-30% improvements in timing closure and 10-20% area reduction compared to default synthesis flows, establishing MCP4EDA as the first practical LLM-controlled end-to-end open-source EDA automation system. The code and demo are avaiable at: http://www.agent4eda.com/
中文: MCP4EDA是首个通过自然语言交互让大语言模型控制完整开源RTL至GDSII设计流程的模型上下文协议服务器,将多种EDA工具集成至统一接口,实现基于实际布局数据的闭环优化,相比传统流程可获得15-30%时序改善和10-20%面积缩减。
English: MCP4EDA is the first Model Context Protocol server enabling LLMs to control the complete open-source RTL-to-GDSII design flow through natural language, integrating multiple EDA tools into a unified interface for conversational AI-driven optimization.
Authors:Zhaoxi Mu, Rilin Chen, Andong Li, Meng Yu, Xinyu Yang, Dong Yu
Abstract:
This paper introduces OmniGSE, a novel general speech enhancement (GSE) framework designed to mitigate the diverse distortions that speech signals encounter in real-world scenarios. These distortions include background noise, reverberation, bandwidth limitations, signal clipping, and network packet loss. Existing methods typically focus on optimizing for a single type of distortion, often struggling to effectively handle the simultaneous presence of multiple distortions in complex scenarios. OmniGSE bridges this gap by integrating the strengths of discriminative and generative approaches through a two-stage architecture that enables cross-domain collaborative optimization. In the first stage, continuous features are enhanced using a lightweight channel-split NAC-RoFormer. In the second stage, discrete tokens are generated to reconstruct high-quality speech through language models. Specifically, we designed a hierarchical language model structure consisting of a RootLM and multiple BranchLMs. The RootLM models general acoustic features across codebook layers, while the BranchLMs explicitly capture the progressive relationships between different codebook levels. Experimental results demonstrate that OmniGSE surpasses existing models across multiple benchmarks, particularly excelling in scenarios involving compound distortions. These findings underscore the framework's potential for robust and versatile speech enhancement in real-world applications.
中文摘要:OmniGSE是一种新型语音增强框架,通过两阶段架构结合判别式和生成式方法的优势,能有效处理现实场景中多种失真同时存在的问题,在复合失真情况下表现尤为突出。
English Summary: OmniGSE is a novel speech enhancement framework that combines discriminative and generative approaches in a two-stage architecture to effectively handle multiple simultaneous distortions in real-world scenarios, outperforming existing models especially in compound distortion cases.
Authors:Binxu Li, Yuhui Zhang, Xiaohan Wang, Weixin Liang, Ludwig Schmidt, Serena Yeung-Levy
Abstract:
Mixed modality search -- retrieving information across a heterogeneous corpus composed of images, texts, and multimodal documents -- is an important yet underexplored real-world application. In this work, we investigate how contrastive vision-language models, such as CLIP, perform on the mixed modality search task. Our analysis reveals a critical limitation: these models exhibit a pronounced modality gap in the embedding space, where image and text embeddings form distinct clusters, leading to intra-modal ranking bias and inter-modal fusion failure. To address this issue, we propose GR-CLIP, a lightweight post-hoc calibration method that removes the modality gap in CLIP's embedding space. Evaluated on MixBench -- the first benchmark specifically designed for mixed modality search -- GR-CLIP improves NDCG@10 by up to 26 percentage points over CLIP, surpasses recent vision-language generative embedding models by 4 percentage points, while using 75x less compute.
中文: 本研究揭示了CLIP模型在混合模态检索中存在显著的模态鸿沟问题,并提出轻量级校准方法GR-CLIP,该方法在极大降低计算成本的同时显著提升了检索精度。
English: This study identifies a significant modality gap in CLIP models that hinders mixed modality search performance, and introduces GR-CLIP, a lightweight calibration method that substantially improves retrieval accuracy while drastically reducing computational costs.
Authors:Varsha Ramineni, Hossein A. Rahmani, Emine Yilmaz, David Barber
Abstract:
As AI becomes prevalent in high-risk domains and decision-making, it is essential to test for potential harms and biases. This urgency is reflected by the global emergence of AI regulations that emphasise fairness and adequate testing, with some mandating independent bias audits. However, procuring the necessary data for fairness testing remains a significant challenge. Particularly in industry settings, legal and privacy concerns restrict the collection of demographic data required to assess group disparities, and auditors face practical and cultural challenges in gaining access to data. Further, internal historical datasets are often insufficiently representative to identify real-world biases. This work focuses on evaluating classifier fairness when complete datasets including demographics are inaccessible. We propose leveraging separate overlapping datasets to construct complete synthetic data that includes demographic information and accurately reflects the underlying relationships between protected attributes and model features. We validate the fidelity of the synthetic data by comparing it to real data, and empirically demonstrate that fairness metrics derived from testing on such synthetic data are consistent with those obtained from real data. This work, therefore, offers a path to overcome real-world data scarcity for fairness testing, enabling independent, model-agnostic evaluation of fairness, and serving as a viable substitute where real data is limited.
Chinese: 本研究针对人工智能公平性测试中人口统计数据不足的挑战,提出通过构建保留受保护属性与模型特征间关系的合成数据集,在真实数据不可得时实现可靠的公平性评估。
English: This study addresses the challenge of limited demographic data for AI fairness testing by proposing a method to create synthetic datasets that preserve the relationships between protected attributes and model features, enabling reliable fairness evaluations when real data is inaccessible.
Authors:Lars Ullrich, Michael Buchholz, Jonathan Petit, Klaus Dietmayer, Knut Graichen
Abstract:
Efficient scalability of automated driving (AD) is key to reducing costs, enhancing safety, conserving resources, and maximizing impact. However, research focuses on specific vehicles and context, while broad deployment requires scalability across various configurations and environments. Differences in vehicle types, sensors, actuators, but also traffic regulations, legal requirements, cultural dynamics, or even ethical paradigms demand high flexibility of data-driven developed capabilities. In this paper, we address the challenge of scalable adaptation of generic capabilities to desired systems and environments. Our concept follows a two-stage fine-tuning process. In the first stage, fine-tuning to the specific environment takes place through a country-specific reward model that serves as an interface between technological adaptations and socio-political requirements. In the second stage, vehicle-specific transfer learning facilitates system adaptation and governs the validation of design decisions. In sum, our concept offers a data-driven process that integrates both technological and socio-political aspects, enabling effective scalability across technical, legal, cultural, and ethical differences.
中文摘要:本文提出一种两阶段微调概念,通过国家特定奖励模型和车辆特定迁移学习,将技术与社会政治因素相结合,实现自动驾驶在多样化环境中的有效扩展。
English Summary: This paper introduces a two-stage fine-tuning concept for scalable automated driving, integrating country-specific reward models and vehicle-specific transfer learning to address both technological and socio-political challenges across diverse environments.
Authors:Nenad Petrovic, Fengjunjie Pan, Vahid Zolfaghari, Krzysztof Lebioda, Andre Schamschurko, Alois Knoll
Abstract:
This paper introduces a GenAI-empowered approach to automated development of automotive software, with emphasis on autonomous and Advanced Driver Assistance Systems (ADAS) capabilities. The process starts with requirements as input, while the main generated outputs are test scenario code for simulation environment, together with implementation of desired ADAS capabilities targeting hardware platform of the vehicle connected to testbench. Moreover, we introduce additional steps for requirements consistency checking leveraging Model-Driven Engineering (MDE). In the proposed workflow, Large Language Models (LLMs) are used for model-based summarization of requirements (Ecore metamodel, XMI model instance and OCL constraint creation), test scenario generation, simulation code (Python) and target platform code generation (C++). Additionally, Retrieval Augmented Generation (RAG) is adopted to enhance test scenario generation from autonomous driving regulations-related documents. Our approach aims shorter compliance and re-engineering cycles, as well as reduced development and testing time when it comes to ADAS-related capabilities.
中文: 本文提出一种基于生成式AI的汽车软件自动化开发方法,重点针对自动驾驶和高级驾驶辅助系统功能,利用大语言模型和检索增强生成技术从需求自动生成测试场景、仿真代码及平台专用代码,并通过模型驱动工程进行一致性检查,以缩短合规周期并提升开发效率。
English: This paper presents a GenAI-driven method for automating automotive software development, particularly for autonomous and ADAS functions, utilizing LLMs and RAG to generate test scenarios, simulation code, and platform-specific implementations from requirements, while incorporating MDE for consistency checks to accelerate compliance and reduce development time.
Authors:Yujie Ma, Lili Quan, Xiaofei Xie, Qiang Hu, Jiongchi Yu, Yao Zhang, Sen Chen
Abstract:
The rise of Large Language Models (LLMs) has led to the widespread deployment of LLM-based systems across diverse domains. As these systems proliferate, understanding the risks associated with their complex supply chains is increasingly important. LLM-based systems are not standalone as they rely on interconnected supply chains involving pretrained models, third-party libraries, datasets, and infrastructure. Yet, most risk assessments narrowly focus on model or data level, overlooking broader supply chain vulnerabilities. While recent studies have begun to address LLM supply chain risks, there remains a lack of benchmarks for systematic research.
To address this gap, we introduce the first comprehensive dataset for analyzing and benchmarking LLM supply chain security. We collect 3,859 real-world LLM applications and perform interdependency analysis, identifying 109,211 models, 2,474 datasets, and 9,862 libraries. We extract model fine-tuning paths, dataset reuse, and library reliance, mapping the ecosystem's structure. To evaluate security, we gather 1,555 risk-related issues-50 for applications, 325 for models, 18 for datasets, and 1,229 for libraries from public vulnerability databases.
Using this dataset, we empirically analyze component dependencies and risks. Our findings reveal deeply nested dependencies in LLM applications and significant vulnerabilities across the supply chain, underscoring the need for comprehensive security analysis. We conclude with practical recommendations to guide researchers and developers toward safer, more trustworthy LLM-enabled systems.
中文: 本研究首次引入综合分析大语言模型供应链安全的数据集,揭示了各组件间深度依赖关系和显著安全漏洞,并为构建更安全的系统提出了实用建议。
English: This study introduces the first comprehensive dataset to analyze LLM supply chain security, revealing deep dependencies and significant vulnerabilities across components, and provides practical recommendations for safer systems.
Authors:Shanbo Cheng, Yu Bao, Zhichao Huang, Yu Lu, Ningxin Peng, Lu Xu, Runsheng Yu, Rong Cao, Yujiao Du, Ting Han, Yuxiang Hu, Zeyang Li, Sitong Liu, Shengtao Ma, Shiguang Pan, Jiongchen Xiao, Nuo Xu, Meng Yang, Rong Ye, Yiming Yu, Jun Zhang, Ruofei Zhang, Wanyi Zhang, Wenhao Zhu, Liehao Zou, Lu Lu, Yuxuan Wang, Yonghui Wu
Abstract:
Simultaneous Interpretation (SI) represents one of the most daunting frontiers in the translation industry, with product-level automatic systems long plagued by intractable challenges: subpar transcription and translation quality, lack of real-time speech generation, multi-speaker confusion, and translated speech inflation, especially in long-form discourses. In this study, we introduce Seed-LiveInterpret 2.0, an end-to-end SI model that delivers high-fidelity, ultra-low-latency speech-to-speech generation with voice cloning capabilities. As a fully operational product-level solution, Seed-LiveInterpret 2.0 tackles these challenges head-on through our novel duplex speech-to-speech understanding-generating framework. Experimental results demonstrate that through large-scale pretraining and reinforcement learning, the model achieves a significantly better balance between translation accuracy and latency, validated by human interpreters to exceed 70% correctness in complex scenarios. Notably, Seed-LiveInterpret 2.0 outperforms commercial SI solutions by significant margins in translation quality, while slashing the average latency of cloned speech from nearly 10 seconds to a near-real-time 3 seconds, which is around a near 70% reduction that drastically enhances practical usability.
中文:Seed-LiveInterpret 2.0作为端到端同声传译模型,通过突破性双工框架实现了高精度翻译与近实时语音克隆,在翻译质量和延迟控制上显著优于商业解决方案。
English: Seed-LiveInterpret 2.0 is an end-to-end simultaneous interpretation model that overcomes key industry challenges by achieving high translation accuracy with near-real-time latency and voice cloning, significantly outperforming existing commercial solutions.
Authors:Zhe Xu, Ziyi Liu, Junlin Hou, Jiabo Ma, Cheng Jin, Yihui Wang, Zhixuan Chen, Zhengyu Zhang, Fuxiang Huang, Zhengrui Guo, Fengtao Zhou, Yingxue Xu, Xi Wang, Ronald Cheong Kin Chan, Li Liang, Hao Chen
Abstract:
Multimodal large language models (MLLMs) have emerged as powerful tools for computational pathology, offering unprecedented opportunities to integrate pathological images with language context for comprehensive diagnostic analysis. These models hold particular promise for automating complex tasks that traditionally require expert interpretation of pathologists. However, current MLLM approaches in pathology demonstrate significantly constrained reasoning capabilities, primarily due to their reliance on expensive chain-of-thought annotations. Additionally, existing methods remain limited to simplex application of visual question answering (VQA) at the region-of-interest (ROI) level, failing to address the full spectrum of diagnostic needs such as ROI classification, detection, segmentation, whole-slide-image (WSI) classification and VQA in clinical practice. In this study, we present SmartPath-R1, a versatile MLLM capable of simultaneously addressing both ROI-level and WSI-level tasks while demonstrating robust pathological reasoning capability. Our framework combines scale-dependent supervised fine-tuning and task-aware reinforcement fine-tuning, which circumvents the requirement for chain-of-thought supervision by leveraging the intrinsic knowledge within MLLM. Furthermore, SmartPath-R1 integrates multiscale and multitask analysis through a mixture-of-experts mechanism, enabling dynamic processing for diverse tasks. We curate a large-scale dataset comprising 2.3M ROI samples and 188K WSI samples for training and evaluation. Extensive experiments across 72 tasks validate the effectiveness and superiority of the proposed approach. This work represents a significant step toward developing versatile, reasoning-enhanced AI systems for precision pathology.
Chinese: SmartPath-R1作为一种多功能多模态大语言模型,通过结合尺度相关的监督微调和任务感知的强化微调,无需思维链标注即可实现强大的病理推理能力,并能同时处理ROI和WSI级别的多种任务,在大型数据集上的广泛实验验证了其优越性。
English: SmartPath-R1 is a versatile multimodal large language model that overcomes the limitations of current approaches in computational pathology by enabling robust reasoning and handling both ROI-level and WSI-level tasks without chain-of-thought supervision, validated through extensive experiments on a large-scale dataset.
Authors:Yufei He, Ruoyu Li, Alex Chen, Yue Liu, Yulin Chen, Yuan Sui, Cheng Chen, Yi Zhu, Luca Luo, Frank Yang, Bryan Hooi
Abstract:
Large language model (LLM) agents often struggle in environments where rules and required domain knowledge frequently change, such as regulatory compliance and user risk screening. Current approaches, like offline fine-tuning and standard prompting, are insufficient because they cannot effectively adapt to new knowledge during actual operation. To address this limitation, we propose the Adaptive Reflective Interactive Agent (ARIA), an LLM agent framework designed specifically to continuously learn updated domain knowledge at test time. ARIA assesses its own uncertainty through structured self-dialogue, proactively identifying knowledge gaps and requesting targeted explanations or corrections from human experts. It then systematically updates an internal, timestamped knowledge repository with provided human guidance, detecting and resolving conflicting or outdated knowledge through comparisons and clarification queries. We evaluate ARIA on the realistic customer due diligence name screening task on TikTok Pay, alongside publicly available dynamic knowledge tasks. Results demonstrate significant improvements in adaptability and accuracy compared to baselines using standard offline fine-tuning and existing self-improving agents. ARIA is deployed within TikTok Pay serving over 150 million monthly active users, confirming its practicality and effectiveness for operational use in rapidly evolving environments.
中文: 自适应反思交互代理(ARIA)框架通过自我对话识别知识缺口并整合专家指导,使大语言模型能够在运行中持续学习更新领域知识,在TikTok Pay客户筛查等实际应用中显著提升了适应性和准确性。
English: The proposed Adaptive Reflective Interactive Agent (ARIA) framework enables LLM agents to continuously learn updated domain knowledge during operation by identifying knowledge gaps through self-dialogue and incorporating human expert guidance, demonstrating significant improvements in adaptability and accuracy for real-world applications like TikTok Pay's customer screening.
Authors:Ahmad M. Nazar, Abdulkadir Celik, Mohamed Y. Selim, Asmaa Abdallah, Daji Qiao, Ahmed M. Eltawil
Abstract:
Vehicular communication systems operating in the millimeter wave (mmWave) band are highly susceptible to signal blockage from dynamic obstacles such as vehicles, pedestrians, and infrastructure. To address this challenge, we propose a proactive blockage prediction framework that utilizes multi-modal sensing, including camera, GPS, LiDAR, and radar inputs in an infrastructure-to-vehicle (I2V) setting. This approach uses modality-specific deep learning models to process each sensor stream independently and fuses their outputs using a softmax-weighted ensemble strategy based on validation performance. Our evaluations, for up to 1.5s in advance, show that the camera-only model achieves the best standalone trade-off with an F1-score of 97.1% and an inference time of 89.8ms. A camera+radar configuration further improves accuracy to 97.2% F1 at 95.7ms. Our results display the effectiveness and efficiency of multi-modal sensing for mmWave blockage prediction and provide a pathway for proactive wireless communication in dynamic environments.
Chinese: 所提出的主动阻塞预测框架利用多模态感知技术,有效预测车辆通信中毫米波信号的遮挡问题,通过优化的传感器融合实现了高达97.2%的F1分数精度。
English: The proposed proactive blockage prediction framework leverages multi-modal sensing to effectively anticipate millimeter wave signal obstructions in vehicular communications, achieving high accuracy with an F1-score up to 97.2% through optimized sensor fusion.
Authors:Jens V. Rüppel, Andrey Rudenko, Tim Schreiter, Martin Magnusson, Achim J. Lilienthal
Abstract:
The rapid development of Large Language Models (LLMs) creates an exciting potential for flexible, general knowledge-driven Human-Robot Interaction (HRI) systems for assistive robots. Existing HRI systems demonstrate great progress in interpreting and following user instructions, action generation, and robot task solving. On the other hand, bi-directional, multi-modal, and context-aware support of the user in collaborative tasks still remains an open challenge. In this paper, we present a gaze- and speech-informed interface to the assistive robot, which is able to perceive the working environment from multiple vision inputs and support the dynamic user in their tasks. Our system is designed to be modular and transferable to adapt to diverse tasks and robots, and it is capable of real-time use of language-based interaction state representation and fast on board perception modules. Its development was supported by multiple public dissemination events, contributing important considerations for improved robustness and user experience. Furthermore, in two lab studies, we compare the performance and user ratings of our system with those of a traditional scripted HRI pipeline. Our findings indicate that an LLM-based approach enhances adaptability and marginally improves user engagement and task execution metrics but may produce redundant output, while a scripted pipeline is well suited for more straightforward tasks.
中文: 本文提出了一种基于大型语言模型的模块化实时辅助机器人系统,通过多模态交互提升了人机协作的适应性和用户参与度,但相比传统脚本方法可能产生冗余输出。
English: This paper introduces a modular, real-time assistive robot system using Large Language Models to enhance adaptability and user engagement in Human-Robot Interaction, though it may generate some redundant outputs compared to traditional scripted approaches.
Authors:Xuhui Zhang, Wenchao Liu, Jinke Ren, Chunjie Wang, Huijun Xing, Yanyan Shen, Shuguang Cui
Abstract:
Movable-antennas (MAs) are revolutionizing spatial signal processing by providing flexible beamforming in next-generation wireless systems. This paper investigates an MA-empowered autonomous aerial vehicle (AAV) system in low-altitude wireless networks (LAWNs) for uplink data collection from ground users. We aim to maximize the sum achievable rate by jointly optimizing the AAV trajectory, receive beamforming, and MA positions. An efficient alternating optimization (AO) algorithm that incorporates successive convex approximation, weighted minimum mean square error, and particle swarm optimization is developed. The analysis of the computational complexity and convergence features is provided. Extensive simulations demonstrate superior performance in terms of the sum achievable rate and the service reliability comparing to several benchmark schemes. These results demonstrate the distinctive advantages of the proposed scheme: enhanced spectral efficiency via adaptive beam-user alignment and improved collection reliability through spatial interference management, highlighting the implementation potential of the MA-empowered LAWNs.
中文摘要:本文提出一种可动天线增强的自主飞行器系统,通过交替优化算法联合优化飞行轨迹、波束成形和天线位置,在低空无线网络中实现了比基准方案更高的频谱效率和收集可靠性。
English Summary: This paper proposes an MA-enhanced AAV system for uplink data collection in LAWNs, using an AO algorithm to optimize trajectory, beamforming, and antenna positions, achieving higher spectral efficiency and reliability than benchmark methods.
Authors:Baran Can Gül, Suraksha Nadig, Stefanos Tziampazis, Nasser Jazdi, Michael Weyrich
Abstract:
In-vehicle emotion recognition underpins adaptive driver-assistance systems and, ultimately, occupant safety. However, practical deployment is hindered by (i) modality fragility - poor lighting and occlusions degrade vision-based methods; (ii) physiological variability - heart-rate and skin-conductance patterns differ across individuals; and (iii) privacy risk - centralized training requires transmission of sensitive data. To address these challenges, we present FedMultiEmo, a privacy-preserving framework that fuses two complementary modalities at the decision level: visual features extracted by a Convolutional Neural Network from facial images, and physiological cues (heart rate, electrodermal activity, and skin temperature) classified by a Random Forest. FedMultiEmo builds on three key elements: (1) a multimodal federated learning pipeline with majority-vote fusion, (2) an end-to-end edge-to-cloud prototype on Raspberry Pi clients and a Flower server, and (3) a personalized Federated Averaging scheme that weights client updates by local data volume. Evaluated on FER2013 and a custom physiological dataset, the federated Convolutional Neural Network attains 77% accuracy, the Random Forest 74%, and their fusion 87%, matching a centralized baseline while keeping all raw data local. The developed system converges in 18 rounds, with an average round time of 120 seconds and a per-client memory footprint below 200 MB. These results indicate that FedMultiEmo offers a practical approach to real-time, privacy-aware emotion recognition in automotive settings.
中文摘要:FedMultiEmo是一种保护隐私的多模态框架,通过联邦学习融合面部和生理数据,在保持原始数据本地化的同时实现了87%的情绪识别准确率,有效解决了车载环境中的实际部署难题。
English Summary: FedMultiEmo is a privacy-preserving multimodal framework that combines facial and physiological data through federated learning, achieving 87% emotion recognition accuracy while keeping raw data local to address deployment challenges in automotive settings.
Authors:Leonard Bauersfeld, Davide Scaramuzza
Abstract:
Autonomous quadrotor flight in confined spaces such as pipes and tunnels presents significant challenges due to unsteady, self-induced aerodynamic disturbances. Very recent advances have enabled flight in such conditions, but they either rely on constant motion through the pipe to mitigate airflow recirculation effects or suffer from limited stability during hovering. In this work, we present the first closed-loop control system for quadrotors for hovering in narrow pipes that leverages real-time flow field measurements. We develop a low-latency, event-based smoke velocimetry method that estimates local airflow at high temporal resolution. This flow information is used by a disturbance estimator based on a recurrent convolutional neural network, which infers force and torque disturbances in real time. The estimated disturbances are integrated into a learning-based controller trained via reinforcement learning. The flow-feedback control proves particularly effective during lateral translation maneuvers in the pipe cross-section. There, the real-time disturbance information enables the controller to effectively counteract transient aerodynamic effects, thereby preventing collisions with the pipe wall. To the best of our knowledge, this work represents the first demonstration of an aerial robot with closed-loop control informed by real-time flow field measurements. This opens new directions for research on flight in aerodynamically complex environments. In addition, our work also sheds light on the characteristic flow structures that emerge during flight in narrow, circular pipes, providing new insights at the intersection of robotics and fluid dynamics.
中文摘要:本研究首次提出四旋翼飞行器在狭窄管道中实现稳定悬停的闭环控制系统,通过实时气流测量和学习型控制器有效抵消空气动力扰动。
English Summary: This study introduces the first closed-loop control system for quadrotors that enables stable hovering in narrow pipes by utilizing real-time airflow measurements and a learning-based controller to counteract aerodynamic disturbances.
Authors:Takaho Shimokasa, Hiroyuki Yomo, Federico Chiariotti, Junya Shiraishi, Petar Popovski
Abstract:
This paper investigates how to achieve both low-power operations of sensor nodes and accurate state estimation using Kalman filter for internet of things (IoT) monitoring employing wireless sensor networks under radio resource constraint. We consider two policies used by the base station to collect observations from the sensor nodes: (i) an oblivious policy, based on statistics of the observations, and (ii) a decentralized policy, based on autonomous decision of each sensor based on its instantaneous observation. This work introduces a wake-up receiver and wake-up signaling to both policies to improve the energy efficiency of the sensor nodes. The decentralized policy designed with random access prioritizes transmissions of instantaneous observations that are highly likely to contribute to the improvement of state estimation. Our numerical results show that the decentralized policy improves the accuracy of the estimation in comparison to the oblivious policy under the constraint on the radio resource and consumed energy when the correlation between the processes observed by the sensor nodes is low. We also clarify the degree of correlation in which the superiority of two policies changes.
中文摘要:本研究通过为忽略和分散两种策略引入唤醒接收器及信号机制,提升了物联网监测的能效,结果表明在过程相关性低且无线资源受限时,分散策略能更精确地进行状态估计。
English Summary: This study enhances IoT monitoring by integrating wake-up receivers and signaling into both oblivious and decentralized policies to boost energy efficiency, demonstrating that the decentralized policy offers superior state estimation accuracy under low process correlation and constrained radio resources.
Authors:Nenad Petrovic, Vahid Zolfaghari, Andre Schamschurko, Sven Kirchner, Fengjunjie Pan, Chengdng Wu, Nils Purschke, Aleksei Velsh, Krzysztof Lebioda, Yinglei Song, Yi Zhang, Lukasz Mazur, Alois Knoll
Abstract:
Adoption of state-of-art Generative Artificial Intelligence (GenAI) aims to revolutionize many industrial areas by reducing the amount of human intervention needed and effort for handling complex underlying processes. Automotive software development is considered to be a significant area for GenAI adoption, taking into account lengthy and expensive procedures, resulting from the amount of requirements and strict standardization. In this paper, we explore the adoption of GenAI for various steps of automotive software development, mainly focusing on requirements handling, compliance aspects and code generation. Three GenAI-related technologies are covered within the state-of-art: Large Language Models (LLMs), Retrieval Augmented Generation (RAG), Vision Language Models (VLMs), as well as overview of adopted prompting techniques in case of code generation. Additionally, we also derive a generalized GenAI-aided automotive software development workflow based on our findings from this literature review. Finally, we include a summary of a survey outcome, which was conducted among our automotive industry partners regarding the type of GenAI tools used for their daily work activities.
中文: 生成式人工智能的应用,如大语言模型、检索增强生成和视觉语言模型,正在通过优化需求处理、合规性及代码生成来革新汽车软件开发,行业调查结果也印证了这一趋势。
English: The adoption of Generative AI, including LLMs, RAG, and VLMs, is revolutionizing automotive software development by streamlining requirements, compliance, and code generation, as supported by industry survey insights.
Authors:Xiufeng Huang, Ka Chun Cheung, Runmin Cong, Simon See, Renjie Wan
Abstract:
Generalizable 3D Gaussian Splatting reconstruction showcases advanced Image-to-3D content creation but requires substantial computational resources and large datasets, posing challenges to training models from scratch. Current methods usually entangle the prediction of 3D Gaussian geometry and appearance, which rely heavily on data-driven priors and result in slow regression speeds. To address this, we propose \method, a disentangled framework for efficient 3D Gaussian prediction. Our method extracts features from local image pairs using a stereo vision backbone and fuses them via global attention blocks. Dedicated point and Gaussian prediction heads generate multi-view point-maps for geometry and Gaussian features for appearance, combined as GS-maps to represent the 3DGS object. A refinement network enhances these GS-maps for high-quality reconstruction. Unlike existing methods that depend on camera parameters, our approach achieves pose-free 3D reconstruction, improving robustness and practicality. By reducing resource demands while maintaining high-quality outputs, \method provides an efficient, scalable solution for real-world 3D content generation.
中文: 提出的\method框架有效解耦了三维高斯泼溅的几何与外观预测,实现了无需相机位姿的鲁棒重建,在降低计算需求的同时保持高质量输出。
English: The proposed \method framework efficiently disentangles geometry and appearance prediction for 3D Gaussian Splatting, enabling pose-free reconstruction with reduced computational demands while maintaining high-quality output.
Authors:Xiangyu Chen, Kaiwen Zhu, Yuandong Pu, Shuo Cao, Xiaohui Li, Wenlong Zhang, Yihao Liu, Yu Qiao, Jiantao Zhou, Chao Dong
Abstract:
Low-level vision involves a wide spectrum of tasks, including image restoration, enhancement, stylization, and feature extraction, which differ significantly in both task formulation and output domains. To address the challenge of unified modeling across such diverse tasks, we propose a Visual task Prompt-based Image Processing (VPIP) framework that leverages input-target image pairs as visual prompts to guide the model in performing a variety of low-level vision tasks. The framework comprises an end-to-end image processing backbone, a prompt encoder, and a prompt interaction module, enabling flexible integration with various architectures and effective utilization of task-specific visual representations. Based on this design, we develop a unified low-level vision model, GenLV, and evaluate its performance across multiple representative tasks. To explore the scalability of this approach, we extend the framework along two dimensions: model capacity and task diversity. We construct a large-scale benchmark consisting of over 100 low-level vision tasks and train multiple versions of the model with varying scales. Experimental results show that the proposed method achieves considerable performance across a wide range of tasks. Notably, increasing the number of training tasks enhances generalization, particularly for tasks with limited data, indicating the model's ability to learn transferable representations through joint training. Further evaluations in zero-shot generalization, few-shot transfer, and task-specific fine-tuning scenarios demonstrate the model's strong adaptability, confirming the effectiveness, scalability, and potential of the proposed framework as a unified foundation for general low-level vision modeling.
中文摘要:提出的视觉任务提示图像处理(VPIP)框架通过视觉提示统一多种底层视觉任务,在超过100项任务的实验中展现出优异性能、可扩展性和迁移学习能力。
English Summary: The proposed Visual task Prompt-based Image Processing (VPIP) framework unifies diverse low-level vision tasks through visual prompts, demonstrating strong performance, scalability, and transferability across over 100 tasks in experiments.
Authors:Xinheng Lyu, Yuci Liang, Wenting Chen, Meidan Ding, Jiaqi Yang, Guolin Huang, Daokun Zhang, Xiangjian He, Linlin Shen
Abstract:
Whole slide images (WSIs) are vital in digital pathology, enabling gigapixel tissue analysis across various pathological tasks. While recent advancements in multi-modal large language models (MLLMs) allow multi-task WSI analysis through natural language, they often underperform compared to task-specific models. Collaborative multi-agent systems have emerged as a promising solution to balance versatility and accuracy in healthcare, yet their potential remains underexplored in pathology-specific domains. To address these issues, we propose WSI-Agents, a novel collaborative multi-agent system for multi-modal WSI analysis. WSI-Agents integrates specialized functional agents with robust task allocation and verification mechanisms to enhance both task-specific accuracy and multi-task versatility through three components: (1) a task allocation module assigning tasks to expert agents using a model zoo of patch and WSI level MLLMs, (2) a verification mechanism ensuring accuracy through internal consistency checks and external validation using pathology knowledge bases and domain-specific models, and (3) a summary module synthesizing the final summary with visual interpretation maps. Extensive experiments on multi-modal WSI benchmarks show WSI-Agents's superiority to current WSI MLLMs and medical agent frameworks across diverse tasks.
中文: WSI-Agents是一种协作多智能体系统,通过专业化任务分配、验证机制和综合摘要,提升了多模态全切片图像分析的准确性和多功能性,在多种任务基准测试中优于现有模型。
English: WSI-Agents is a collaborative multi-agent system that enhances accuracy and versatility in multi-modal whole slide image analysis through specialized task allocation, verification mechanisms, and summary synthesis, outperforming existing models on benchmark tasks.
Authors:Nasir Khan, Asmaa Abdallah, Abdulkadir Celik, Ahmed M. Eltawil, Sinem Coleri
Abstract:
In line with the AI-native 6G vision, explainability and robustness are crucial for building trust and ensuring reliable performance in millimeter-wave (mmWave) systems. Efficient beam alignment is essential for initial access, but deep learning (DL) solutions face challenges, including high data collection overhead, hardware constraints, lack of explainability, and susceptibility to adversarial attacks. This paper proposes a robust and explainable DL-based beam alignment engine (BAE) for mmWave multiple-input multiple output (MIMO) systems. The BAE uses received signal strength indicator (RSSI) measurements from wide beams to predict the best narrow beam, reducing the overhead of exhaustive beam sweeping. To overcome the challenge of real-world data collection, this work leverages a site-specific digital twin (DT) to generate synthetic channel data closely resembling real-world environments. A model refinement via transfer learning is proposed to fine-tune the pre-trained model residing in the DT with minimal real-world data, effectively bridging mismatches between the digital replica and real-world environments. To reduce beam training overhead and enhance transparency, the framework uses deep Shapley additive explanations (SHAP) to rank input features by importance, prioritizing key spatial directions and minimizing beam sweeping. It also incorporates the Deep k-nearest neighbors (DkNN) algorithm, providing a credibility metric for detecting out-of-distribution inputs and ensuring robust, transparent decision-making. Experimental results show that the proposed framework reduces real-world data needs by 70%, beam training overhead by 62%, and improves outlier detection robustness by up to 8.5x, achieving near-optimal spectral efficiency and transparent decision making compared to traditional softmax based DL models.
中文: 本文提出了一种基于深度学习的毫米波MIMO系统波束对准框架,通过数字孪生和迁移学习显著降低了数据采集与波束训练开销,并利用SHAP和DkNN算法实现透明化决策与异常检测,有效提升了系统的鲁棒性和可解释性。
English: This paper introduces a robust and explainable deep learning-based beam alignment framework for mmWave MIMO systems, leveraging digital twins and transfer learning to significantly reduce data collection and beam training overhead while enhancing transparency and outlier detection through SHAP and DkNN algorithms.
Authors:Aditya Singh, Aastha Mishra, Manan Tayal, Shishir Kolathaya, Pushpak Jagtap
Abstract:
Ensuring both performance and safety is critical for autonomous systems operating in real-world environments. While safety filters such as Control Barrier Functions (CBFs) enforce constraints by modifying nominal controllers in real time, they can become overly conservative when the nominal policy lacks safety awareness. Conversely, solving State-Constrained Optimal Control Problems (SC-OCPs) via dynamic programming offers formal guarantees but is intractable in high-dimensional systems. In this work, we propose a novel two-stage framework that combines gradient-based Model Predictive Control (MPC) with CBF-based safety filtering for co-optimizing safety and performance. In the first stage, we relax safety constraints as penalties in the cost function, enabling fast optimization via gradient-based methods. This step improves scalability and avoids feasibility issues associated with hard constraints. In the second stage, we modify the resulting controller using a CBF-based Quadratic Program (CBF-QP), which enforces hard safety constraints with minimal deviation from the reference. Our approach yields controllers that are both performant and provably safe. We validate the proposed framework on two case studies, showcasing its ability to synthesize scalable, safe, and high-performance controllers for complex, high-dimensional autonomous systems.
中文: 本文提出了一种结合梯度模型预测控制与基于控制屏障函数安全过滤的两阶段框架,通过将安全约束转化为代价函数惩罚并实施硬约束修正,实现了高维自主系统中性能与安全性的协同优化。
English: This paper introduces a two-stage framework combining gradient-based MPC with CBF safety filtering to co-optimize performance and safety in autonomous systems, enabling scalable and provably safe controllers for high-dimensional applications.
Authors:Haoyang Li, Zhanchao Xu, Yiming Li, Xuejia Chen, Darian Li, Anxin Tian, Qingfa Xiao, Cheng Deng, Jun Wang, Qing Li, Lei Chen, Mingxuan Yuan
Abstract:
Multi-turn dialogues are essential in many real-world applications of large language models, such as chatbots and virtual assistants. As conversation histories become longer, existing large language models face increasing computational and memory challenges, which hinder their ability to provide efficient and responsive interactions. Most current acceleration methods either compress the context or optimize key value caching, but they often rely on fixed or position-based heuristics that do not adapt well to the dynamic and unpredictable patterns found in actual multi-turn conversations. In this paper, we present LoopServe, an adaptive dual-phase inference acceleration framework for large language models in multi-turn dialogues. LoopServe introduces two main innovations. First, it performs online sparsification during the prefilling phase by dynamically selecting the most important parts of the attention matrix for each new input. Second, it uses progressive key value compression during decoding by adaptively maintaining a relevant and efficient cache based on the most recently generated output tokens. We also propose a \href{https://huggingface.co/datasets/TreeAILab/Multi-turn_Long-context_Benchmark_for_LLMs}{new benchmark} with eleven multi-turn datasets that reflect realistic query positions and conversational dependencies. Extensive experiments demonstrate that LoopServe consistently achieves superior effectiveness compared to existing baselines and significantly accelerates LLM inference across a wide range of long-context dialogue tasks.
中文: LoopServe是一种自适应双阶段框架,通过预填充阶段动态稀疏化注意力机制和解码阶段渐进式压缩键值缓存,显著提升大语言模型在多轮对话中的推理效率,在各类长上下文对话任务中均展现出优越性能。
English: LoopServe is an adaptive dual-phase framework that accelerates large language model inference in multi-turn dialogues by dynamically sparsifying attention during prefilling and progressively compressing key-value caches during decoding, achieving superior efficiency and performance across diverse conversational tasks.
Authors:Haoyang Li, Zhanchao Xu, Yiming Li, Xuejia Chen, Darian Li, Anxin Tian, Qingfa Xiao, Cheng Deng, Jun Wang, Qing Li, Lei Chen, Mingxuan Yuan
Abstract:
Multi-turn dialogues are essential in many real-world applications of large language models, such as chatbots and virtual assistants. As conversation histories become longer, existing large language models face increasing computational and memory challenges, which hinder their ability to provide efficient and responsive interactions. Most current acceleration methods either compress the context or optimize key value caching, but they often rely on fixed or position-based heuristics that do not adapt well to the dynamic and unpredictable patterns found in actual multi-turn conversations. As a result, these models cannot accurately identify and prioritize the most relevant context, leading to degraded response quality. In this paper, we present LoopServe, an adaptive dual-phase inference acceleration framework for large language models in multi-turn dialogues. LoopServe introduces two main innovations. First, it performs online sparsification during the prefilling phase by dynamically selecting the most important parts of the attention matrix for each new input. Second, it uses progressive key value compression during decoding by adaptively maintaining a relevant and efficient cache based on the most recently generated output tokens. We also propose a new benchmark with eleven multi-turn datasets that reflect realistic query positions and conversational dependencies. Extensive experiments demonstrate that LoopServe consistently achieves superior effectiveness compared to existing baselines and significantly accelerates LLM inference across a wide range of long-context dialogue tasks.
中文: LoopServe是一种自适应双阶段框架,通过预填充阶段动态稀疏化注意力机制和解码阶段渐进式压缩键值缓存,显著提升大语言模型在多轮对话中的推理效率,在各类长上下文对话任务中均展现出优越性能。
English: LoopServe is an adaptive dual-phase framework that accelerates large language model inference in multi-turn dialogues by dynamically sparsifying attention during prefilling and progressively compressing key-value caches during decoding, achieving superior efficiency and performance across diverse conversational tasks.
Authors:Shanbo Cheng, Yu Bao, Qian Cao, Luyang Huang, Liyan Kang, Zhicheng Liu, Yu Lu, Wenhao Zhu, Jingwen Chen, Zhichao Huang, Tao Li, Yifu Li, Huiying Lin, Sitong Liu, Ningxin Peng, Shuaijie She, Lu Xu, Nuo Xu, Sen Yang, Runsheng Yu, Yiming Yu, Liehao Zou, Hang Li, Lu Lu, Yuxuan Wang, Yonghui Wu
Abstract:
Multilingual translation stands as a challenging task for large language models (LLMs) to handle intricate language patterns and stilted translations that arise in automated translations. In this paper, we introduce Seed-X, a family of open-source LLMs comprising instruct and reasoning models, pushing the limits of translation capability with 7B parameter size. The base model is pre-trained on a diverse, high-quality dataset encompassing both monolingual and bilingual content across 28 languages, harnessing the full potential of multilingual data. The instruct model is then finetuned to translate by Chain-of-Thought (CoT) reasoning and further enhanced through reinforcement learning (RL) to achieve better generalization across diverse language pairs. Seed-X achieves performance comparable to leading closed-source models, including Gemini-2.5 and GPT-4o, across 28 languages, and significantly outperforms larger open-source models in both automatic metrics and human evaluations. We share the best practices through our optimization process, and make the parameter public available for advancing translation research and applications.
Chinese: Seed-X系列开源大语言模型通过思维链推理和强化学习优化,在28种语言中实现了与顶尖闭源模型相媲美的翻译性能,并公开参数以推动研究与应用。
English: Seed-X is a family of open-source large language models that achieve translation performance comparable to leading closed-source models across 28 languages by leveraging Chain-of-Thought reasoning and reinforcement learning.
Authors:Xuanqi Gao, Weipeng Jiang, Juan Zhai, Shiqing Ma, Siyi Xie, Xinyang Yin, Chao Shen
Abstract:
The advent of neural machine translation (NMT) has revolutionized cross-lingual communication, yet preserving stylistic nuances remains a significant challenge. While existing approaches often require parallel corpora for style preservation, we introduce Babel, a novel framework that enhances stylistic fidelity in NMT using only monolingual corpora. Babel employs two key components: (1) a style detector based on contextual embeddings that identifies stylistic disparities between source and target texts, and (2) a diffusion-based style applicator that rectifies stylistic inconsistencies while maintaining semantic integrity. Our framework integrates with existing NMT systems as a post-processing module, enabling style-aware translation without requiring architectural modifications or parallel stylistic data. Extensive experiments on five diverse domains (law, literature, scientific writing, medicine, and educational content) demonstrate Babel's effectiveness: it identifies stylistic inconsistencies with 88.21% precision and improves stylistic preservation by 150% while maintaining a high semantic similarity score of 0.92. Human evaluation confirms that translations refined by Babel better preserve source text style while maintaining fluency and adequacy.
中文:Babel框架通过单语语料库提升神经机器翻译的风格保真度,利用风格检测器和基于扩散的应用器,在保持高语义准确性的同时,将风格保留能力提高150%,适用于法律、文学等多个领域。
English: The Babel framework enhances stylistic fidelity in neural machine translation using monolingual corpora, employing a style detector and diffusion-based applicator to improve style preservation by 150% while maintaining high semantic accuracy across diverse domains.
Authors:Rafiq Kamel, Filippo Guerranti, Simon Geisler, Stephan Günnemann
Abstract:
Large Language Models (LLMs) are increasingly applied to tasks involving structured inputs such as graphs. Abstract Meaning Representations (AMRs), which encode rich semantics as directed graphs, offer a rigorous testbed for evaluating LLMs on text generation from such structures. Yet, current methods often arbitrarily linearize AMRs, discarding key structural cues, or rely on architectures incompatible with standard LLMs. We introduce SAFT, a structure-aware fine-tuning approach that injects graph topology into pretrained LLMs without architectural changes. We compute direction-sensitive positional encodings from the magnetic Laplacian of transformed AMRs and project them into the embedding space of the LLM. While possibly applicable to any graph-structured inputs, we focus on AMR-to-text generation as a representative and challenging benchmark. SAFT sets a new state-of-the-art on AMR 3.0 with a 3.5 BLEU improvement over baselines. Gains scale with graph complexity, highlighting the value of structure-aware representations in enhancing LLM performance. SAFT offers a general and effective pathway for bridging structured data and language models.
中文摘要:SAFT提出了一种结构感知微调方法,通过磁拉普拉斯编码将图结构注入预训练大语言模型,在AMR到文本生成任务中实现了3.5 BLEU值的性能提升,创造了新的技术标杆。
English Summary: SAFT introduces a structure-aware fine-tuning method that injects graph topology into pretrained LLMs using magnetic Laplacian encodings, achieving state-of-the-art performance in AMR-to-text generation with a 3.5 BLEU improvement.
Authors:Baoshen Guo, Zhiqing Hong, Junyi Li, Shenhao Wang, Jinhua Zhao
Abstract:
Urban mobility data has significant connections with economic growth and plays an essential role in various smart-city applications. However, due to privacy concerns and substantial data collection costs, fine-grained human mobility trajectories are difficult to become publicly available on a large scale. A promising solution to address this issue is trajectory synthesizing. However, existing works often ignore the inherent structural complexity of trajectories, unable to handle complicated high-dimensional distributions and generate realistic fine-grained trajectories. In this paper, we propose Cardiff, a coarse-to-fine Cascaded hybrid diffusion-based trajectory synthesizing framework for fine-grained and privacy-preserving mobility generation. By leveraging the hierarchical nature of urban mobility, Cardiff decomposes the generation process into two distinct levels, i.e., discrete road segment-level and continuous fine-grained GPS-level: (i) In the segment-level, to reduce computational costs and redundancy in raw trajectories, we first encode the discrete road segments into low-dimensional latent embeddings and design a diffusion transformer-based latent denoising network for segment-level trajectory synthesis. (ii) Taking the first stage of generation as conditions, we then design a fine-grained GPS-level conditional denoising network with a noise augmentation mechanism to achieve robust and high-fidelity generation. Additionally, the Cardiff framework not only progressively generates high-fidelity trajectories through cascaded denoising but also flexibly enables a tunable balance between privacy preservation and utility. Experimental results on three large real-world trajectory datasets demonstrate that our method outperforms state-of-the-art baselines in various metrics.
Chinese: 本文提出Cardiff框架,通过级联扩散模型分层生成离散路段和连续GPS数据,合成细粒度且保护隐私的城市移动轨迹,在真实性和实用性上优于现有方法。
English: This paper introduces Cardiff, a cascaded diffusion-based framework that synthesizes fine-grained, privacy-preserving urban mobility trajectories by hierarchically generating discrete road segments and continuous GPS data, outperforming existing methods in realism and utility.
Authors:Inamullah, Ernesto Elias Vidal Rosas, Imran Razzak, Shoaib Jameel
Abstract:
Cardiovascular disease (CVD) remains the leading global cause of mortality, yet current risk stratification methods often fail to detect early, subclinical changes. Previous studies have generally not integrated retinal microvasculature characteristics with comprehensive serum lipidomic profiles as potential indicators of CVD risk. In this study, an innovative imaging omics framework was introduced, combining retinal microvascular traits derived through deep learning based image processing with serum lipidomic data to highlight asymptomatic biomarkers of cardiovascular risk beyond the conventional lipid panel. This represents the first large scale, covariate adjusted and stratified correlation analysis conducted in a healthy population, which is essential for identifying early indicators of disease. Retinal phenotypes were quantified using automated image analysis tools, while serum lipid profiling was performed by Ultra High Performance Liquid Chromatography Electrospray ionization High resolution mass spectrometry (UHPLC ESI HRMS). Strong, age- and sex-independent correlations were established, particularly between average artery width, vessel density, and lipid subclasses such as triacylglycerols (TAGs), diacylglycerols (DAGs), and ceramides (Cers). These associations suggest a converging mechanism of microvascular remodeling under metabolic stress. By linking detailed
vascular structural phenotypes to specific lipid species, this study fills a critical gap in the understanding of early CVD pathogenesis. This integration not only offers a novel perspective on microvascular metabolic associations but also presents a significant opportunity for the identification of robust, non-invasive biomarkers. Ultimately, these findings may support improved early detection, targeted prevention, and personalized approaches in cardiovascular healthcare.
Chinese: 本研究提出了一种创新的影像组学框架,将视网膜微血管特征与血清脂质组数据相结合,揭示了血管结构与特定脂质亚类之间的显著关联,为超越传统方法识别心血管疾病早期风险生物标志物提供了新途径。
English: This study introduces an innovative imaging omics framework that integrates retinal microvascular features with serum lipidomic data, revealing strong correlations between vascular structure and specific lipid subclasses to identify early biomarkers for cardiovascular disease risk beyond conventional methods.
Authors:Yuan-Chen Shu, Zhiwei Lin, Yongtao Wang
Abstract:
To address the performance limitations of the Segment Anything Model (SAM) in specific domains, existing works primarily adopt adapter-based one-step adaptation paradigms. However, some of these methods are specific developed for specific domains. If used on other domains may lead to performance degradation. This issue of catastrophic forgetting severely limits the model's scalability. To address this issue, this paper proposes RegCL, a novel non-replay continual learning (CL) framework designed for efficient multi-domain knowledge integration through model merging. Specifically, RegCL incorporates the model merging algorithm into the continual learning paradigm by merging the parameters of SAM's adaptation modules (e.g., LoRA modules) trained on different domains. The merging process is guided by weight optimization, which minimizes prediction discrepancies between the merged model and each of the domain-specific models. RegCL effectively consolidates multi-domain knowledge while maintaining parameter efficiency, i.e., the model size remains constant regardless of the number of tasks, and no historical data storage is required. Experimental results demonstrate that RegCL achieves favorable continual learning performance across multiple downstream datasets, validating its effectiveness in dynamic scenarios.
中文: 本文提出RegCL,一种无需重放数据的持续学习框架,通过合并领域特定的适配器参数来高效整合多领域知识,既能防止灾难性遗忘,又能保持固定模型规模且无需存储历史数据。
English: This paper introduces RegCL, a non-replay continual learning framework that merges domain-specific adapter parameters to integrate multi-domain knowledge efficiently while preventing catastrophic forgetting and maintaining fixed model size without storing historical data.
Authors:Kexuan Shi, Zhuang Qi, Jingjing Zhu, Lei Meng, Yaochen Zhang, Haibei Huang, Xiangxu Meng
Abstract:
Open-set few-shot image classification aims to train models using a small amount of labeled data, enabling them to achieve good generalization when confronted with unknown environments. Existing methods mainly use visual information from a single image to learn class representations to distinguish known from unknown categories. However, these methods often overlook the benefits of integrating rich contextual information. To address this issue, this paper proposes a prototypical augmentation and alignment method, termed ProtoConNet, which incorporates background information from different samples to enhance the diversity of the feature space, breaking the spurious associations between context and image subjects in few-shot scenarios. Specifically, it consists of three main modules: the clustering-based data selection (CDS) module mines diverse data patterns while preserving core features; the contextual-enhanced semantic refinement (CSR) module builds a context dictionary to integrate into image representations, which boosts the model's robustness in various scenarios; and the prototypical alignment (PA) module reduces the gap between image representations and class prototypes, amplifying feature distances for known and unknown classes. Experimental results from two datasets verified that ProtoConNet enhances the effectiveness of representation learning in few-shot scenarios and identifies open-set samples, making it superior to existing methods.
中文摘要:ProtoConNet提出了一种原型增强与对齐方法,通过整合多样本背景信息增强特征多样性,在开放集小样本图像分类中提升了模型鲁棒性和识别能力,优于现有方法。
English Summary: ProtoConNet introduces a prototypical augmentation and alignment method that leverages contextual information from multiple samples to enhance feature diversity and robustness in open-set few-shot image classification, outperforming existing approaches.
Authors:Orion Weller, Kathryn Ricci, Marc Marone, Antoine Chaffin, Dawn Lawrie, Benjamin Van Durme
Abstract:
The large language model (LLM) community focuses almost exclusively on decoder-only language models, since they are easier to use for text generation. However, a large subset of the community still uses encoder-only models for tasks such as classification or retrieval. Previous work has attempted to compare these architectures, but is forced to make comparisons with models that have different numbers of parameters, training techniques, and datasets. We introduce the SOTA open-data Ettin suite of models: paired encoder-only and decoder-only models ranging from 17 million parameters to 1 billion, trained on up to 2 trillion tokens. Using the same recipe for both encoder-only and decoder-only models produces SOTA recipes in both categories for their respective sizes, beating ModernBERT as an encoder and Llama 3.2 and SmolLM2 as decoders. Like previous work, we find that encoder-only models excel at classification and retrieval tasks while decoders excel at generative tasks. However, we show that adapting a decoder model to encoder tasks (and vice versa) through continued training is subpar compared to using only the reverse objective (i.e. a 400M encoder outperforms a 1B decoder on MNLI, and vice versa for generative tasks). We open-source all artifacts of this study including training data, training order segmented by checkpoint, and 200+ checkpoints to allow future work to analyze or extend all aspects of training.
中文: 本研究推出Ettin模型套件,证明编码器模型在分类检索任务中优于解码器模型,而解码器擅长生成任务,跨架构适配效果不如专用模型。
English: The study introduces the Ettin model suite, demonstrating that encoder-only models outperform decoder-only models in classification and retrieval tasks, while decoders excel in generation, with cross-architecture adaptation proving less effective than specialized models.
Authors:Yuhao Wang, Keyan Ding, Kehua Feng, Zeyuan Wang, Ming Qin, Xiaotong Li, Qiang Zhang, Huajun Chen
Abstract:
Protein language models have emerged as powerful tools for sequence generation, offering substantial advantages in functional optimization and denovo design. However, these models also present significant risks of generating harmful protein sequences, such as those that enhance viral transmissibility or evade immune responses. These concerns underscore critical biosafety and ethical challenges. To address these issues, we propose a Knowledge-guided Preference Optimization (KPO) framework that integrates prior knowledge via a Protein Safety Knowledge Graph. This framework utilizes an efficient graph pruning strategy to identify preferred sequences and employs reinforcement learning to minimize the risk of generating harmful proteins. Experimental results demonstrate that KPO effectively reduces the likelihood of producing hazardous sequences while maintaining high functionality, offering a robust safety assurance framework for applying generative models in biotechnology.
中文摘要:提出的知识引导偏好优化(KPO)框架通过整合蛋白质安全知识图谱,在保持功能性的同时显著降低有害蛋白质序列的生成风险,为生物技术领域的生成模型提供了可靠的安全保障。
English Summary: The proposed Knowledge-guided Preference Optimization (KPO) framework effectively reduces the generation of harmful protein sequences while preserving functionality, providing a robust safety solution for protein language models in biotechnology.
Authors:Kyungtae Han, Yitao Chen, Rohit Gupta, Onur Altintas
Abstract:
While autonomous driving technologies continue to advance, current Advanced Driver Assistance Systems (ADAS) remain limited in their ability to interpret scene context or engage with drivers through natural language. These systems typically rely on predefined logic and lack support for dialogue-based interaction, making them inflexible in dynamic environments or when adapting to driver intent. This paper presents Scene-Aware Conversational ADAS (SC-ADAS), a modular framework that integrates Generative AI components including large language models, vision-to-text interpretation, and structured function calling to enable real-time, interpretable, and adaptive driver assistance. SC-ADAS supports multi-turn dialogue grounded in visual and sensor context, allowing natural language recommendations and driver-confirmed ADAS control. Implemented in the CARLA simulator with cloud-based Generative AI, the system executes confirmed user intents as structured ADAS commands without requiring model fine-tuning. We evaluate SC-ADAS across scene-aware, conversational, and revisited multi-turn interactions, highlighting trade-offs such as increased latency from vision-based context retrieval and token growth from accumulated dialogue history. These results demonstrate the feasibility of combining conversational reasoning, scene perception, and modular ADAS control to support the next generation of intelligent driver assistance.
中文: 本文提出SC-ADAS框架,通过生成式AI实现基于场景感知和自然语言对话的实时可解释驾驶辅助,但面临视觉上下文检索延迟和对话历史增长等挑战。
English: This paper introduces SC-ADAS, a modular framework that leverages Generative AI to enable real-time, interpretable driver assistance through natural language dialogue and scene-aware interactions, though it faces challenges like increased latency and token growth.
Authors:Helge Spieker, Théo Matricon, Nassim Belmecheri, Jørn Eirik Betten, Gauthier Le Bartz Lyan, Heraldo Borges, Quentin Mazouni, Dennis Gross, Arnaud Gotlieb, Mathieu Acher
Abstract:
Software systems usually provide numerous configuration options that can affect performance metrics such as execution time, memory usage, binary size, or bitrate. On the one hand, making informed decisions is challenging and requires domain expertise in options and their combinations. On the other hand, machine learning techniques can search vast configuration spaces, but with a high computational cost, since concrete executions of numerous configurations are required. In this exploratory study, we investigate whether large language models (LLMs) can assist in performance-oriented software configuration through prompts. We evaluate several LLMs on tasks including identifying relevant options, ranking configurations, and recommending performant configurations across various configurable systems, such as compilers, video encoders, and SAT solvers. Our preliminary results reveal both positive abilities and notable limitations: depending on the task and systems, LLMs can well align with expert knowledge, whereas hallucinations or superficial reasoning can emerge in other cases. These findings represent a first step toward systematic evaluations and the design of LLM-based solutions to assist with software configuration.
中文摘要:大型语言模型在协助性能导向的软件配置方面展现出识别选项和排序配置的潜力,但在不同系统中存在幻觉推理和表面化分析等显著局限性。
English Summary: Large language models show potential in aiding performance-focused software configuration by identifying options and ranking configurations, though they exhibit limitations like hallucinations and inconsistent reasoning across different systems.
Authors:Basel Mousi, Nadir Durrani, Fahim Dalvi
Abstract:
While Knowledge Editing (KE) has been widely explored in English, its behavior in morphologically rich languages like Arabic remains underexamined. In this work, we present the first study of Arabic KE. We evaluate four methods (ROME, MEMIT, ICE, and LTE) on Arabic translations of the ZsRE and Counterfact benchmarks, analyzing both multilingual and cross-lingual settings. Our experiments on Llama-2-7B-chat show show that parameter-based methods struggle with cross-lingual generalization, while instruction-tuned methods perform more robustly. We extend Learning-To-Edit (LTE) to a multilingual setting and show that joint Arabic-English training improves both editability and transfer. We release Arabic KE benchmarks and multilingual training for LTE data to support future research.
中文: 本研究开创了阿拉伯语知识编辑领域,发现基于参数的方法在跨语言任务中表现不佳,而指令调优方法更为稳健,且阿拉伯语-英语联合训练能提升编辑能力和迁移效果。
English: This study pioneers knowledge editing in Arabic, revealing that parameter-based methods falter in cross-lingual tasks while instruction-tuned approaches excel, with joint Arabic-English training enhancing performance.
Authors:Xiaojie Lin, Baihe Ma, Xu Wang, Guangsheng Yu, Ying He, Wei Ni, Ren Ping Liu
Abstract:
Driving trajectory data remains vulnerable to privacy breaches despite existing mitigation measures. Traditional methods for detecting driving trajectories typically rely on map-matching the path using Global Positioning System (GPS) data, which is susceptible to GPS data outage. This paper introduces CAN-Trace, a novel privacy attack mechanism that leverages Controller Area Network (CAN) messages to uncover driving trajectories, posing a significant risk to drivers' long-term privacy. A new trajectory reconstruction algorithm is proposed to transform the CAN messages, specifically vehicle speed and accelerator pedal position, into weighted graphs accommodating various driving statuses. CAN-Trace identifies driving trajectories using graph-matching algorithms applied to the created graphs in comparison to road networks. We also design a new metric to evaluate matched candidates, which allows for potential data gaps and matching inaccuracies. Empirical validation under various real-world conditions, encompassing different vehicles and driving regions, demonstrates the efficacy of CAN-Trace: it achieves an attack success rate of up to 90.59% in the urban region, and 99.41% in the suburban region.
Chinese: 本文提出CAN-Trace,一种利用CAN总线数据重构行驶轨迹的新型隐私攻击方法,即使在数据存在缺失的情况下,仍能在城郊和市区实现高达99.41%和90.59%的攻击成功率。
English: This paper presents CAN-Trace, a novel privacy attack that reconstructs driving trajectories from CAN bus data, achieving high success rates in both urban and suburban areas despite potential data gaps.
Authors:Chenglin Zhu, Tao Zhang, Chong Li, Mingan Lin, Zenan Zhou, Jian Xie
Abstract:
Multimodal large language models (MLLMs) still perform poorly on scientific tasks, particularly those requiring multi-step and interpretable reasoning. Their limitations include insufficient scientific reasoning patterns, lack of global coherence in multi-step inference, and the absence of reflective self-correction, making them unreliable in structured scientific contexts. We introduce EduFlow, the first end-to-end framework that covers the full pipeline of educational scientific reasoning, including data selection, MCTS-based trajectory construction, model training, and output optimization. At its core is EduPRM, a process-aware reward model that critiques reasoning steps with tags and justifications. EduPRM is trained via curriculum learning on three complementary supervision sources: MCTS-guided trajectories, error-injected critiques, and teacher-student dialogues, enabling dynamic adaptation to multi-stage problem solving and iterative refinement during inference. We further propose EduMCTS, a domain-adapted search framework that introduces bootstrapping actions specifically designed for educational reasoning, such as a self-reflection mechanism that promotes reflective error correction. It further leverages EduPRM's fine-grained feedback to guide the search toward higher-quality reasoning trajectories. By applying self-consistency and rejection sampling, we constructed EduMCTS-160K, a large-scale dataset of educational reasoning trajectories. Extensive experiments demonstrate that EduFlow enhances reasoning consistency and coherence. Code, data, and models will be released.
Chinese: EduFlow 是一个端到端框架,通过整合过程感知奖励模型和领域自适应搜索机制,利用课程学习和反思性自我修正,提升多模态大语言模型在科学任务中的推理一致性和连贯性。
English: EduFlow is an end-to-end framework designed to improve multimodal large language models' performance on scientific tasks by integrating a process-aware reward model and a domain-adapted search mechanism, which enhances reasoning consistency and coherence through curriculum learning and reflective self-correction.
Authors:Mahsa Paknejad, Parisa Fard Moshiri, Murat Simsek, Burak Kantarci, Hussein T. Mouftah
Abstract:
Vehicular Mobile Edge Computing (VEC) drives the future by enabling low-latency, high-efficiency data processing at the very edge of vehicular networks. This drives innovation in key areas such as autonomous driving, intelligent transportation systems, and real-time analytics. Despite its potential, VEC faces significant challenges, particularly in adhering to strict task offloading deadlines, as vehicles remain within the coverage area of Roadside Units (RSUs) for only brief periods. To tackle this challenge, this paper evaluates the performance boundaries of task processing by initially establishing a theoretical limit using Particle Swarm Optimization (PSO) in a static environment. To address more dynamic and practical scenarios, PSO, Deep Q-Network (DQN), and Proximal Policy Optimization (PPO) models are implemented in an online setting. The objective is to minimize dropped tasks and reduce end-to-end (E2E) latency, covering both communication and computation delays. Experimental results demonstrate that the DQN model considerably surpasses the dynamic PSO approach, achieving a 99.2% reduction in execution time. Furthermore, It leads to a reduction in dropped tasks by 2.5% relative to dynamic PSO and achieves 18.6\% lower E2E latency, highlighting the effectiveness of Deep Reinforcement Learning (DRL) in enabling scalable and efficient task management for VEC systems.
Chinese: 本文针对车载移动边缘计算(VEC)中任务卸载时限的挑战,通过实施并比较粒子群优化(PSO)、深度Q网络(DQN)和近端策略优化(PPO)模型,结果显示DQN模型显著优于动态PSO,执行时间减少99.2%,任务丢弃率降低2.5%,端到端延迟下降18.6%。
English: This paper addresses the challenge of meeting task offloading deadlines in Vehicular Mobile Edge Computing (VEC) by implementing and comparing Particle Swarm Optimization (PSO), Deep Q-Network (DQN), and Proximal Policy Optimization (PPO) models, with results showing that the DQN model significantly outperforms dynamic PSO by reducing execution time by 99.2%, decreasing dropped tasks by 2.5%, and lowering end-to-end latency by 18.6%.
Authors:Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Hossein A. Rahmani, Daniel Campos, Jimmy Lin, Ellen M. Voorhees, Ian Soboroff
Abstract:
This is the fifth year of the TREC Deep Learning track. As in previous years, we leverage the MS MARCO datasets that made hundreds of thousands of human-annotated training labels available for both passage and document ranking tasks. We mostly repeated last year's design, to get another matching test set, based on the larger, cleaner, less-biased v2 passage and document set, with passage ranking as primary and document ranking as a secondary task (using labels inferred from passage). As we did last year, we sample from MS MARCO queries that were completely held out, unused in corpus construction, unlike the test queries in the first three years. This approach yields a more difficult test with more headroom for improvement. Alongside the usual MS MARCO (human) queries from MS MARCO, this year we generated synthetic queries using a fine-tuned T5 model and using a GPT-4 prompt.
The new headline result this year is that runs using Large Language Model (LLM) prompting in some way outperformed runs that use the "nnlm" approach, which was the best approach in the previous four years. Since this is the last year of the track, future iterations of prompt-based ranking can happen in other tracks. Human relevance assessments were applied to all query types, not just human MS MARCO queries. Evaluation using synthetic queries gave similar results to human queries, with system ordering agreement of $Ï=0.8487$. However, human effort was needed to select a subset of the synthetic queries that were usable. We did not see clear evidence of bias, where runs using GPT-4 were favored when evaluated using synthetic GPT-4 queries, or where runs using T5 were favored when evaluated on synthetic T5 queries.
中文: 第五年TREC深度学习赛道继续采用MS MARCO数据集并引入合成查询,基于大语言模型的方法超越了此前最优方法,在人工与合成查询评估中结果一致且未显现明显偏差。
English: The TREC Deep Learning track's fifth year continued using MS MARCO datasets with improved test sets and introduced synthetic queries, where LLM-based approaches surpassed previous top methods, showing consistent results across human and synthetic evaluations without clear bias.
Authors:Shaoran Yang, Dongyu Wei, Hanzhi Yu, Zhaohui Yang, Yuchen Liu, Mingzhe Chen
Abstract:
In this paper, a novel contrastive language-image pre-training (CLIP) model based semantic communication framework is designed. Compared to standard neural network (e.g.,convolutional neural network) based semantic encoders and decoders that require joint training over a common dataset, our CLIP model based method does not require any training procedures thus enabling a transmitter to extract data meanings of the original data without neural network model training, and the receiver to train a neural network for follow-up task implementation without the communications with the transmitter. Next, we investigate the deployment of the CLIP model based semantic framework over a noisy wireless network. Since the semantic information generated by the CLIP model is susceptible to wireless noise and the spectrum used for semantic information transmission is limited, it is necessary to jointly optimize CLIP model architecture and spectrum resource block (RB) allocation to maximize semantic communication performance while considering wireless noise, the delay and energy used for semantic communication. To achieve this goal, we use a proximal policy optimization (PPO) based reinforcement learning (RL) algorithm to learn how wireless noise affect the semantic communication performance thus finding optimal CLIP model and RB for each user. Simulation results show that our proposed method improves the convergence rate by up to 40%, and the accumulated reward by 4x compared to soft actor-critic.
中文: 本文提出了一种基于CLIP模型的语义通信框架,无需联合训练即可实现数据语义提取和任务执行,并采用PPO强化学习算法联合优化模型架构和频谱资源分配以应对无线噪声,显著提升了收敛速度和系统性能。
English: This paper introduces a CLIP-based semantic communication framework that eliminates the need for joint training, enabling efficient data meaning extraction and task implementation, and employs a PPO reinforcement learning algorithm to optimize model architecture and resource allocation against wireless noise, significantly enhancing convergence and performance.
Authors:Philip Osborne, Danilo S. Carvalho, André Freitas
Abstract:
We present elsciRL, an open-source Python library to facilitate the application of language solutions on reinforcement learning problems. We demonstrate the potential of our software by extending the Language Adapter with Self-Completing Instruction framework defined in (Osborne, 2024) with the use of LLMs. Our approach can be re-applied to new applications with minimal setup requirements. We provide a novel GUI that allows a user to provide text input for an LLM to generate instructions which it can then self-complete. Empirical results indicate that these instructions \textit{can} improve a reinforcement learning agent's performance. Therefore, we present this work to accelerate the evaluation of language solutions on reward based environments to enable new opportunities for scientific discovery.
中文:elsciRL是一个开源Python库,通过结合语言模型与强化学习,提供新型图形界面生成自完成指令,能以最低配置提升智能体性能,加速语言解决方案在奖励环境中的评估。
English: The elsciRL library is an open-source Python tool that integrates language models into reinforcement learning, featuring a novel GUI for generating self-completing instructions to potentially enhance agent performance with minimal setup.
Authors:Zach Eidex, Mojtaba Safari, Tonghe Wang, Vanessa Wildman, David S. Yu, Hui Mao, Erik Middlebrooks, Aparna Kesewala, Xiaofeng Yang
Abstract:
Purpose: Ultra-high-field 7T MRI offers improved resolution and contrast over standard clinical field strengths (1.5T, 3T). However, 7T scanners are costly, scarce, and introduce additional challenges such as susceptibility artifacts. We propose an efficient transformer-based model (7T-Restormer) to synthesize 7T-quality T1-maps from routine 1.5T or 3T T1-weighted (T1W) images. Methods: Our model was validated on 35 1.5T and 108 3T T1w MRI paired with corresponding 7T T1 maps of patients with confirmed MS. A total of 141 patient cases (32,128 slices) were randomly divided into 105 (25; 80) training cases (19,204 slices), 19 (5; 14) validation cases (3,476 slices), and 17 (5; 14) test cases (3,145 slices) where (X; Y) denotes the patients with 1.5T and 3T T1W scans, respectively. The synthetic 7T T1 maps were compared against the ResViT and ResShift models. Results: The 7T-Restormer model achieved a PSNR of 26.0 +/- 4.6 dB, SSIM of 0.861 +/- 0.072, and NMSE of 0.019 +/- 0.011 for 1.5T inputs, and 25.9 +/- 4.9 dB, and 0.866 +/- 0.077 for 3T inputs, respectively. Using 10.5 M parameters, our model reduced NMSE by 64 % relative to 56.7M parameter ResShift (0.019 vs 0.052, p = <.001 and by 41 % relative to 70.4M parameter ResViT (0.019 vs 0.032, p = <.001) at 1.5T, with similar advantages at 3T (0.021 vs 0.060 and 0.033; p < .001). Training with a mixed 1.5 T + 3 T corpus was superior to single-field strategies. Restricting the model to 1.5T increased the 1.5T NMSE from 0.019 to 0.021 (p = 1.1E-3) while training solely on 3T resulted in lower performance on input 1.5T T1W MRI. Conclusion: We propose a novel method for predicting quantitative 7T MP2RAGE maps from 1.5T and 3T T1W scans with higher quality than existing state-of-the-art methods. Our approach makes the benefits of 7T MRI more accessible to standard clinical workflows.
中文: 本研究提出基于Transformer的7T-Restormer模型,能够从常规1.5T或3T磁共振图像高效合成达到7T质量的T1图,其性能优于现有先进方法,使超高场磁共振的优势更易融入标准临床工作流程。
English: The study introduces 7T-Restormer, a transformer-based model that efficiently synthesizes high-quality 7T-like T1 maps from standard 1.5T or 3T MRI scans, outperforming existing methods while enhancing accessibility to ultra-high-field MRI benefits in clinical practice.
Authors:Tianrun Yu, Jiaqi Wang, Haoyu Wang, Mingquan Lin, Han Liu, Nelson S. Yee, Fenglong Ma
Abstract:
Collaborative fairness is a crucial challenge in federated learning. However, existing approaches often overlook a practical yet complex form of heterogeneity: imbalanced covariate shift. We provide a theoretical analysis of this setting, which motivates the design of FedAKD (Federated Asynchronous Knowledge Distillation)- simple yet effective approach that balances accurate prediction with collaborative fairness. FedAKD consists of client and server updates. In the client update, we introduce a novel asynchronous knowledge distillation strategy based on our preliminary analysis, which reveals that while correctly predicted samples exhibit similar feature distributions across clients, incorrectly predicted samples show significant variability. This suggests that imbalanced covariate shift primarily arises from misclassified samples. Leveraging this insight, our approach first applies traditional knowledge distillation to update client models while keeping the global model fixed. Next, we select correctly predicted high-confidence samples and update the global model using these samples while keeping client models fixed. The server update simply aggregates all client models. We further provide a theoretical proof of FedAKD's convergence. Experimental results on public datasets (FashionMNIST and CIFAR10) and a real-world Electronic Health Records (EHR) dataset demonstrate that FedAKD significantly improves collaborative fairness, enhances predictive accuracy, and fosters client participation even under highly heterogeneous data distributions.
中文: FedAKD通过异步知识蒸馏处理联邦学习中的不平衡协变量偏移,在异构数据分布下显著提升了预测准确性、协作公平性和客户端参与度。
English: FedAKD addresses collaborative fairness in federated learning by using asynchronous knowledge distillation to handle imbalanced covariate shifts, improving both prediction accuracy and fairness across heterogeneous data distributions.
Authors:Ando Shah, Rajveer Singh, Akram Zaytar, Girmaw Abebe Tadesse, Caleb Robinson, Negar Tafti, Stephen A. Wood, Rahul Dodhia, Juan M. Lavista Ferres
Abstract:
Rice cultivation consumes 24-30% of global freshwater, creating critical water management challenges in major rice-producing regions. Sustainable irrigation practices like direct seeded rice (DSR) and alternate wetting and drying (AWD) can reduce water use by 20-40% while maintaining yields, helping secure long-term agricultural productivity as water scarcity intensifies - a key component of the Zero Hunger Sustainable Development Goal. However, limited data on adoption rates of these practices prevents evidence-based policymaking and targeted resource allocation. We developed a novel remote sensing framework to monitor sustainable water management practices at scale in Punjab, India - a region facing severe groundwater depletion of 41.6 cm/year. To collect essential ground truth data, we partnered with the Nature Conservancy's Promoting Regenerative and No-burn Agriculture (PRANA) program, which trained approximately 1,400 farmers on water-saving techniques while documenting their field-level practices. Using this data, we created a classification system with Sentinel-1 satellite imagery that separates water management along sowing and irrigation dimensions. Our approach achieved a 78% F1-score in distinguishing DSR from traditional puddled transplanted rice without requiring prior knowledge of planting dates. We demonstrated scalability by mapping DSR adoption across approximately 3 million agricultural plots in Punjab, with district-level predictions showing strong correlation (Pearson=0.77, RBO= 0.77) with government records. This study provides policymakers with a powerful tool to track sustainable water management adoption, target interventions, and measure program impacts at scale.
中文: 本研究开发了一种新型遥感框架,用于监测印度旁遮普邦的可持续水资源管理实践,在识别节水技术方面达到78%的准确率,并能进行大规模测绘,为农业节水政策制定提供数据支持。
English: A novel remote sensing framework was developed to monitor sustainable water management practices in Punjab, India, achieving 78% accuracy in identifying water-saving techniques and enabling large-scale mapping to support evidence-based policymaking for agricultural water conservation.
Authors:Yang Xiao, Ting Dang, Rohan Kumar Das
Abstract:
Automatic speaker verification (ASV) systems are often affected by spoofing attacks. Recent transformer-based models have improved anti-spoofing performance by learning strong feature representations. However, these models usually need high computing power. To address this, we introduce RawTFNet, a lightweight CNN model designed for audio signals. The RawTFNet separates feature processing along time and frequency dimensions, which helps to capture the fine-grained details of synthetic speech. We tested RawTFNet on the ASVspoof 2021 LA and DF evaluation datasets. The results show that RawTFNet reaches comparable performance to that of the state-of-the-art models, while also using fewer computing resources. The code and models will be made publicly available.
中文: RawTFNet是一种轻量级CNN模型,通过分别处理时间和频率维度来有效检测说话人验证中的欺骗攻击,在降低计算需求的同时达到了先进性能水平。
English: RawTFNet is a lightweight CNN model that processes time and frequency dimensions separately to effectively detect spoofing attacks in speaker verification, achieving state-of-the-art performance with reduced computational demands.
Authors:Zhuang Qi, Ying-Peng Tang, Lei Meng, Han Yu, Xiaoxiao Li, Xiangxu Meng
Abstract:
Federated Class Incremental Learning (FCIL) aims to collaboratively process continuously increasing incoming tasks across multiple clients. Among various approaches, data replay has become a promising solution, which can alleviate forgetting by reintroducing representative samples from previous tasks. However, their performance is typically limited by class imbalance, both within the replay buffer due to limited global awareness and between replayed and newly arrived classes. To address this issue, we propose a class wise balancing data replay method for FCIL (FedCBDR), which employs a global coordination mechanism for class-level memory construction and reweights the learning objective to alleviate the aforementioned imbalances. Specifically, FedCBDR has two key components: 1) the global-perspective data replay module reconstructs global representations of prior task in a privacy-preserving manner, which then guides a class-aware and importance-sensitive sampling strategy to achieve balanced replay; 2) Subsequently, to handle class imbalance across tasks, the task aware temperature scaling module adaptively adjusts the temperature of logits at both class and instance levels based on task dynamics, which reduces the model's overconfidence in majority classes while enhancing its sensitivity to minority classes. Experimental results verified that FedCBDR achieves balanced class-wise sampling under heterogeneous data distributions and improves generalization under task imbalance between earlier and recent tasks, yielding a 2%-15% Top-1 accuracy improvement over six state-of-the-art methods.
Chinese: FedCBDR通过全局协调机制实现均衡数据回放和自适应温度调节,有效解决联邦类增量学习中的类别不平衡问题,相比现有方法准确率提升高达15%。
English: FedCBDR addresses class imbalance in Federated Class Incremental Learning by implementing a global coordination mechanism for balanced data replay and adaptive temperature scaling, achieving up to 15% higher accuracy than existing methods.
Authors:Sebastian Lotter, Elisabeth Mohr, Andrina Rutsch, Lukas Brand, Francesca Ronchi, Laura DÃaz-Marugán
Abstract:
Synthetic molecular communication (SMC) is a key enabler for future healthcare systems in which Internet of Bio-Nano-Things (IoBNT) devices facilitate the continuous monitoring of a patient's biochemical signals. To close the loop between sensing and actuation, both the detection and the generation of in-body molecular communication (MC) signals is key. However, generating signals inside the human body, e.g., via synthetic nanodevices, poses a challenge in SMC, due to technological obstacles as well as legal, safety, and ethical issues. Hence, this paper considers an SMC system in which signals are generated indirectly via the modulation of a natural in-body MC system, namely the gut-brain axis (GBA). Therapeutic GBA modulation is already established as treatment for neurological diseases, e.g., drug refractory epilepsy (DRE), and performed via the administration of nutritional supplements or specific diets. However, the molecular signaling pathways that mediate the effect of such treatments are mostly unknown. Consequently, existing treatments are standardized or designed heuristically and able to help only some patients while failing to help others. In this paper, we propose to leverage personal health data, e.g., gathered by in-body IoBNT devices, to design more versatile and robust GBA modulation-based treatments as compared to the existing ones. To show the feasibility of our approach, we define a catalog of theoretical requirements for therapeutic GBA modulation. Then, we propose a machine learning model to verify these requirements for practical scenarios when only limited data on the GBA modulation exists. By evaluating the proposed model on several datasets, we confirm its excellent accuracy in identifying different modulators of the GBA. Finally, we utilize the proposed model to identify specific modulatory pathways that play an important role for therapeutic GBA modulation.
中文摘要:本文提出了一种利用个人健康数据和机器学习调节肠脑轴的合成分子通信系统,旨在为神经系统疾病开发更有效和个性化的治疗方法。
English Summary: This paper proposes a synthetic molecular communication system that modulates the gut-brain axis using personal health data and machine learning to develop more effective and personalized treatments for neurological diseases.
Authors:Dongyu Wei, Xiaoren Xu, Yuchen Liu, H. Vincent Poor, Mingzhe Chen
Abstract:
In this paper, deceptive signal-assisted private split learning is investigated. In our model, several edge devices jointly perform collaborative training, and some eavesdroppers aim to collect the model and data information from devices. To prevent the eavesdroppers from collecting model and data information, a subset of devices can transmit deceptive signals. Therefore, it is necessary to determine the subset of devices used for deceptive signal transmission, the subset of model training devices, and the models assigned to each model training device. This problem is formulated as an optimization problem whose goal is to minimize the information leaked to eavesdroppers while meeting the model training energy consumption and delay constraints. To solve this problem, we propose a soft actor-critic deep reinforcement learning framework with intrinsic curiosity module and cross-attention (ICM-CA) that enables a centralized agent to determine the model training devices, the deceptive signal transmission devices, the transmit power, and sub-models assigned to each model training device without knowing the position and monitoring probability of eavesdroppers. The proposed method uses an ICM module to encourage the server to explore novel actions and states and a CA module to determine the importance of each historical state-action pair thus improving training efficiency. Simulation results demonstrate that the proposed method improves the convergence rate by up to 3x and reduces the information leaked to eavesdroppers by up to 13% compared to the traditional SAC algorithm.
中文: 本文提出了一种欺骗信号辅助的私有分割学习框架,采用带有内在好奇心模块和交叉注意力机制的软演员-评论家深度强化学习算法来优化设备选择和资源分配,相比传统方法实现了3倍的收敛速度提升和13%的信息泄露减少。
English: This paper proposes a deceptive signal-assisted private split learning framework that uses a soft actor-critic deep reinforcement learning algorithm with intrinsic curiosity and cross-attention modules to optimize device selection and resource allocation, achieving up to 3x faster convergence and 13% less information leakage compared to traditional methods.
Authors:Francesco Ferrini, Veronica Lachi, Antonio Longa, Bruno Lepri, Andrea Passerini
Abstract:
Graph Neural Networks (GNNs) often struggle to capture the link-specific structural patterns crucial for accurate link prediction, as their node-centric message-passing schemes overlook the subgraph structures connecting a pair of nodes. Existing methods to inject such structural context either incur high computational cost or rely on simplistic heuristics (e.g., common neighbor counts) that fail to model multi-hop dependencies. We introduce SP4LP (Shortest Path for Link Prediction), a novel framework that combines GNN-based node encodings with sequence modeling over shortest paths. Specifically, SP4LP first applies a GNN to compute representations for all nodes, then extracts the shortest path between each candidate node pair and processes the resulting sequence of node embeddings using a sequence model. This design enables SP4LP to capture expressive multi-hop relational patterns with computational efficiency. Empirically, SP4LP achieves state-of-the-art performance across link prediction benchmarks. Theoretically, we prove that SP4LP is strictly more expressive than standard message-passing GNNs and several state-of-the-art structural features methods, establishing it as a general and principled approach for link prediction in graphs.
中文: SP4LP是一种新颖的框架,通过将基于图神经网络的节点编码与最短路径的序列建模相结合,有效捕捉多跳依赖关系,在链接预测任务中实现了最优性能,并证明其表达能力超越现有方法。
English: SP4LP is a novel framework that enhances link prediction by integrating GNN-based node encodings with sequence modeling of shortest paths, capturing multi-hop dependencies efficiently and achieving state-of-the-art performance with proven superior expressiveness over existing methods.
Authors:Jiahao Chen, junhao li, Yiming Wang, Zhe Ma, Yi Jiang, Chunyi Zhou, Qingming Li, Tianyu Du, Shouling Ji
Abstract:
The proliferation of Low-Rank Adaptation (LoRA) models has democratized personalized text-to-image generation, enabling users to share lightweight models (e.g., personal portraits) on platforms like Civitai and Liblib. However, this "share-and-play" ecosystem introduces critical risks: benign LoRAs can be weaponized by adversaries to generate harmful content (e.g., political, defamatory imagery), undermining creator rights and platform safety. Existing defenses like concept-erasure methods focus on full diffusion models (DMs), neglecting LoRA's unique role as a modular adapter and its vulnerability to adversarial prompt engineering. To bridge this gap, we propose LoRAShield, the first data-free editing framework for securing LoRA models against misuse. Our platform-driven approach dynamically edits and realigns LoRA's weight subspace via adversarial optimization and semantic augmentation. Experimental results demonstrate that LoRAShield achieves remarkable effectiveness, efficiency, and robustness in blocking malicious generations without sacrificing the functionality of the benign task. By shifting the defense to platforms, LoRAShield enables secure, scalable sharing of personalized models, a critical step toward trustworthy generative ecosystems.
中文: LoRAShield作为首个无数据编辑框架,通过动态调整LoRA模型的权重子空间,在保持正常功能的同时有效阻止恶意内容生成,为生成式生态系统提供了可扩展的安全保障。
English: LoRAShield is a pioneering data-free framework that dynamically secures LoRA models against adversarial misuse by editing their weight subspace, effectively blocking harmful content generation while preserving benign functionality.
Authors:Wei Shen, Jiangbo Pei, Yi Peng, Xuchen Song, Yang Liu, Jian Peng, Haofeng Sun, Yunzhuo Hao, Peiyu Wang, Jianhao Zhang, Yahui Zhou
Abstract:
We introduce Skywork-R1V3, an advanced, open-source vision-language model (VLM) that pioneers a new approach to visual reasoning. Its key innovation lies in effectively transferring reasoning skills from text-only Large Language Models (LLMs) to visual tasks. The strong performance of Skywork-R1V3 primarily stems from our elaborate post-training RL framework, which effectively activates and enhances the model's reasoning ability, without the need for additional continue pre-training. Through this framework, we further uncover the fundamental role of the connector module in achieving robust cross-modal alignment for multimodal reasoning models. In addition, we introduce a unique indicator of reasoning capability, the entropy of critical reasoning tokens, which has proven highly effective for checkpoint selection during RL training. Skywork-R1V3 achieves state-of-the-art results on MMMU, significantly improving from 64.3% to 76.0%. This performance matches entry-level human capabilities. Remarkably, our RL-powered post-training approach enables even the 38B parameter model to rival top closed-source VLMs. The implementation successfully transfers mathematical reasoning to other subject-related reasoning tasks. We also include an analysis of curriculum learning and reinforcement finetuning strategies, along with a broader discussion on multimodal reasoning. Skywork-R1V3 represents a significant leap in multimodal reasoning, showcasing RL as a powerful engine for advancing open-source VLM capabilities.
中文: Skywork-R1V3作为开创性的开源视觉语言模型,通过创新的强化学习后训练框架将纯文本大语言模型的推理能力成功迁移到视觉任务,其38B参数版本即可媲美顶尖闭源模型,在多模态推理领域实现重大突破。
English: Skywork-R1V3 is a groundbreaking open-source vision-language model that achieves state-of-the-art performance by transferring reasoning skills from text-only LLMs through an innovative RL post-training framework, enabling even its 38B parameter version to rival top closed-source models.
Authors:Neelay Joglekar, Fei Liu, Florian Richter, Michael C. Yip
Abstract:
Remote Center of Motion (RCM) robotic manipulators have revolutionized Minimally Invasive Surgery, enabling precise, dexterous surgical manipulation within the patient's body cavity without disturbing the insertion point on the patient. Accurate RCM tool control is vital for incorporating autonomous subtasks like suturing, blood suction, and tumor resection into robotic surgical procedures, reducing surgeon fatigue and improving patient outcomes. However, these cable-driven systems are subject to significant joint reading errors, corrupting the kinematics computation necessary to perform control. Although visual tracking with endoscopic cameras can correct errors on in-view joints, errors in the kinematic chain prior to the insertion point are irreparable because they remain out of view. No prior work has characterized the stability of control under these conditions. We fill this gap by designing a provably stable tracking-in-the-loop controller for the out-of-view portion of the RCM manipulator kinematic chain. We additionally incorporate this controller into a bilevel control scheme for the full kinematic chain. We rigorously benchmark our method in simulated and real world settings to verify our theoretical findings. Our work provides key insights into the next steps required for the transition from teleoperated to autonomous surgery.
中文: 本研究针对远程运动中心(RCM)机械臂在视野外部分存在的关节读数误差问题,设计了一种可证明稳定的闭环跟踪控制器,通过仿真和实际实验验证了其有效性,为推进自主手术技术提供了关键解决方案。
English: The study introduces a provably stable tracking-in-the-loop controller for the out-of-view portion of Remote Center of Motion (RCM) manipulators, addressing joint reading errors that compromise kinematic control, and validates its effectiveness through simulations and real-world experiments to advance autonomous surgical capabilities.
Authors:Feng Xia, Ciyuan Peng, Jing Ren, Falih Gozi Febrinanto, Renqiang Luo, Vidya Saikrishna, Shuo Yu, Xiangjie Kong
Abstract:
Graph learning has rapidly evolved into a critical subfield of machine learning and artificial intelligence (AI). Its development began with early graph-theoretic methods, gaining significant momentum with the advent of graph neural networks (GNNs). Over the past decade, progress in scalable architectures, dynamic graph modeling, multimodal learning, generative AI, explainable AI (XAI), and responsible AI has broadened the applicability of graph learning to various challenging environments. Graph learning is significant due to its ability to model complex, non-Euclidean relationships that traditional machine learning struggles to capture, thus better supporting real-world applications ranging from drug discovery and fraud detection to recommender systems and scientific reasoning. However, challenges like scalability, generalization, heterogeneity, interpretability, and trustworthiness must be addressed to unlock its full potential. This survey provides a comprehensive introduction to graph learning, focusing on key dimensions including scalable, temporal, multimodal, generative, explainable, and responsible graph learning. We review state-of-the-art techniques for efficiently handling large-scale graphs, capturing dynamic temporal dependencies, integrating heterogeneous data modalities, generating novel graph samples, and enhancing interpretability to foster trust and transparency. We also explore ethical considerations, such as privacy and fairness, to ensure responsible deployment of graph learning models. Additionally, we identify and discuss emerging topics, highlighting recent integration of graph learning and other AI paradigms and offering insights into future directions. This survey serves as a valuable resource for researchers and practitioners seeking to navigate the rapidly evolving landscape of graph learning.
中文摘要:图学习已成为人工智能的关键分支领域,能够建模传统机器学习难以处理的复杂非欧几里得关系以支持实际应用,尽管面临可扩展性和可解释性等挑战,本综述通过前沿技术和伦理考量全面探讨了这些议题。
English Summary: Graph learning has become a vital AI subfield that models complex non-Euclidean relationships for real-world applications, though it faces challenges in scalability and interpretability that this comprehensive survey addresses through cutting-edge techniques and ethical considerations.
Authors:Gehao Zhang, Eugene Bagdasarian, Juan Zhai, Shiqing Ma
Abstract:
Distinguishing AI-generated code from human-written code is becoming crucial for tasks such as authorship attribution, content tracking, and misuse detection. Based on this, N-gram-based watermarking schemes have emerged as prominent, which inject secret watermarks to be detected during the generation.
However, their robustness in code content remains insufficiently evaluated. Most claims rely solely on defenses against simple code transformations or code optimizations as a simulation of attack, creating a questionable sense of robustness. In contrast, more sophisticated schemes already exist in the software engineering world, e.g., code obfuscation, which significantly alters code while preserving functionality. Although obfuscation is commonly used to protect intellectual property or evade software scanners, the robustness of code watermarking techniques against such transformations remains largely unexplored.
In this work, we formally model the code obfuscation and prove the impossibility of N-gram-based watermarking's robustness with only one intuitive and experimentally verified assumption, distribution consistency, satisfied. Given the original false positive rate of the watermarking detection, the ratio that the detector failed on the watermarked code after obfuscation will increase to 1 - fpr.
The experiments have been performed on three SOTA watermarking schemes, two LLMs, two programming languages, four code benchmarks, and four obfuscators. Among them, all watermarking detectors show coin-flipping detection abilities on obfuscated codes (AUROC tightly surrounds 0.5). Among all models, watermarking schemes, and datasets, both programming languages own obfuscators that can achieve attack effects with no detection AUROC higher than 0.6 after the attack. Based on the theoretical and practical observations, we also proposed a potential path of robust code watermarking.
This study demonstrates that N-gram-based watermarking schemes for AI-generated code are fundamentally vulnerable to code obfuscation attacks, with theoretical proofs and extensive experiments showing detection performance degrades to random chance levels across multiple programming languages and state-of-the-art watermarking methods.
English Summary:
Authors:Ashwin Rao, Sze Yuh Nina Wang, Kristina Lerman
Abstract:
The U.S. Supreme Court's 2022 ruling in Dobbs v. Jackson Women's Health Organization marked a turning point in the national debate over reproductive rights. While the ideological divide over abortion is well documented, less is known about how gender and local sociopolitical contexts interact to shape public discourse. Drawing on nearly 10 million abortion-related posts on X (formerly Twitter) from users with inferred gender, ideology and location, we show that gender significantly moderates abortion attitudes and emotional expression, particularly in conservative regions, and independently of ideology. This creates a gender gap in abortion attitudes that grows more pronounced in conservative regions. The leak of the Dobbs draft opinion further intensified online engagement, disproportionately mobilizing pro-abortion women in areas where access was under threat. These findings reveal that abortion discourse is not only ideologically polarized but also deeply structured by gender and place, highlighting the central role of identity in shaping political expression during moments of institutional disruption.
中文: 2022年多布斯案裁决激化了网络堕胎议题讨论,研究表明性别与地域背景显著影响立场表达,保守地区性别分歧尤为突出,在堕胎权受威胁区域支持堕权的女性群体参与度急剧上升。
English: The 2022 Dobbs ruling intensified online abortion discourse, revealing that gender and geographic context significantly shape attitudes and emotional expression, with conservative regions showing a widening gender gap and heightened mobilization of pro-abortion women where access was threatened.
Authors:Xuanqi Gao, Juan Zhai, Shiqing Ma, Siyi Xie, Chao Shen
Abstract:
The integration of Large Language Models (LLMs) into browser extensions has revolutionized web browsing, enabling sophisticated functionalities like content summarization, intelligent translation, and context-aware writing assistance. However, these AI-powered extensions introduce unprecedented challenges in testing and reliability assurance. Traditional browser extension testing approaches fail to address the non-deterministic behavior, context-sensitivity, and complex web environment integration inherent to LLM-powered extensions. Similarly, existing LLM testing methodologies operate in isolation from browser-specific contexts, creating a critical gap in effective evaluation frameworks. To bridge this gap, we present ASSURE, a modular automated testing framework specifically designed for AI-powered browser extensions. ASSURE comprises three principal components: (1) a modular test case generation engine that supports plugin-based extension of testing scenarios, (2) an automated execution framework that orchestrates the complex interactions between web content, extension processing, and AI model behavior, and (3) a configurable validation pipeline that systematically evaluates behavioral consistency and security invariants rather than relying on exact output matching. Our evaluation across six widely-used AI browser extensions demonstrates ASSURE's effectiveness, identifying 531 distinct issues spanning security vulnerabilities, metamorphic relation violations, and content alignment problems. ASSURE achieves 6.4x improved testing throughput compared to manual approaches, detecting critical security vulnerabilities within 12.4 minutes on average. This efficiency makes ASSURE practical for integration into development pipelines, offering a comprehensive solution to the unique challenges of testing AI-powered browser extensions.
中文:ASSURE框架通过自动化、模块化测试解决AI浏览器扩展的独特测试难题,评估行为一致性和安全性,相比人工方法显著提升了效率和问题检测能力。
English: The ASSURE framework addresses the unique testing challenges of AI-powered browser extensions by providing automated, modular testing that evaluates behavioral consistency and security, significantly improving efficiency and issue detection over manual methods.
Authors:Chenchen Zhang, Yuhang Li, Can Xu, Jiaheng Liu, Ao Liu, Shihui Hu, Dengpeng Wu, Guanhua Huang, Kejiao Li, Qi Yi, Ruibin Xiong, Haotian Zhu, Yuanxing Zhang, Yuhao Jiang, Yue Zhang, Zenan Xu, Bohui Zhai, Guoxiang He, Hebin Li, Jie Zhao, Le Zhang, Lingyun Tan, Pengyu Guo, Xianshu Pang, Yang Ruan, Zhifeng Zhang, Zhonghu Wang, Ziyan Xu, Zuopu Yin, Wiggin Zhou, Chayse Zhou, Fengzong Lian
Abstract:
The generative capabilities of Large Language Models (LLMs) are rapidly expanding from static code to dynamic, interactive visual artifacts. This progress is bottlenecked by a critical evaluation gap: established benchmarks focus on algorithmic correctness and are blind to the visual fidelity and interactive integrity that define modern user experiences. To bridge this gap, we introduce ArtifactsBench, a new benchmark and paradigm for the automated, multimodal evaluation of visual code generation. Our framework programmatically renders each generated artifact and captures its dynamic behavior through temporal screenshots. This visual evidence, alongside the source code, is then assessed by a Multimodal LLM (MLLM)-as-Judge, which is rigorously guided by a fine-grained, per-task checklist to ensure holistic and reproducible scoring. We construct a new benchmark of 1,825 diverse tasks and evaluate over 30 leading LLMs. Our automated evaluation achieves a striking 94.4% ranking consistency with WebDev Arena, the gold-standard for human preference in web development, and over 90% pairwise agreement with human experts. This establishes ArtifactsBench as the first framework to reliably automate the assessment of human-perceived quality at scale. Our analysis provides a high-resolution map of the current SOTA, revealing that generalist models often outperform domain-specific ones. We open-source ArtifactsBench, including the benchmark, evaluation harness, and baseline results at https://artifactsbenchmark.github.io/, to provide the community with a scalable and accurate tool to accelerate the development of user-centric generative models.
中文: ArtifactsBench作为新的基准和框架,通过时序截图和多模态大语言模型评估,实现了与人类判断高度一致的全自动视觉代码生成评估,并揭示了当前生成模型的技术现状。
English: ArtifactsBench is introduced as a new benchmark and framework for automated, multimodal evaluation of visual code generation, using temporal screenshots and MLLM-as-Judge to achieve high consistency with human assessments and map the current state of the art in generative models.
Authors:Chenchen Zhang, Yuhang Li, Can Xu, Jiaheng Liu, Ao Liu, Changzhi Zhou, Ken Deng, Dengpeng Wu, Guanhua Huang, Kejiao Li, Qi Yi, Ruibin Xiong, Shihui Hu, Yue Zhang, Yuhao Jiang, Zenan Xu, Yuanxing Zhang, Wiggin Zhou, Chayse Zhou, Fengzong Lian
Abstract:
The generative capabilities of Large Language Models (LLMs) are rapidly expanding from static code to dynamic, interactive visual artifacts. This progress is bottlenecked by a critical evaluation gap: established benchmarks focus on algorithmic correctness and are blind to the visual fidelity and interactive integrity that define modern user experiences. To bridge this gap, we introduce ArtifactsBench, a new benchmark and paradigm for the automated, multimodal evaluation of visual code generation. Our framework programmatically renders each generated artifact and captures its dynamic behavior through temporal screenshots. This visual evidence, alongside the source code, is then assessed by a Multimodal LLM (MLLM)-as-Judge, which is rigorously guided by a fine-grained, per-task checklist to ensure holistic and reproducible scoring. We construct a new benchmark of 1,825 diverse tasks and evaluate over 30 leading LLMs. Our automated evaluation achieves a striking 94.4% ranking consistency with WebDev Arena, the gold-standard for human preference in web development, and over 90% pairwise agreement with human experts. This establishes ArtifactsBench as the first framework to reliably automate the assessment of human-perceived quality at scale. Our analysis provides a high-resolution map of the current SOTA, revealing that generalist models often outperform domain-specific ones. We open-source ArtifactsBench, including the benchmark, evaluation harness, and baseline results at https://artifactsbenchmark.github.io/, to provide the community with a scalable and accurate tool to accelerate the development of user-centric generative models.
中文: ArtifactsBench作为新的基准和框架,通过时序截图和多模态大语言模型评估,实现了与人类判断高度一致的全自动视觉代码生成评估,并揭示了当前生成模型的技术现状。
English: ArtifactsBench is introduced as a new benchmark and framework for automated, multimodal evaluation of visual code generation, using temporal screenshots and MLLM-as-Judge to achieve high consistency with human assessments and map the current state of the art in generative models.
Authors:Jun Hu, Yufei He, Yuan Li, Bryan Hooi, Bingsheng He
Abstract:
Cold-start node classification on multimodal graphs is challenging because cold-start nodes are isolated (i.e., no edges) and often have missing modalities (e.g., absent text or image features). Existing methods address structural isolation by degrading graph learning models to MLPs for cold-start inference, using a teacher model (with graph access) to guide the MLP. However, this results in limited model capacity in the student, which is further challenged when modalities are missing. In this paper, we propose Neighbor-to-Self Graph Transformer (NTSFormer), a unified Graph Transformer framework that jointly tackles the isolation and missing-modality issues via a self-teaching paradigm. Specifically, NTSFormer uses a cold-start attention mask to simultaneously make two predictions for each node: a "student" prediction based only on self-information (i.e., the node's own features), and a "teacher" prediction incorporating both self and neighbor information. This enables the model to supervise itself without degrading to an MLP, thereby fully leveraging the Transformer's capacity to handle missing modalities. To handle diverse graph information and missing modalities, NTSFormer performs a one-time multimodal graph pre-computation that converts structural and feature data into token sequences, which are then processed by a Mixture-of-Experts (MoE) Input Projection and Transformer layers for effective fusion. Experimental results on public datasets show that NTSFormer achieves superior performance on multimodal cold-start node classification tasks.
中文: 本文提出的NTSFormer通过自教学范式,同时利用孤立节点特征生成学生预测和结合邻居信息的教师预测,采用多模态标记化与专家混合处理,有效解决了多模态图中冷启动节点的结构孤立和模态缺失问题。
English: The proposed Neighbor-to-Self Graph Transformer (NTSFormer) addresses cold-start node classification challenges by employing a self-teaching paradigm that simultaneously generates student predictions from isolated node features and teacher predictions incorporating neighbor information, effectively handling both structural isolation and missing modalities through multimodal tokenization and Mixture-of-Experts processing.
Authors:Chinmay Prabhakar, Suprosanna Shit, Tamaz Amiranashvili, Hongwei Bran Li, Bjoern Menze
Abstract:
3D spatial graphs play a crucial role in biological and clinical research by modeling anatomical networks such as blood vessels,neurons, and airways. However, generating 3D biological graphs while maintaining anatomical validity remains challenging, a key limitation of existing diffusion-based methods. In this work, we propose a novel 3D biological graph generation method that adheres to structural and semantic plausibility conditions. We achieve this by using a novel projection operator during sampling that stochastically fixes inconsistencies. Further, we adopt a superior edge-deletion-based noising procedure suitable for sparse biological graphs. Our method demonstrates superior performance on two real-world datasets, human circle of Willis and lung airways, compared to previous approaches. Importantly, we demonstrate that the generated samples significantly enhance downstream graph labeling performance. Furthermore, we show that our generative model is a reasonable out-of-the-box link predictior.
中文: 本研究提出了一种新颖的三维生物图生成方法,通过随机投影算子和基于边删除的噪声处理确保解剖有效性,在生成真实血管与气道网络方面表现优异,并能显著提升下游任务性能。
English: This study introduces a novel 3D biological graph generation method that ensures anatomical validity through a stochastic projection operator and edge-deletion-based noising, demonstrating superior performance in generating realistic vascular and airway networks while enhancing downstream tasks.
Authors:Jing Liang, Kasun Weerakoon, Daeun Song, Senthurbavan Kirubaharan, Xuesu Xiao, Dinesh Manocha
Abstract:
We present MOSU, a novel autonomous long-range navigation system that enhances global navigation for mobile robots through multimodal perception and on-road scene understanding. MOSU addresses the outdoor robot navigation challenge by integrating geometric, semantic, and contextual information to ensure comprehensive scene understanding. The system combines GPS and QGIS map-based routing for high-level global path planning and multi-modal trajectory generation for local navigation refinement. For trajectory generation, MOSU leverages multi-modalities: LiDAR-based geometric data for precise obstacle avoidance, image-based semantic segmentation for traversability assessment, and Vision-Language Models (VLMs) to capture social context and enable the robot to adhere to social norms in complex environments. This multi-modal integration improves scene understanding and enhances traversability, allowing the robot to adapt to diverse outdoor conditions. We evaluate our system in real-world on-road environments and benchmark it on the GND dataset, achieving a 10% improvement in traversability on navigable terrains while maintaining a comparable navigation distance to existing global navigation methods.
Chinese: MOSU是一种创新的移动机器人自主导航系统,通过融合激光雷达、语义分割和视觉语言模型等多模态感知技术,改进了全局路径规划和局部轨迹优化,在户外环境中实现了可通行性10%的提升。
English: MOSU is an innovative autonomous navigation system for mobile robots that integrates multimodal perception, including LiDAR, semantic segmentation, and vision-language models, to enhance global path planning and local trajectory refinement, achieving a 10% improvement in traversability on outdoor terrains.
Authors:Yusong Zhang, Yuxuan Sun, Lei Guo, Wei Chen, Bo Ai, Deniz Gunduz
Abstract:
6G networks promise revolutionary immersive communication experiences including augmented reality (AR), virtual reality (VR), and holographic communications. These applications demand high-dimensional multimodal data transmission and intelligent data processing in real-time, which is extremely challenging over resource-limited wireless communication systems. Moreover, a joint understanding of the environment, context, and user intent is essential to deliver task-relevant content effectively. This article presents a novel multimodal large language model (MLLM) integrated semantic communications framework, termed MLLM-SC, which fully leverages reasoning and generative capabilities of pre-trained foundation models for context-aware and task-oriented wireless communication. The MLLM-SC framework adopts a device-edge collaborative architecture. At the edge, MLLM-empowered semantic guidance module analyzes multimodal inputs, user intents, and channel conditions to generate importance-aware attention maps prioritizing semantically critical information. An importance-aware semantic encoder and a resource-adaptive semantic decoder are jointly designed and optimized, which can utilize the semantic guidance for adaptive bandwidth allocation and high-quality content reconstruction or generation. Extensive case studies on visual question answering for AR/VR applications and diffusion-driven image generation validate the effectiveness of MLLM-SC.
Chinese: MLLM-SC框架利用多模态大语言模型实现6G网络中智能、情境感知的语义通信,通过端边协同优化资源分配,显著提升AR/VR等沉浸式应用体验。
English: The MLLM-SC framework leverages multimodal large language models to enable intelligent, context-aware semantic communications for 6G networks, optimizing resource allocation and enhancing immersive applications like AR/VR through device-edge collaboration.
Authors:Clarissa Lauditi, Blake Bordelon, Cengiz Pehlevan
Abstract:
We develop a theory of transfer learning in infinitely wide neural networks where both the pretraining (source) and downstream (target) task can operate in a feature learning regime. We analyze both the Bayesian framework, where learning is described by a posterior distribution over the weights, and gradient flow training of randomly initialized networks trained with weight decay. Both settings track how representations evolve in both source and target tasks. The summary statistics of these theories are adapted feature kernels which, after transfer learning, depend on data and labels from both source and target tasks. Reuse of features during transfer learning is controlled by an elastic weight coupling which controls the reliance of the network on features learned during training on the source task. We apply our theory to linear and polynomial regression tasks as well as real datasets. Our theory and experiments reveal interesting interplays between elastic weight coupling, feature learning strength, dataset size, and source and target task alignment on the utility of transfer learning.
中文: 本研究提出了无限宽神经网络中的迁移学习理论,通过贝叶斯和梯度流框架分析预训练与下游任务中特征表示的演变,揭示了弹性权重耦合与任务对齐如何影响特征重用。
English: This study presents a theory of transfer learning in infinitely wide neural networks, analyzing how feature representations evolve during pretraining and downstream tasks through Bayesian and gradient flow frameworks, revealing how elastic weight coupling and task alignment impact feature reuse.
Authors:Hebei Li, Yansong Peng, Jiahui Yuan, Peixi Wu, Jin Wang, Yueyi Zhang, Xiaoyan Sun
Abstract:
Event cameras have recently been introduced into image semantic segmentation, owing to their high temporal resolution and other advantageous properties. However, existing event-based semantic segmentation methods often fail to fully exploit the complementary information provided by frames and events, resulting in complex training strategies and increased computational costs. To address these challenges, we propose an efficient hybrid framework for image semantic segmentation, comprising a Spiking Neural Network branch for events and an Artificial Neural Network branch for frames. Specifically, we introduce three specialized modules to facilitate the interaction between these two branches: the Adaptive Temporal Weighting (ATW) Injector, the Event-Driven Sparse (EDS) Injector, and the Channel Selection Fusion (CSF) module. The ATW Injector dynamically integrates temporal features from event data into frame features, enhancing segmentation accuracy by leveraging critical dynamic temporal information. The EDS Injector effectively combines sparse event data with rich frame features, ensuring precise temporal and spatial information alignment. The CSF module selectively merges these features to optimize segmentation performance. Experimental results demonstrate that our framework not only achieves state-of-the-art accuracy across the DDD17-Seg, DSEC-Semantic, and M3ED-Semantic datasets but also significantly reduces energy consumption, achieving a 65\% reduction on the DSEC-Semantic dataset.
中文: 本文提出了一种高效的图像语义分割混合框架,通过专用模块整合脉冲与人工神经网络,在基准数据集上实现了最优精度和65%的能耗降低。
English: This paper introduces an efficient hybrid framework for image semantic segmentation that integrates spiking and artificial neural networks through specialized modules, achieving state-of-the-art accuracy and a 65% energy reduction on benchmark datasets.
Authors:Nura Aljaafari, Danilo S. Carvalho, André Freitas
Abstract:
Understanding when and how linguistic knowledge emerges during language model training remains a central challenge for interpretability. Most existing tools are post hoc, rely on scalar metrics, or require nontrivial integration effort, making comprehensive interpretability analysis difficult to deploy and maintain. We introduce TRACE, a modular toolkit for training and inference-time interpretability analysis of transformer models. It enables lightweight, in-training analysis of linguistic and representational signals, including features probing, intrinsic dimensionality, Hessian curvature, and output diagnostics. It integrates with ABSynth, a controllable synthetic corpus generator that provides structured annotations for precise evaluation of linguistic feature acquisition. Experiments with autoregressive transformers demonstrate that TRACE reveals developmental phenomena such as early syntactic emergence, delayed semantic acquisition, and representational compression, signals overlooked by traditional scalar metrics such as loss or accuracy. With minimal integration effort, the tool enables layer-wise diagnostics, convergence-based early stopping, and detection of structural errors, making transformer analysis interpretable, actionable, and reproducible.
Chinese: TRACE是一个模块化工具包,可在训练过程中对Transformer模型进行轻量级可解释性分析,通过集成ABSynth等工具精确评估语言特征习得,揭示传统指标忽略的句法早期涌现、语义延迟习得等发展现象。
English: TRACE is a modular toolkit that enables lightweight, in-training interpretability analysis of transformer models, revealing developmental phenomena like early syntactic emergence and delayed semantic acquisition through integrated tools such as ABSynth for precise linguistic evaluation.
Authors:Daniel Dittler, Peter Frank, Gary Hildebrandt, Luisa Peterson, Nasser Jazdi, Michael Weyrich
Abstract:
Offshore Power-to-X platforms enable flexible conversion of renewable energy, but place high demands on adaptive process control due to volatile operating conditions. To face this challenge, using Digital Twins in Power-to-X platforms is a promising approach. Comprehensive knowledge integration in Digital Twins requires the combination of heterogeneous models and a structured representation of model information. The proposed approach uses a standardized description of behavior models, semantic technologies and a graph-based model understanding to enable automatic adaption and selection of suitable models. It is implemented using a graph-based knowledge representation with Neo4j, automatic data extraction from Asset Administration Shells and port matching to ensure compatible model configurations.
中文摘要:该方法在数字孪生中采用标准化行为模型、语义技术和基于图的理解,通过Neo4j知识图谱和自动化数据集成,实现海上Power-to-X平台的自适应调控与模型匹配。
English Summary: The approach utilizes standardized behavior models, semantic technologies, and graph-based understanding in Digital Twins to enable automatic adaptation and model selection for offshore Power-to-X platforms, implemented through Neo4j knowledge graphs and automated data integration.
Authors:Johannes Sigel, Daniel Dittler, Nasser Jazdi, Michael Weyrich
Abstract:
Achieving greater autonomy in automation systems is crucial for handling unforeseen situations effectively. However, this remains challenging due to technological limitations and the complexity of real-world environments. This paper examines the need for increased autonomy, defines the problem, and outlines key enabling technologies. A structured concept is proposed, consisting of three main steps: context enrichment, situation analysis, and generation of solution strategies. By following this approach, automation systems can make more independent decisions, reducing the need for human intervention. Additionally, possible realizations of the concept are discussed, especially the use of Large Language Models. While certain tasks may still require human assistance, the proposed approach significantly enhances the autonomy of automation systems, enabling more adaptive and intelligent problem-solving capabilities.
中文摘要:本文提出一个包含情境丰富、态势分析和解决方案生成的三步结构化概念,通过利用大语言模型等技术增强自动化系统自主性,在减少人工干预的同时实现自适应问题解决能力。
English Summary: This paper proposes a structured three-step concept to enhance automation system autonomy through context enrichment, situation analysis, and solution generation, leveraging technologies like Large Language Models to reduce human intervention while enabling adaptive problem-solving.
Authors:Zhiwen Tan, Jiaming Huang, Qintong Wu, Hongxuan Zhang, Chenyi Zhuang, Jinjie Gu
Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, while LLMs remain prone to generating hallucinated or outdated responses due to their static internal knowledge. Recent advancements in Retrieval-Augmented Generation (RAG) methods have aimed to enhance models' search and reasoning capabilities through reinforcement learning (RL). Although these methods demonstrate promising results, they face challenges in training stability and encounter issues such as substantial inference time and restricted capabilities due to reliance on single-query mode. In this paper, we propose RAG-R1, a novel training framework designed to enable LLMs to adaptively leverage internal and external knowledge during the reasoning process. We further expand the generation and retrieval processes within the framework from single-query mode to multi-query parallelism, with the aim of reducing inference time and enhancing the model's capabilities. Extensive experiments on seven question-answering benchmarks demonstrate that our method outperforms the strongest baseline by up to 13.2% and decreases inference time by 11.1%.
Large Language Models face issues with hallucinations and outdated knowledge, but the proposed RAG-R1 framework enhances their reasoning by adaptively combining internal and external knowledge through multi-query parallelism, improving accuracy by up to 13.2% and reducing inference time by 11.1%.
English Summary:
Authors:Renhao Wang, Haoran Geng, Tingle Li, Feishi Wang, Gopala Anumanchipalli, Trevor Darrell, Boyi Li, Pieter Abbeel, Jitendra Malik, Alexei A. Efros
Abstract:
Robots must integrate multiple sensory modalities to act effectively in the real world. Yet, learning such multimodal policies at scale remains challenging. Simulation offers a viable solution, but while vision has benefited from high-fidelity simulators, other modalities (e.g. sound) can be notoriously difficult to simulate. As a result, sim-to-real transfer has succeeded primarily in vision-based tasks, with multimodal transfer still largely unrealized. In this work, we tackle these challenges by introducing MultiGen, a framework that integrates large-scale generative models into traditional physics simulators, enabling multisensory simulation. We showcase our framework on the dynamic task of robot pouring, which inherently relies on multimodal feedback. By synthesizing realistic audio conditioned on simulation video, our method enables training on rich audiovisual trajectories -- without any real robot data. We demonstrate effective zero-shot transfer to real-world pouring with novel containers and liquids, highlighting the potential of generative modeling to both simulate hard-to-model modalities and close the multimodal sim-to-real gap.
中文摘要:MultiGen框架将生成模型与物理模拟器结合,实现了多感官模拟,使机器人能够在无需真实数据的情况下,通过视听策略完成如倒水等任务并实现零样本迁移。
English Summary: The MultiGen framework integrates generative models with physics simulators to create multisensory simulations, enabling zero-shot transfer of audiovisual robotic policies to real-world tasks like pouring without requiring actual robot data.
Authors:Jiawei He, Danshi Li, Xinqiang Yu, Zekun Qi, Wenyao Zhang, Jiayi Chen, Zhaoxiang Zhang, Zhizheng Zhang, Li Yi, He Wang
Abstract:
As large models gain traction, vision-language-action (VLA) systems are enabling robots to tackle increasingly complex tasks. However, limited by the difficulty of data collection, progress has mainly focused on controlling simple gripper end-effectors. There is little research on functional grasping with large models for human-like dexterous hands. In this paper, we introduce DexVLG, a large Vision-Language-Grasp model for Dexterous grasp pose prediction aligned with language instructions using single-view RGBD input. To accomplish this, we generate a dataset of 170 million dexterous grasp poses mapped to semantic parts across 174,000 objects in simulation, paired with detailed part-level captions. This large-scale dataset, named DexGraspNet 3.0, is used to train a VLM and flow-matching-based pose head capable of producing instruction-aligned grasp poses for tabletop objects. To assess DexVLG's performance, we create benchmarks in physics-based simulations and conduct real-world experiments. Extensive testing demonstrates DexVLG's strong zero-shot generalization capabilities-achieving over 76% zero-shot execution success rate and state-of-the-art part-grasp accuracy in simulation-and successful part-aligned grasps on physical objects in real-world scenarios.
中文: DexVLG是一种大型视觉-语言-抓取模型,通过单视角RGBD输入预测符合语言指令的灵巧抓取姿态,基于1.7亿个抓取姿态的数据集训练,在仿真和现实实验中展现出超过76%的零样本执行成功率的强泛化能力。
English: DexVLG is a large vision-language-grasp model that predicts dexterous grasp poses aligned with language instructions using single-view RGBD input, trained on a 170-million-pose dataset and demonstrating strong zero-shot generalization with over 76% success rate in simulation and real-world experiments.
Authors:Zilin Kang, Chenyuan Hu, Yu Luo, Zhecheng Yuan, Ruijie Zheng, Huazhe Xu
Abstract:
Deep reinforcement learning for continuous control has recently achieved impressive progress. However, existing methods often suffer from primacy bias, a tendency to overfit early experiences stored in the replay buffer, which limits an RL agent's sample efficiency and generalizability. In contrast, humans are less susceptible to such bias, partly due to infantile amnesia, where the formation of new neurons disrupts early memory traces, leading to the forgetting of initial experiences. Inspired by this dual processes of forgetting and growing in neuroscience, in this paper, we propose Forget and Grow (FoG), a new deep RL algorithm with two mechanisms introduced. First, Experience Replay Decay (ER Decay) "forgetting early experience", which balances memory by gradually reducing the influence of early experiences. Second, Network Expansion, "growing neural capacity", which enhances agents' capability to exploit the patterns of existing data by dynamically adding new parameters during training. Empirical results on four major continuous control benchmarks with more than 40 tasks demonstrate the superior performance of FoG against SoTA existing deep RL algorithms, including BRO, SimBa, and TD-MPC2.
Chinese: 提出的"遗忘与成长"(FoG)算法通过经验回放衰减和动态网络扩展机制,解决了深度强化学习中的初始偏差问题,在连续控制基准测试中展现出优于现有先进方法的性能。
English: The proposed Forget and Grow (FoG) algorithm addresses primacy bias in deep reinforcement learning by incorporating experience replay decay and dynamic network expansion, demonstrating superior performance on continuous control benchmarks against state-of-the-art methods.
Authors:Harry Cheng, Ming-Hui Liu, Yangyang Guo, Tianyi Wang, Liqiang Nie, Mohan Kankanhalli
Abstract:
Deepfake detection models face two critical challenges: generalization to unseen manipulations and demographic fairness among population groups. However, existing approaches often demonstrate that these two objectives are inherently conflicting, revealing a trade-off between them. In this paper, we, for the first time, uncover and formally define a causal relationship between fairness and generalization. Building on the back-door adjustment, we show that controlling for confounders (data distribution and model capacity) enables improved generalization via fairness interventions. Motivated by this insight, we propose Demographic Attribute-insensitive Intervention Detection (DAID), a plug-and-play framework composed of: i) Demographic-aware data rebalancing, which employs inverse-propensity weighting and subgroup-wise feature normalization to neutralize distributional biases; and ii) Demographic-agnostic feature aggregation, which uses a novel alignment loss to suppress sensitive-attribute signals. Across three cross-domain benchmarks, DAID consistently achieves superior performance in both fairness and generalization compared to several state-of-the-art detectors, validating both its theoretical foundation and practical effectiveness.
中文: 本文提出DAID框架,通过因果干预控制混杂因素解决深度伪造检测中泛化性与公平性的权衡问题,在多项基准测试中均实现了更优的泛化性能和公平性表现。
English: This paper introduces DAID, a plug-and-play framework that addresses the trade-off between generalization and fairness in deepfake detection by using causal interventions to control confounders, achieving superior performance in both aspects across benchmarks.
Authors:Alina F. Dima, Suprosanna Shit, Huaqi Qiu, Robbie Holland, Tamara T. Mueller, Fabio Antonio Musio, Kaiyuan Yang, Bjoern Menze, Rickmer Braren, Marcus Makowski, Daniel Rueckert
Abstract:
Vessels are complex structures in the body that have been studied extensively in multiple representations. While voxelization is the most common of them, meshes and parametric models are critical in various applications due to their desirable properties. However, these representations are typically extracted through segmentations and used disjointly from each other. We propose a framework that joins the three representations under differentiable transformations. By leveraging differentiable voxelization, we automatically extract a parametric shape model of the vessels through shape-to-segmentation fitting, where we learn shape parameters from segmentations without the explicit need for ground-truth shape parameters. The vessel is parametrized as centerlines and radii using cubic B-splines, ensuring smoothness and continuity by construction. Meshes are differentiably extracted from the learned shape parameters, resulting in high-fidelity meshes that can be manipulated post-fit. Our method can accurately capture the geometry of complex vessels, as demonstrated by the volumetric fits in experiments on aortas, aneurysms, and brain vessels.
中文摘要:本文提出了一种通过可微分变换将血管的体素、网格和参数化表示统一整合的框架,能够从分割结果中自动提取无需真实形状参数指导的光滑参数模型,并为复杂血管结构生成高保真度的网格。
English Summary: This paper introduces a unified framework that integrates voxel, mesh, and parametric representations of vessels through differentiable transformations, enabling automatic extraction of smooth parametric models from segmentations without ground-truth shape parameters and producing high-fidelity meshes for complex vascular structures.
Authors:Buzhen Huang, Chen Li, Chongyang Xu, Dongyue Lu, Jinnan Chen, Yangang Wang, Gim Hee Lee
Abstract:
Due to visual ambiguities and inter-person occlusions, existing human pose estimation methods cannot recover plausible close interactions from in-the-wild videos. Even state-of-the-art large foundation models~(\eg, SAM) cannot accurately distinguish human semantics in such challenging scenarios. In this work, we find that human appearance can provide a straightforward cue to address these obstacles. Based on this observation, we propose a dual-branch optimization framework to reconstruct accurate interactive motions with plausible body contacts constrained by human appearances, social proxemics, and physical laws. Specifically, we first train a diffusion model to learn the human proxemic behavior and pose prior knowledge. The trained network and two optimizable tensors are then incorporated into a dual-branch optimization framework to reconstruct human motions and appearances. Several constraints based on 3D Gaussians, 2D keypoints, and mesh penetrations are also designed to assist the optimization. With the proxemics prior and diverse constraints, our method is capable of estimating accurate interactions from in-the-wild videos captured in complex environments. We further build a dataset with pseudo ground-truth interaction annotations, which may promote future research on pose estimation and human behavior understanding. Experimental results on several benchmarks demonstrate that our method outperforms existing approaches. The code and data are available at https://www.buzhenhuang.com/works/CloseApp.html.
中文: 本研究提出了一种双分支优化框架,利用人类外观线索从野外视频中重建精确的交互动作,通过社交距离、物理规律和外观约束克服了视觉模糊和遮挡问题。
English: This study introduces a dual-branch optimization framework that leverages human appearance cues to reconstruct accurate interactive motions from in-the-wild videos, overcoming ambiguities and occlusions through constraints on proxemics, physics, and appearance.
Authors:Shivam Chaubey, Francesco Verdoja, Shankar Deka, Ville Kyrki
Abstract:
Shared control combines human intention with autonomous decision-making, from low-level safety overrides to high-level task guidance, enabling systems that adapt to users while ensuring safety and performance. This enhances task effectiveness and user experience across domains such as assistive robotics, teleoperation, and autonomous driving. However, existing shared control methods, based on e.g. Model Predictive Control, Control Barrier Functions, or learning-based control, struggle with feasibility, scalability, or safety guarantees, particularly since the user input is unpredictable.
To address these challenges, we propose an assistive controller framework based on Constrained Optimal Control Problem that incorporates an offline-computed Control Invariant Set, enabling online computation of control actions that ensure feasibility, strict constraint satisfaction, and minimal override of user intent. Moreover, the framework can accommodate structured class of non-convex constraints, which are common in real-world scenarios. We validate the approach through a large-scale user study with 66 participants--one of the most extensive in shared control research--using a computer game environment to assess task load, trust, and perceived control, in addition to performance. The results show consistent improvements across all these aspects without compromising safety and user intent.
中文摘要:本研究提出的辅助控制框架通过离线计算控制不变集,解决了现有共享控制方法在可行性、安全性和用户意图覆盖方面的不足,大规模用户实验证明该框架在保证安全的同时显著提升了任务性能、用户信任度和控制感知。
English Summary: The proposed assistive controller framework overcomes limitations of existing shared control methods by ensuring safety, feasibility, and minimal user intent override through an offline-computed Control Invariant Set, with large-scale user studies confirming improved performance, trust, and perceived control.
Authors:Yoav Gelberg, Yam Eitan, Aviv Navon, Aviv Shamsian, Theo, Putterman, Michael Bronstein, Haggai Maron
Abstract:
Gradients of neural networks encode valuable information for optimization, editing, and analysis of models. Therefore, practitioners often treat gradients as inputs to task-specific algorithms, e.g. for pruning or optimization. Recent works explore learning algorithms that operate directly on gradients but use architectures that are not specifically designed for gradient processing, limiting their applicability. In this paper, we present a principled approach for designing architectures that process gradients. Our approach is guided by three principles: (1) equivariant design that preserves neuron permutation symmetries, (2) processing sets of gradients across multiple data points to capture curvature information, and (3) efficient gradient representation through rank-1 decomposition. Based on these principles, we introduce GradMetaNet, a novel architecture for learning on gradients, constructed from simple equivariant blocks. We prove universality results for GradMetaNet, and show that previous approaches cannot approximate natural gradient-based functions that GradMetaNet can. We then demonstrate GradMetaNet's effectiveness on a diverse set of gradient-based tasks on MLPs and transformers, such as learned optimization, INR editing, and estimating loss landscape curvature.
中文摘要:本文提出了GradMetaNet,一种专为处理神经网络梯度而设计的架构,通过等变性、多点梯度集和高效表示等原则,在多种基于梯度的任务中超越了现有方法。
English Summary: This paper introduces GradMetaNet, a principled architecture designed specifically for processing neural network gradients, incorporating equivariance, multi-point gradient sets, and efficient representations to outperform existing methods in various gradient-based tasks.
Authors:Chris Yuhao Liu, Liang Zeng, Yuzhen Xiao, Jujie He, Jiacai Liu, Chaojie Wang, Rui Yan, Wei Shen, Fuxiang Zhang, Jiacheng Xu, Yang Liu, Yahui Zhou
Abstract:
Despite the critical role of reward models (RMs) in reinforcement learning from human feedback (RLHF), current state-of-the-art open RMs perform poorly on most existing evaluation benchmarks, failing to capture the spectrum of nuanced and sophisticated human preferences. Even approaches that incorporate advanced training techniques have not yielded meaningful performance improvements. We hypothesize that this brittleness stems primarily from limitations in preference datasets, which are often narrowly scoped, synthetically labeled, or lack rigorous quality control. To address these challenges, we present a large-scale preference dataset comprising 40 million preference pairs, named SynPref-40M. To enable data curation at scale, we design a human-AI synergistic two-stage pipeline that leverages the complementary strengths of human annotation quality and AI scalability. In this pipeline, humans provide verified annotations, while large language models perform automatic curation based on human guidance. Training on this preference mixture, we introduce Skywork-Reward-V2, a suite of eight reward models ranging from 0.6B to 8B parameters, trained on a carefully curated subset of 26 million preference pairs from SynPref-40M. We demonstrate that Skywork-Reward-V2 is versatile across a wide range of capabilities, including alignment with human preferences, objective correctness, safety, resistance to stylistic biases, and best-of-N scaling, achieving state-of-the-art performance across seven major reward model benchmarks. Ablation studies confirm that the effectiveness of our approach stems not only from data scale but also from high-quality curation. The Skywork-Reward-V2 series represents substantial progress in open reward models, highlighting the untapped potential of existing preference datasets and demonstrating how human-AI curation synergy can unlock significantly higher data quality.
中文: 现有奖励模型因偏好数据集局限而表现不佳,但通过人机协同构建的SynPref-40M数据集和Skywork-Reward-V2模型,在七大基准测试中实现了最优性能。
English: Current reward models underperform due to limited preference datasets, but the newly developed SynPref-40M dataset and Skywork-Reward-V2 models achieve state-of-the-art performance through human-AI collaborative curation.
Authors:Chiyu Hao, Jixian Su, Shixuan Sun, Hao Zhang, Sen Gao, Jianwen Zhao, Chenyi Zhang, Jieru Zhao, Chen Chen, Minyi Guo
Abstract:
Dynamic graph storage systems are essential for real-time applications such as social networks and recommendation, where graph data continuously evolves. However, they face significant challenges in efficiently handling concurrent read and write operations. We find that existing methods suffer from write queries interfering with read efficiency, substantial time and space overhead due to per-edge versioning, and an inability to balance performance, such as slow searches under concurrent workloads. To address these issues, we propose RapidStore, a holistic approach for efficient in-memory dynamic graph storage designed for read-intensive workloads. Our key idea is to exploit the characteristics of graph queries through a decoupled system design that separates the management of read and write queries and decouples version data from graph data. Particularly, we design an efficient dynamic graph store to cooperate with the graph concurrency control mechanism. Experimental results demonstrate that RapidStore enables fast and scalable concurrent graph queries, effectively balancing the performance of inserts, searches, and scans, and significantly improving efficiency in dynamic graph storage systems.
中文摘要:RapidStore是一种高效的内存动态图存储系统,通过读写管理解耦和版本数据与图数据分离的设计,有效解决了并发操作中的性能问题,显著提升了查询速度与系统可扩展性。
English Summary: RapidStore is an efficient in-memory dynamic graph storage system that addresses performance issues in concurrent operations by decoupling read and write management and separating version data from graph data, achieving significant improvements in query speed and scalability.
Authors:Haoxuan Jiang, Peicong Qian, Yusen Xie, Xiaocong Li, Ming Liu, Jun Ma
Abstract:
LiDAR-based localization serves as a critical component in autonomous systems, yet existing approaches face persistent challenges in balancing repeatability, accuracy, and environmental adaptability. Traditional point cloud registration methods relying solely on offline maps often exhibit limited robustness against long-term environmental changes, leading to localization drift and reliability degradation in dynamic real-world scenarios. To address these challenges, this paper proposes DuLoc, a robust and accurate localization method that tightly couples LiDAR-inertial odometry with offline map-based localization, incorporating a constant-velocity motion model to mitigate outlier noise in real-world scenarios. Specifically, we develop a LiDAR-based localization framework that seamlessly integrates a prior global map with dynamic real-time local maps, enabling robust localization in unbounded and changing environments. Extensive real-world experiments in ultra unbounded port that involve 2,856 hours of operational data across 32 Intelligent Guided Vehicles (IGVs) are conducted and reported in this study. The results attained demonstrate that our system outperforms other state-of-the-art LiDAR localization systems in large-scale changing outdoor environments.
中文: 本文提出DuLoc方法,通过紧密融合激光雷达惯性里程计与离线地图定位,结合恒定速度运动模型,有效提升了动态环境中定位的鲁棒性和精度,在大规模实际场景测试中表现优于现有先进系统。
English: This paper introduces DuLoc, a robust LiDAR-based localization method that integrates LiDAR-inertial odometry with offline maps to enhance accuracy and adaptability in dynamic environments, demonstrating superior performance in large-scale real-world tests.
Authors:Ante Wang, Yujie Lin, Jingyao Liu, Suhang Wu, Hao Liu, Xinyan Xiao, Jinsong Su
Abstract:
Critical thinking is essential for building robust AI systems, preventing them from blindly accepting flawed data or biased reasoning. However, prior work has primarily focused on passive critical thinking, where models simply reject problematic queries without taking constructive steps to address user requests. In this work, we introduce proactive critical thinking, a paradigm where models actively seek missing or clarifying information from users to resolve their queries better. To evaluate this capability, we present GSM-MC and GSM-MCE, two novel benchmarks based on GSM8K for assessing mathematical reasoning under incomplete or misleading conditions. GSM-MC contains 1,368 math problems with a key variable deliberately removed, requiring models to identify and request the missing information. GSM-MCE further increases the difficulty by introducing irrelevant details to test robustness against distractions. Experiments on Qwen3 and Llama series models show that, while these models excel in traditional reasoning tasks due to extensive post-training and inference-time scaling, they struggle with proactive critical thinking, especially smaller ones. However, we demonstrate that reinforcement learning (RL) can significantly improve this ability. Using our enhanced RL algorithm, we achieve substantial gains, boosting the Qwen3-1.7B's accuracy from 0.15% to 73.98% on GSM-MC. We hope this work advances models that collaborate more effectively with users in problem-solving through proactive critical thinking.
中文摘要:本文提出主动批判性思维,使AI模型能主动寻求缺失信息来解决用户问题,并通过GSM-MC/MC基准测试表明,虽然现有模型在此能力上表现不佳,但强化学习可显著提升其性能。
English Summary: This paper introduces proactive critical thinking, where AI models actively seek missing information to resolve user queries, and presents GSM-MC/MC benchmarks showing that while current models struggle with this ability, reinforcement learning can dramatically improve their performance.
Authors:Kohou Wang, ZhaoXiang Liu, Lin Bai, Kun Fan, Xiang Liu, Huan Hu, Kai Wang, Shiguo Lian
Abstract:
It is crucial that robots' performance can be improved after deployment, as they are inherently likely to encounter novel scenarios never seen before. This paper presents an innovative solution: an interactive learning-based robot system powered by a Multi-modal Large Language Model(MLLM). A key feature of our system is its ability to learn from natural dialogues with non-expert users. We also propose chain of question to clarify the exact intent of the question before providing an answer and dual-modality retrieval modules to leverage these interaction events to avoid repeating same mistakes, ensuring a seamless user experience before model updates, which is in contrast to current mainstream MLLM-based robotic systems. Our system marks a novel approach in robotics by integrating interactive learning, paving the way for superior adaptability and performance in diverse environments. We demonstrate the effectiveness and improvement of our method through experiments, both quantitively and qualitatively.
中文摘要:本文提出了一种基于多模态大语言模型的交互式学习机器人系统,通过与普通用户的自然对话进行学习,采用问题链澄清机制和双模态检索模块,有效提升机器人在不同环境中的适应性和性能表现。
English Summary: This paper introduces an interactive learning-based robot system using a Multi-modal Large Language Model that learns from natural dialogues with non-expert users, enhancing adaptability and performance through innovative features like chain of questioning and dual-modality retrieval.
Authors:Haipeng Li, Tianhao Zhou, Zhanglei Yang, Yi Wu, Yan Chen, Zijing Mao, Shen Cheng, Bing Zeng, Shuaicheng Liu
Abstract:
Estimating 2D camera motion is a fundamental computer vision task that models the projection of 3D camera movements onto the 2D image plane. Current methods rely on either homography-based approaches, limited to planar scenes, or meshflow techniques that use grid-based local homographies but struggle with complex non-linear transformations. A key insight of our work is that combining flow fields from different homographies creates motion patterns that cannot be represented by any single homography. We introduce CamFlow, a novel framework that represents camera motion using hybrid motion bases: physical bases derived from camera geometry and stochastic bases for complex scenarios. Our approach includes a hybrid probabilistic loss function based on the Laplace distribution that enhances training robustness. For evaluation, we create a new benchmark by masking dynamic objects in existing optical flow datasets to isolate pure camera motion. Experiments show CamFlow outperforms state-of-the-art methods across diverse scenarios, demonstrating superior robustness and generalization in zero-shot settings. Code and datasets are available at our project page: https://lhaippp.github.io/CamFlow/.
中文:CamFlow提出了一种新颖的相机运动估计框架,通过混合运动基和鲁棒的概率损失函数,在多样化场景中超越现有方法,并在零样本设置下展现出卓越的泛化能力。
English: CamFlow introduces a novel camera motion estimation framework using hybrid motion bases and a robust probabilistic loss, outperforming existing methods across diverse scenarios and demonstrating superior generalization in zero-shot settings.
Authors:Yung-Sung Chuang, Yang Li, Dong Wang, Ching-Feng Yeh, Kehan Lyu, Ramya Raghavendra, James Glass, Lifei Huang, Jason Weston, Luke Zettlemoyer, Xinlei Chen, Zhuang Liu, Saining Xie, Wen-tau Yih, Shang-Wen Li, Hu Xu
Abstract:
Contrastive Language-Image Pretraining (CLIP) is a popular foundation model, supporting from zero-shot classification, retrieval to encoders for multimodal large language models (MLLMs). Although CLIP is successfully trained on billion-scale image-text pairs from the English world, scaling CLIP's training further to learning from the worldwide web data is still challenging: (1) no curation method is available to handle data points from non-English world; (2) the English performance from existing multilingual CLIP is worse than its English-only counterpart, i.e., "curse of multilinguality" that is common in LLMs. Here, we present Meta CLIP 2, the first recipe training CLIP from scratch on worldwide web-scale image-text pairs. To generalize our findings, we conduct rigorous ablations with minimal changes that are necessary to address the above challenges and present a recipe enabling mutual benefits from English and non-English world data. In zero-shot ImageNet classification, Meta CLIP 2 ViT-H/14 surpasses its English-only counterpart by 0.8% and mSigLIP by 0.7%, and surprisingly sets new state-of-the-art without system-level confounding factors (e.g., translation, bespoke architecture changes) on multilingual benchmarks, such as CVQA with 57.4%, Babel-ImageNet with 50.2% and XM3600 with 64.3% on image-to-text retrieval.
中文:Meta CLIP 2 提出了一种基于全球网络规模图文对的全新CLIP训练方案,在多语言基准测试中创下最新性能记录,并在零样本分类和检索任务中超越了仅使用英语数据的模型。
English: Meta CLIP 2 introduces a novel training approach for CLIP using worldwide web-scale image-text pairs, achieving state-of-the-art performance in multilingual benchmarks and surpassing English-only models in zero-shot classification and retrieval tasks.
Authors:Shengcai Zhou, Luping Xiang, Kun Yang, Kai Kit Wong, Dapeng Oliver Wu, Chan-Byoung Chae
Abstract:
Airborne mobile Integrated Sensing and Communication (ISAC) base stations have garnered significant attention recently, with ISAC technology being a crucial application for 6G networks. Since ISAC can sense potential mobile communication users, this paper studies an effective scheme for a multi-UAV network tailored for emergency communication. In this paper, we develop a temporal-assisted frame structure utilizing integrated omnidirectional and directional beampattern to facilitate efficient and frequent searching, with extended Kalman filtering (EKF) as an aid to beam alignment. Further, we address an optimization problem to maximize the total achievable rate per slot by jointly designing UAV beamforming, load management, and UAV direction planning, all while adhering to the constraints of the predicted beam coverage. Given the problem NP-hard, we introduce three robust mechanisms for its resolution: an enhanced distributed Successive Convex Approximation (SCA)-Iterative Rank Minimization (IRM) algorithm, an coalition game approach, and a Fermat point search method. In particular, the proposed SCA-IRM algorithm decomposes the original complex optimization problem into several sub-problems and assigns them equally to each UAV, so as to realize distributed computing and improve computational efficiency. Our proposed simulations demonstrate the improved system performance in terms of communication rate, fairness, and sensing accuracy, providing design guidelines of UAV-assisted emergency communication networking.
中文: 本文针对应急通信提出了一种多无人机网络方案,采用通感一体化技术,通过时域辅助帧结构和分布式优化算法来提高通信速率和系统性能。
English: This paper proposes a multi-UAV network scheme for emergency communication that employs integrated sensing and communication technology, featuring a temporal-assisted frame structure and distributed optimization algorithms to enhance communication rates and system efficiency.
Authors:Jianhui Wang, Yinda Chen, Yangfan He, Xinyuan Song, Yi Xin, Dapeng Zhang, Zhongwei Wan, Bin Li, Rongchao Zhang
Abstract:
Video editing is a critical component of content creation that transforms raw footage into coherent works aligned with specific visual and narrative objectives. Existing approaches face two major challenges: temporal inconsistencies due to failure in capturing complex motion patterns, and overfitting to simple prompts arising from limitations in UNet backbone architectures. While learning-based methods can enhance editing quality, they typically demand substantial computational resources and are constrained by the scarcity of high-quality annotated data. In this paper, we present Vid-TTA, a lightweight test-time adaptation framework that personalizes optimization for each test video during inference through self-supervised auxiliary tasks. Our approach incorporates a motion-aware frame reconstruction mechanism that identifies and preserves crucial movement regions, alongside a prompt perturbation and reconstruction strategy that strengthens model robustness to diverse textual descriptions. These innovations are orchestrated by a meta-learning driven dynamic loss balancing mechanism that adaptively adjusts the optimization process based on video characteristics. Extensive experiments demonstrate that Vid-TTA significantly improves video temporal consistency and mitigates prompt overfitting while maintaining low computational overhead, offering a plug-and-play performance boost for existing video editing models.
中文摘要:Vid-TTA是一种轻量级测试时自适应框架,通过自监督的运动感知重建和提示扰动策略,有效提升视频编辑的时间一致性并减少提示过拟合问题。
English Summary: Vid-TTA is a lightweight test-time adaptation framework that enhances video editing by improving temporal consistency and reducing prompt overfitting through self-supervised motion-aware reconstruction and prompt perturbation strategies.
Authors:HunyuanWorld Team, Zhenwei Wang, Yuhao Liu, Junta Wu, Zixiao Gu, Haoyuan Wang, Xuhui Zuo, Tianyu Huang, Wenhuan Li, Sheng Zhang, Yihang Lian, Yulin Tsai, Lifu Wang, Sicong Liu, Puhua Jiang, Xianghui Yang, Dongyuan Guo, Yixuan Tang, Xinyue Mao, Jiaao Yu, Junlin Yu, Jihong Zhang, Meng Chen, Liang Dong, Yiwen Jia, Chao Zhang, Yonghao Tan, Hao Zhang, Zheng Ye, Peng He, Runzhou Wu, Minghui Chen, Zhan Li, Wangchen Qin, Lei Wang, Yifu Sun, Lin Niu, Xiang Yuan, Xiaofeng Yang, Yingping He, Jie Xiao, Yangyu Tao, Jianchen Zhu, Jinbao Xue, Kai Liu, Chongqing Zhao, Xinming Wu, Tian Liu, Peng Chen, Di Wang, Yuhong Liu, Linus, Jie Jiang, Tengfei Wang, Chunchao Guo
Abstract:
Creating immersive and playable 3D worlds from texts or images remains a fundamental challenge in computer vision and graphics. Existing world generation approaches typically fall into two categories: video-based methods that offer rich diversity but lack 3D consistency and rendering efficiency, and 3D-based methods that provide geometric consistency but struggle with limited training data and memory-inefficient representations. To address these limitations, we present HunyuanWorld 1.0, a novel framework that combines the best of both worlds for generating immersive, explorable, and interactive 3D scenes from text and image conditions. Our approach features three key advantages: 1) 360° immersive experiences via panoramic world proxies; 2) mesh export capabilities for seamless compatibility with existing computer graphics pipelines; 3) disentangled object representations for augmented interactivity. The core of our framework is a semantically layered 3D mesh representation that leverages panoramic images as 360° world proxies for semantic-aware world decomposition and reconstruction, enabling the generation of diverse 3D worlds. Extensive experiments demonstrate that our method achieves state-of-the-art performance in generating coherent, explorable, and interactive 3D worlds while enabling versatile applications in virtual reality, physical simulation, game development, and interactive content creation.
中文: HunyuanWorld 1.0 提出了一种创新框架,通过结合全景世界代理与语义分层网格表示,从文本或图像生成沉浸式、可探索且可交互的3D场景,克服了现有方法的局限,并实现了在虚拟现实等领域的多样化应用。
English: HunyuanWorld 1.0 introduces a novel framework that generates immersive, explorable, and interactive 3D scenes from text or images by combining panoramic world proxies with a semantically layered mesh representation, overcoming limitations of existing methods and enabling versatile applications in virtual reality and beyond.
Authors:Meilin Zhu, Gaojie Jin, Xiaowei Huang, Lijun Zhang
Abstract:
In question-answering tasks, determining when to trust the outputs is crucial to the alignment of large language models (LLMs). Kuhn et al. (2023) introduces semantic entropy as a measure of uncertainty, by incorporating linguistic invariances from the same meaning. It primarily relies on setting threshold to measure the level of semantic equivalence relation. We propose a more nuanced framework that extends beyond such thresholding by developing a Shapley-based uncertainty metric that captures the continuous nature of semantic relationships. We establish three fundamental properties that characterize valid uncertainty metrics and prove that our Shapley uncertainty satisfies these criteria. Through extensive experiments, we demonstrate that our Shapley uncertainty more accurately predicts LLM performance in question-answering and other datasets, compared to similar baseline measures.
中文: 本文提出了一种基于沙普利值的语义不确定性度量方法,该方法能捕捉连续的语义关系,经证明满足三项关键标准,并通过实验验证其在预测大语言模型性能方面优于语义熵等基线方法。
English: This paper introduces a Shapley-based uncertainty metric that captures continuous semantic relationships, proving it meets three key criteria and demonstrating its superior accuracy in predicting LLM performance over baseline measures like semantic entropy.
Authors:Lukman Jibril Aliyu, Umar Sani Muhammad, Bilqisu Ismail, Nasiru Muhammad, Almustapha A Wakili, Seid Muhie Yimam, Shamsuddeen Hassan Muhammad, Mustapha Abdullahi
Abstract:
Wildlife populations in Africa face severe threats, with vertebrate numbers declining by over 65% in the past five decades. In response, image classification using deep learning has emerged as a promising tool for biodiversity monitoring and conservation. This paper presents a comparative study of deep learning models for automatically classifying African wildlife images, focusing on transfer learning with frozen feature extractors. Using a public dataset of four species: buffalo, elephant, rhinoceros, and zebra; we evaluate the performance of DenseNet-201, ResNet-152, EfficientNet-B4, and Vision Transformer ViT-H/14. DenseNet-201 achieved the best performance among convolutional networks (67% accuracy), while ViT-H/14 achieved the highest overall accuracy (99%), but with significantly higher computational cost, raising deployment concerns. Our experiments highlight the trade-offs between accuracy, resource requirements, and deployability. The best-performing CNN (DenseNet-201) was integrated into a Hugging Face Gradio Space for real-time field use, demonstrating the feasibility of deploying lightweight models in conservation settings. This work contributes to African-grounded AI research by offering practical insights into model selection, dataset preparation, and responsible deployment of deep learning tools for wildlife conservation.
中文摘要:本研究比较了非洲野生动物图像分类的深度学习模型,发现虽然Vision Transformer达到99%准确率,但DenseNet-201在性能与部署可行性间取得最佳平衡,更适合保护实践。
English Summary: This study compares deep learning models for classifying African wildlife images, finding that while Vision Transformer achieved 99% accuracy, DenseNet-201 offered the best balance of performance and deployability for conservation applications.
Authors:Zedong Wang, Siyuan Li, Dan Xu
Abstract:
Despite the promise of Multi-Task Learning in leveraging complementary knowledge across tasks, existing multi-task optimization (MTO) techniques remain fixated on resolving conflicts via optimizer-centric loss scaling and gradient manipulation strategies, yet fail to deliver consistent gains. In this paper, we argue that the shared representation space, where task interactions naturally occur, offers rich information and potential for operations complementary to existing optimizers, especially for facilitating the inter-task complementarity, which is rarely explored in MTO. This intuition leads to Rep-MTL, which exploits the representation-level task saliency to quantify interactions between task-specific optimization and shared representation learning. By steering these saliencies through entropy-based penalization and sample-wise cross-task alignment, Rep-MTL aims to mitigate negative transfer by maintaining the effective training of individual tasks instead pure conflict-solving, while explicitly promoting complementary information sharing. Experiments are conducted on four challenging MTL benchmarks covering both task-shift and domain-shift scenarios. The results show that Rep-MTL, even paired with the basic equal weighting policy, achieves competitive performance gains with favorable efficiency. Beyond standard performance metrics, Power Law exponent analysis demonstrates Rep-MTL's efficacy in balancing task-specific learning and cross-task sharing. The project page is available at HERE.
中文: 现有多任务优化方法过度依赖优化器解决任务冲突,而忽视了共享表示空间的潜力;Rep-MTL通过表征层面的任务显著性量化任务交互,在缓解负迁移的同时显式促进跨任务互补性学习。
English: Current multi-task optimization methods overly focus on resolving task conflicts through optimizers but overlook the potential of shared representation spaces, leading to the development of Rep-MTL which leverages representation-level saliency to enhance complementary knowledge sharing and mitigate negative transfer.
Authors:Xinqi Lin, Fanghua Yu, Jinfan Hu, Zhiyuan You, Wu Shi, Jimmy S. Ren, Jinjin Gu, Chao Dong
Abstract:
Deep image restoration models aim to learn a mapping from degraded image space to natural image space. However, they face several critical challenges: removing degradation, generating realistic details, and ensuring pixel-level consistency. Over time, three major classes of methods have emerged, including MSE-based, GAN-based, and diffusion-based methods. However, they fail to achieve a good balance between restoration quality, fidelity, and speed. We propose a novel method, HYPIR, to address these challenges. Our solution pipeline is straightforward: it involves initializing the image restoration model with a pre-trained diffusion model and then fine-tuning it with adversarial training. This approach does not rely on diffusion loss, iterative sampling, or additional adapters. We theoretically demonstrate that initializing adversarial training from a pre-trained diffusion model positions the initial restoration model very close to the natural image distribution. Consequently, this initialization improves numerical stability, avoids mode collapse, and substantially accelerates the convergence of adversarial training. Moreover, HYPIR inherits the capabilities of diffusion models with rich user control, enabling text-guided restoration and adjustable texture richness. Requiring only a single forward pass, it achieves faster convergence and inference speed than diffusion-based methods. Extensive experiments show that HYPIR outperforms previous state-of-the-art methods, achieving efficient and high-quality image restoration.
中文: HYPIR是一种创新的图像修复方法,通过预训练扩散模型初始化并采用对抗训练微调,无需依赖扩散损失或迭代采样,即可实现高质量修复,同时提升稳定性、速度并支持用户控制。
English: HYPIR is a novel image restoration method that initializes with a pre-trained diffusion model and fine-tunes via adversarial training, achieving high-quality results with enhanced stability, speed, and user control without relying on diffusion loss or iterative sampling.
Authors:Chaitanya Manem, Pratik Prabhanjan Brahma, Prakamya Mishra, Zicheng Liu, Emad Barsoum
Abstract:
The demand for Large Language Models (LLMs) capable of sophisticated mathematical reasoning is growing across industries. However, the development of performant mathematical LLMs is critically bottlenecked by the scarcity of difficult, novel training data. We introduce \textbf{SAND-Math} (Synthetic Augmented Novel and Difficult Mathematics problems and solutions), a pipeline that addresses this by first generating high-quality problems from scratch and then systematically elevating their complexity via a new \textbf{Difficulty Hiking} step. We demonstrate the effectiveness of our approach through two key findings. First, augmenting a strong baseline with SAND-Math data significantly boosts performance, outperforming the next-best synthetic dataset by \textbf{$\uparrow$ 17.85 absolute points} on the AIME25 benchmark. Second, in a dedicated ablation study, we show our Difficulty Hiking process is highly effective: by increasing average problem difficulty from 5.02 to 5.98, this step lifts AIME25 performance from 46.38\% to 49.23\%. The full generation pipeline, final dataset, and a fine-tuned model form a practical and scalable toolkit for building more capable and efficient mathematical reasoning LLMs. SAND-Math dataset is released here: \href{https://huggingface.co/datasets/amd/SAND-MATH}{https://huggingface.co/datasets/amd/SAND-MATH}
Chinese: SAND-Math 管道通过生成并提升数学问题难度来解决训练数据稀缺问题,在 AIME25 基准测试中使大语言模型性能显著提升高达 17.85 分。
English: The SAND-Math pipeline generates and enhances mathematical problems to overcome training data scarcity, significantly boosting LLM performance on benchmarks like AIME25 by up to 17.85 points.
Authors:Haiyang Bai, Jiaqi Zhu, Songru Jiang, Wei Huang, Tao Lu, Yuanqi Li, Jie Guo, Runze Fu, Yanwen Guo, Lijun Chen
Abstract:
We propose a 3D Gaussian splatting-based framework for outdoor relighting that leverages intrinsic image decomposition to precisely integrate sunlight, sky radiance, and indirect lighting from unconstrained photo collections. Unlike prior methods that compress the per-image global illumination into a single latent vector, our approach enables simultaneously diverse shading manipulation and the generation of dynamic shadow effects. This is achieved through three key innovations: (1) a residual-based sun visibility extraction method to accurately separate direct sunlight effects, (2) a region-based supervision framework with a structural consistency loss for physically interpretable and coherent illumination decomposition, and (3) a ray-tracing-based technique for realistic shadow simulation. Extensive experiments demonstrate that our framework synthesizes novel views with competitive fidelity against state-of-the-art relighting solutions and produces more natural and multifaceted illumination and shadow effects.
中文: 本研究提出了一种基于3D高斯泼溅的户外重光照框架,通过本征图像分解整合阳光、天空辐射和间接光照,借助阳光可见性提取、区域监督和光线追踪技术,实现了多样化的阴影操控和动态阴影效果。
English: This study introduces a 3D Gaussian splatting framework for outdoor relighting that uses intrinsic image decomposition to integrate sunlight, sky radiance, and indirect lighting, enabling diverse shading manipulation and dynamic shadows through innovations in sun visibility extraction, region-based supervision, and ray-tracing for realistic effects.
Authors:Cheng-Fu Yang, Thanh Tran, Christos Christodoulopoulos, Weitong Ruan, Rahul Gupta, Kai-Wei Chang
Abstract:
A multi-modal guardrail must effectively filter image content based on user-defined policies, identifying material that may be hateful, reinforce harmful stereotypes, contain explicit material, or spread misinformation. Deploying such guardrails in real-world applications, however, poses significant challenges. Users often require varied and highly customizable policies and typically cannot provide abundant examples for each custom policy. Consequently, an ideal guardrail should be scalable to the multiple policies and adaptable to evolving user standards with minimal retraining. Existing fine-tuning methods typically condition predictions on pre-defined policies, restricting their generalizability to new policies or necessitating extensive retraining to adapt. Conversely, training-free methods struggle with limited context lengths, making it difficult to incorporate all the policies comprehensively. To overcome these limitations, we propose to condition model's judgment on "precedents", which are the reasoning processes of prior data points similar to the given input. By leveraging precedents instead of fixed policies, our approach greatly enhances the flexibility and adaptability of the guardrail. In this paper, we introduce a critique-revise mechanism for collecting high-quality precedents and two strategies that utilize precedents for robust prediction. Experimental results demonstrate that our approach outperforms previous methods across both few-shot and full-dataset scenarios and exhibits superior generalization to novel policies.
中文: 本文提出的护栏系统采用基于先例的方法和审阅修订机制,显著提升了内容过滤的自定义策略适应性和灵活性,无需大量重新训练即可在新策略上优于现有方法。
English: The proposed guardrail system uses a precedent-based approach with a critique-revise mechanism to enhance flexibility and adaptability, outperforming existing methods in filtering content according to customizable policies without extensive retraining.
Authors:Siyi Wu, Junqiao Wang, Zhaoyang Guan, Leyi Zhao, Xinyuan Song, Xinyu Ying, Dexu Yu, Jinhao Wang, Hanlin Zhang, Michele Pak, Yangfan He, Yi Xin, Jianhui Wang, Tianyu Shi
Abstract:
Cryptocurrency trading is a challenging task requiring the integration of heterogeneous data from multiple modalities. Traditional deep learning and reinforcement learning approaches typically demand large training datasets and encode diverse inputs into numerical representations, often at the cost of interpretability. Recent progress in large language model (LLM)-based agents has demonstrated the capacity to process multi-modal data and support complex investment decision-making. Building on these advances, we present \textbf{MountainLion}, a multi-modal, multi-agent system for financial trading that coordinates specialized LLM-based agents to interpret financial data and generate investment strategies. MountainLion processes textual news, candlestick charts, and trading signal charts to produce high-quality financial reports, while also enabling modification of reports and investment recommendations through data-driven user interaction and question answering. A central reflection module analyzes historical trading signals and outcomes to continuously refine decision processes, and the system is capable of real-time report analysis, summarization, and dynamic adjustment of investment strategies. Empirical results confirm that MountainLion systematically enriches technical price triggers with contextual macroeconomic and capital flow signals, providing a more interpretable, robust, and actionable investment framework that improves returns and strengthens investor confidence.
中文: MountainLion是一个多模态、多智能体系统,通过协调专门的基于大语言模型的智能体来处理各类金融数据、生成可解释的投资策略,并借助反思模块持续优化决策,从而提高收益并增强投资者信心。
English: MountainLion is a multi-modal, multi-agent system that uses specialized LLM-based agents to process diverse financial data, generate interpretable investment strategies, and continuously refine decisions through reflection, resulting in improved returns and investor confidence.
Authors:Yinuo Deng, Hailiang Zhao, Dongjing Wang, Peng Chen, Wenzhuo Qian, Jianwei Yin, Schahram Dustdar, Shuiguang Deng
Abstract:
Efficient container image distribution is crucial for enabling machine learning inference at the network edge, where resource limitations and dynamic network conditions create significant challenges. In this paper, we present PeerSync, a decentralized P2P-based system designed to optimize image distribution in edge environments. PeerSync employs a popularity- and network-aware download engine that dynamically adapts to content popularity and real-time network conditions using a sliding window mechanism. PeerSync further integrates automated tracker election for rapid peer discovery and dynamic cache management for efficient storage utilization. We implement PeerSync with 8000+ lines of Rust code and test its performance extensively on both physical edge devices and Docker-based emulations. Experimental results show that PeerSync delivers a remarkable speed increase of 2.72$\times$, 1.79$\times$, and 1.28$\times$ compared to the Baseline, Dragonfly, and Kraken, respectively, while significantly reducing peak cross-network traffic by 90.72\% under congested and varying network conditions.
中文: 本文提出PeerSync这一去中心化P2P系统,通过动态适应内容流行度和网络状况来优化边缘计算中的容器镜像分发,相比现有方案实现了显著的速度提升和流量削减。
English: This paper introduces PeerSync, a decentralized P2P system that optimizes container image distribution for edge computing by dynamically adapting to content popularity and network conditions, achieving significant speed improvements and traffic reduction compared to existing solutions.
Authors:Andrei Vlad Man, RÄzvan-Alexandru SmÄdu, Cristian-George Craciun, Dumitru-Clementin Cercel, Florin Pop, Mihaela-Claudia Cercel
Abstract:
The intersection of AI and legal systems presents a growing need for tools that support legal education, particularly in under-resourced languages such as Romanian. In this work, we aim to evaluate the capabilities of Large Language Models (LLMs) and Vision-Language Models (VLMs) in understanding and reasoning about Romanian driving law through textual and visual question-answering tasks. To facilitate this, we introduce RoD-TAL, a novel multimodal dataset comprising Romanian driving test questions, text-based and image-based, alongside annotated legal references and human explanations. We implement and assess retrieval-augmented generation (RAG) pipelines, dense retrievers, and reasoning-optimized models across tasks including Information Retrieval (IR), Question Answering (QA), Visual IR, and Visual QA. Our experiments demonstrate that domain-specific fine-tuning significantly enhances retrieval performance. At the same time, chain-of-thought prompting and specialized reasoning models improve QA accuracy, surpassing the minimum grades required to pass driving exams. However, visual reasoning remains challenging, highlighting the potential and the limitations of applying LLMs and VLMs to legal education.
中文摘要:本研究通过新构建的RoD-TAL数据集评估大语言模型和视觉语言模型对罗马尼亚交通法规的理解能力,结果表明领域微调与推理技术能有效提升文本任务表现,但视觉推理仍是当前技术瓶颈。
English Summary: This study evaluates the performance of LLMs and VLMs in understanding Romanian driving laws using the newly developed RoD-TAL dataset, showing that fine-tuning and reasoning techniques improve text-based tasks but visual reasoning remains a challenge.
Authors:Haochen Wang, Qirui Chen, Cilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, Weidi Xie, Stratis Gavves
Abstract:
Video Large Language Models (VideoLLMs) have recently demonstrated remarkable progress in general video understanding. However, existing models primarily focus on high-level comprehension and are limited to text-only responses, restricting the flexibility for object-centric, multiround interactions. In this paper, we make three contributions: (i) we address these limitations by introducing a VideoLLM model, capable of performing both object referring for input and grounding for output in video reasoning tasks, i.e., allowing users to interact with videos using both textual and visual prompts; (ii) we propose STOM (Spatial-Temporal Overlay Module), a novel approach that propagates arbitrary visual prompts input at any single timestamp to the remaining frames within a video; (iii) we present VideoInfer, a manually curated object-centric video instruction dataset featuring questionanswering pairs that require reasoning. We conduct comprehensive experiments on VideoInfer and other existing benchmarks across video question answering and referring object segmentation. The results on 12 benchmarks of 6 tasks show that our proposed model consistently outperforms baselines in both video question answering and segmentation, underscoring its robustness in multimodal, object-centric video and image understanding. Project page: https://qirui-chen.github.io/RGA3-release/.
中文: 本文提出一种支持多模态交互的视频大语言模型,通过新颖的时空覆盖模块增强以物体为中心的视频理解能力,在12个基准测试中均表现优异。
English: This paper introduces a VideoLLM that enhances object-centric video understanding by supporting multimodal interactions and a novel Spatial-Temporal Overlay Module, achieving superior performance across 12 benchmarks.
Authors:Xinyang Wang, Hongwei Zhang, Jun Xu, Shimin Wang, Martin Guay
Abstract:
Safety and stability are two critical concerns in pursuit-evasion (PE) problems in an obstacle-rich environment. Most existing works combine control barrier functions (CBFs) and reinforcement learning (RL) to provide an efficient and safe solution. However, they do not consider the presence of disturbances, such as wind gust and actuator fault, which may exist in many practical applications. This paper integrates CBFs and a sliding mode control (SMC) term into RL to simultaneously address safety, stability, and robustness to disturbances. However, this integration is significantly challenging due to the strong coupling between the CBF and SMC terms. Inspired by Stackelberg game, we handle the coupling issue by proposing a hierarchical design scheme where SMC and safe control terms interact with each other in a leader-follower manner. Specifically, the CBF controller, acting as the leader, enforces safety independently of the SMC design; while the SMC term, as the follower, is designed based on the CBF controller. We then formulate the PE problem as a zero-sum game and propose a safe robust RL framework to learn the min-max strategy online. A sufficient condition is provided under which the proposed algorithm remains effective even when constraints are conflicting. Simulation results demonstrate the effectiveness of the proposed safe robust RL framework.
Chinese: 本文提出了一种安全鲁棒的强化学习框架,通过将控制屏障函数与滑模控制结合在分层Stackelberg博弈结构中,解决了追逃问题中的安全性、稳定性和抗干扰鲁棒性。
English: This paper introduces a safe robust reinforcement learning framework that integrates control barrier functions with sliding mode control in a hierarchical Stackelberg game structure to address safety, stability, and disturbance robustness in pursuit-evasion problems.
Authors:Yiming Zhang, Chengzhang Yu, Zhuokai Zhao, Kun Wang, Qiankun Li, Zihan Chen, Yang Liu, Zenghui Ding, Yining Sun
Abstract:
The processing mechanisms underlying language and image understanding in large vision-language models (LVLMs) have been extensively studied. However, the internal reasoning mechanisms of LVLMs for spatiotemporal understanding remain poorly understood. In this work, we introduce a systematic, circuit-based framework designed to investigate how spatiotemporal visual semantics are represented and processed within these LVLMs. Specifically, our framework comprises three circuits: visual auditing circuit, semantic tracing circuit, and attention flow circuit. Through the lens of these circuits, we discover that visual semantics are highly localized to specific object tokens--removing these tokens can degrade model performance by up to 92.6%. Furthermore, we identify that interpretable concepts of objects and actions emerge and become progressively refined in the middle-to-late layers of LVLMs. In contrary to the current works that solely focus on objects in one image, we reveal that the middle-to-late layers of LVLMs exhibit specialized functional localization for spatiotemporal semantics. Our findings offer significant mechanistic insights into spatiotemporal semantics analysis of LVLMs, laying a foundation for designing more robust and interpretable models.
中文: 本研究提出基于电路的框架探究大型视觉语言模型如何处理时空语义,发现关键物体标记对性能至关重要,且可解释概念在中后层形成,为设计更稳健模型提供了机制性见解。
English: This study introduces a circuit-based framework to explore how large vision-language models process spatiotemporal semantics, revealing that key object tokens are crucial for performance and that interpretable concepts develop in middle-to-late layers, offering insights for more robust model design.
Authors:Chong Xia, Shengjun Zhang, Fangfu Liu, Chang Liu, Khodchaphun Hirunyaratsameewong, Yueqi Duan
Abstract:
Perpetual 3D scene generation aims to produce long-range and coherent 3D view sequences, which is applicable for long-term video synthesis and 3D scene reconstruction. Existing methods follow a "navigate-and-imagine" fashion and rely on outpainting for successive view expansion. However, the generated view sequences suffer from semantic drift issue derived from the accumulated deviation of the outpainting module. To tackle this challenge, we propose ScenePainter, a new framework for semantically consistent 3D scene generation, which aligns the outpainter's scene-specific prior with the comprehension of the current scene. To be specific, we introduce a hierarchical graph structure dubbed SceneConceptGraph to construct relations among multi-level scene concepts, which directs the outpainter for consistent novel views and can be dynamically refined to enhance diversity. Extensive experiments demonstrate that our framework overcomes the semantic drift issue and generates more consistent and immersive 3D view sequences. Project Page: https://xiac20.github.io/ScenePainter/.
中文: ScenePainter通过引入层次化场景概念图来对齐场景特定先验与当前场景理解,有效克服语义漂移问题,生成连贯且沉浸式的3D视图序列。
English: ScenePainter introduces a hierarchical SceneConceptGraph to align scene-specific priors with current scene comprehension, effectively overcoming semantic drift and generating consistent, immersive 3D view sequences.
Authors:Xiangwen Wang, Gaojie Jin, Xiaowei Huang, Ronghui Mu
Abstract:
Designing mutations to optimize protein thermostability remains challenging due to the complex relationship between sequence variations, structural dynamics, and thermostability, often assessed by δδG
(the change in free energy of unfolding). Existing methods rely on experimental random mutagenesis or prediction models tested with pre-defined datasets, using sequence-based heuristics and treating enzyme design as a one-step process without iterative refinement, which limits design space exploration and restricts discoveries beyond known variations. We present ThermoRL, a framework based on reinforcement learning (RL) that leverages graph neural networks (GNN) to design mutations with enhanced thermostability. It combines a pre-trained GNN-based encoder with a hierarchical Q-learning network and employs a surrogate model for reward feedback, guiding the RL agent on where (the position) and which (mutant amino acid) to apply for enhanced thermostability. Experimental results show that ThermoRL achieves higher or comparable rewards than baselines while maintaining computational efficiency. It filters out destabilizing mutations and identifies stabilizing mutations aligned with experimental data. Moreover, ThermoRL accurately detects key mutation sites in unseen proteins, highlighting its strong generalizability. This RL-guided approach powered by GNN embeddings offers a robust alternative to traditional protein mutation design.
中文: ThermoRL是一种基于强化学习和图神经网络的框架,能有效设计增强蛋白质热稳定性的突变,通过筛选不稳定突变并识别稳定突变,在实验中展现出优越性能和强泛化能力。
English: ThermoRL is a reinforcement learning framework using graph neural networks to design protein mutations that enhance thermostability by efficiently identifying stabilizing mutations and filtering destabilizing ones, demonstrating strong performance and generalizability in experiments.
Authors:Ngoc Luyen Le, Marie-Hélène Abel
Abstract:
Prerequisite skills - foundational competencies required before mastering more advanced concepts - are important for supporting effective learning, assessment, and skill-gap analysis. Traditionally curated by domain experts, these relationships are costly to maintain and difficult to scale. This paper investigates whether large language models (LLMs) can predict prerequisite skills in a zero-shot setting, using only natural language descriptions and without task-specific fine-tuning. We introduce ESCO-PrereqSkill, a benchmark dataset constructed from the ESCO taxonomy, comprising 3,196 skills and their expert-defined prerequisite links. Using a standardized prompting strategy, we evaluate 13 state-of-the-art LLMs, including GPT-4, Claude 3, Gemini, LLaMA 4, Qwen2, and DeepSeek, across semantic similarity, BERTScore, and inference latency. Our results show that models such as LLaMA4-Maverick, Claude-3-7-Sonnet, and Qwen2-72B generate predictions that closely align with expert ground truth, demonstrating strong semantic reasoning without supervision. These findings highlight the potential of LLMs to support scalable prerequisite skill modeling for applications in personalized learning, intelligent tutoring, and skill-based recommender systems.
中文摘要:本研究证明大型语言模型能在零样本设置下有效预测先决技能,其中LLaMA4-Maverick和Claude-3-7-Sonnet等模型的预测结果与专家判断高度吻合,展现了其在可扩展教育应用中的潜力。
English Summary: This study demonstrates that large language models can effectively predict prerequisite skills in a zero-shot setting, with models like LLaMA4-Maverick and Claude-3-7-Sonnet showing strong alignment with expert judgments, highlighting their potential for scalable educational applications.
Authors:Yuping He, Yifei Huang, Guo Chen, Baoqi Pei, Jilan Xu, Tong Lu, Jiangmiao Pang
Abstract:
Transferring and integrating knowledge across first-person (egocentric) and third-person (exocentric) viewpoints is intrinsic to human intelligence, enabling humans to learn from others and convey insights from their own experiences. Despite rapid progress in multimodal large language models (MLLMs), their ability to perform such cross-view reasoning remains unexplored. To address this, we introduce EgoExoBench, the first benchmark for egocentric-exocentric video understanding and reasoning. Built from publicly available datasets, EgoExoBench comprises over 7,300 question-answer pairs spanning eleven sub-tasks organized into three core challenges: semantic alignment, viewpoint association, and temporal reasoning. We evaluate 13 state-of-the-art MLLMs and find that while these models excel on single-view tasks, they struggle to align semantics across perspectives, accurately associate views, and infer temporal dynamics in the ego-exo context. We hope EgoExoBench can serve as a valuable resource for research on embodied agents and intelligent assistants seeking human-like cross-view intelligence.
中文: 本文提出了首个评估多模态大语言模型在第一人称与第三人称视角间跨视图推理能力的基准EgoExoBench,发现尽管模型在单视角任务表现出色,但在跨视角语义对齐、视图关联和时间推理方面仍存在显著不足。
English: This paper introduces EgoExoBench, the first benchmark for evaluating cross-view reasoning between egocentric and exocentric perspectives in multimodal large language models, revealing their limitations in semantic alignment, viewpoint association, and temporal reasoning despite strong single-view performance.
Authors:Haoran Gao, Yuanhe Zhang, Zhenhong Zhou, Lei Jiang, Fanyu Meng, Yujia Xiao, Kun Wang, Yang Liu, Junlan Feng
Abstract:
Resource Consumption Attacks (RCAs) have emerged as a significant threat to the deployment of Large Language Models (LLMs). With the integration of vision modalities, additional attack vectors exacerbate the risk of RCAs in large vision-language models (LVLMs). However, existing red-teaming studies have largely overlooked visual inputs as a potential attack surface, resulting in insufficient mitigation strategies against RCAs in LVLMs. To address this gap, we propose RECALLED (\textbf{RE}source \textbf{C}onsumption \textbf{A}ttack on \textbf{L}arge Vision-\textbf{L}anguag\textbf{E} Mo\textbf{D}els), the first approach for exploiting visual modalities to trigger unbounded RCAs red-teaming. First, we present \textit{Vision Guided Optimization}, a fine-grained pixel-level optimization, to obtain \textit{Output Recall} adversarial perturbations, which can induce repeating output. Then, we inject the perturbations into visual inputs, triggering unbounded generations to achieve the goal of RCAs. Additionally, we introduce \textit{Multi-Objective Parallel Losses} to generate universal attack templates and resolve optimization conflicts when intending to implement parallel attacks. Empirical results demonstrate that RECALLED increases service response latency by over 26 $\uparrow$, resulting in an additional 20\% increase in GPU utilization and memory consumption. Our study exposes security vulnerabilities in LVLMs and establishes a red-teaming framework that can facilitate future defense development against RCAs.
中文总结:RECITE是首个利用视觉输入触发大型视觉语言模型资源消耗攻击的红队测试方法,显著增加了系统延迟和资源消耗,同时揭示了模型的安全漏洞。
English Summary: RECITE is the first red-teaming approach that exploits visual inputs to trigger resource consumption attacks in large vision-language models, significantly increasing latency and resource usage while exposing security vulnerabilities.
Authors:Haoran Gao, Yuanhe Zhang, Zhenhong Zhou, Lei Jiang, Fanyu Meng, Yujia Xiao, Li Sun, Kun Wang, Yang Liu, Junlan Feng
Abstract:
Resource Consumption Attacks (RCAs) have emerged as a significant threat to the deployment of Large Language Models (LLMs). With the integration of vision modalities, additional attack vectors exacerbate the risk of RCAs in large vision-language models (LVLMs). However, existing red-teaming studies have mainly overlooked visual inputs as a potential attack surface, resulting in insufficient mitigation strategies against RCAs in LVLMs. To address this gap, we propose RECITE ($\textbf{Re}$source $\textbf{C}$onsumpt$\textbf{i}$on Red-$\textbf{Te}$aming for LVLMs), the first approach for exploiting visual modalities to trigger unbounded RCAs red-teaming. First, we present $\textit{Vision Guided Optimization}$, a fine-grained pixel-level optimization to obtain \textit{Output Recall Objective} adversarial perturbations, which can induce repeating output. Then, we inject the perturbations into visual inputs, triggering unbounded generations to achieve the goal of RCAs. Empirical results demonstrate that RECITE increases service response latency by over 26 $\uparrow$, resulting in an additional 20\% increase in GPU utilization and memory consumption. Our study reveals security vulnerabilities in LVLMs and establishes a red-teaming framework that can facilitate the development of future defenses against RCAs.
中文总结:RECITE是首个利用视觉输入触发大型视觉语言模型资源消耗攻击的红队测试方法,显著增加了系统延迟和资源消耗,同时揭示了模型的安全漏洞。
English Summary: RECITE is the first red-teaming approach that exploits visual inputs to trigger resource consumption attacks in large vision-language models, significantly increasing latency and resource usage while exposing security vulnerabilities.
Authors:Ramin Giahi, Kehui Yao, Sriram Kollipara, Kai Zhao, Vahid Mirjalili, Jianpeng Xu, Topojoy Biswas, Evren Korpeoglu, Kannan Achan
Abstract:
Multimodal learning plays a critical role in e-commerce recommendation platforms today, enabling accurate recommendations and product understanding. However, existing vision-language models, such as CLIP, face key challenges in e-commerce recommendation systems: 1) Weak object-level alignment, where global image embeddings fail to capture fine-grained product attributes, leading to suboptimal retrieval performance; 2) Ambiguous textual representations, where product descriptions often lack contextual clarity, affecting cross-modal matching; and 3) Domain mismatch, as generic vision-language models may not generalize well to e-commerce-specific data. To address these limitations, we propose a framework, VL-CLIP, that enhances CLIP embeddings by integrating Visual Grounding for fine-grained visual understanding and an LLM-based agent for generating enriched text embeddings. Visual Grounding refines image representations by localizing key products, while the LLM agent enhances textual features by disambiguating product descriptions. Our approach significantly improves retrieval accuracy, multimodal retrieval effectiveness, and recommendation quality across tens of millions of items on one of the largest e-commerce platforms in the U.S., increasing CTR by 18.6%, ATC by 15.5%, and GMV by 4.0%. Additional experimental results show that our framework outperforms vision-language models, including CLIP, FashionCLIP, and GCL, in both precision and semantic alignment, demonstrating the potential of combining object-aware visual grounding and LLM-enhanced text representation for robust multimodal recommendations.
Chinese: VL-CLIP框架通过视觉定位提升细粒度图像理解,并利用LLM智能体优化文本表征,显著提高了电子商务推荐系统的点击率和交易额,在多项指标上超越现有视觉语言模型。
English: The proposed VL-CLIP framework enhances CLIP by incorporating visual grounding for fine-grained image analysis and an LLM agent for enriched text embeddings, significantly boosting e-commerce recommendation metrics like CTR and GMV while outperforming existing models.
Authors:Xiaoyu Zhan, Xinyu Fu, Hao Sun, Yuanqi Li, Jie Guo, Yanwen Guo
Abstract:
The rapid advancement of large language models (LLMs) has enabled role-playing language agents to demonstrate significant potential in various applications. However, relying solely on prompts and contextual inputs often proves insufficient for achieving deep immersion in specific roles, particularly well-known fictional or public figures. On the other hand, fine-tuning-based approaches face limitations due to the challenges associated with data collection and the computational resources required for training, thereby restricting their broader applicability. To address these issues, we propose Test-Time-Matching (TTM), a training-free role-playing framework through test-time scaling and context engineering. TTM uses LLM agents to automatically decouple a character's features into personality, memory, and linguistic style. Our framework involves a structured, three-stage generation pipeline that utilizes these features for controlled role-playing. It achieves high-fidelity role-playing performance, also enables seamless combinations across diverse linguistic styles and even variations in personality and memory. We evaluate our framework through human assessment, and the results demonstrate that our method achieves the outstanding performance in generating expressive and stylistically consistent character dialogues.
中文摘要:提出的测试时匹配框架通过自动解构角色特征并采用结构化生成流程,无需训练即可实现高保真角色扮演,在生成表现力丰富且风格一致的角色对话方面表现卓越。
English Summary: The proposed Test-Time-Matching (TTM) framework enables high-fidelity role-playing by automatically decoupling character features and using a structured generation pipeline, achieving outstanding performance without requiring training.
Authors:Frederic Jonske, Constantin Seibold, Osman Alperen Koras, Fin Bahnsen, Marie Bauer, Amin Dada, Hamza Kalisch, Anton Schily, Jens Kleesiek
Abstract:
Reliable end-to-end clinical report generation has been a longstanding goal of medical ML research. The end goal for this process is to alleviate radiologists' workloads and provide second opinions to clinicians or patients. Thus, a necessary prerequisite for report generation models is a strong general performance and some type of innate grounding capability, to convince clinicians or patients of the veracity of the generated reports. In this paper, we present ASaRG (\textbf{A}utomatic \textbf{S}egmentation-\textbf{a}ssisted \textbf{R}eport \textbf{G}eneration), an extension of the popular LLaVA architecture that aims to tackle both of these problems. ASaRG proposes to fuse intermediate features and fine-grained segmentation maps created by specialist radiological models into LLaVA's multi-modal projection layer via simple concatenation. With a small number of added parameters, our approach achieves a +0.89\% performance gain ($p=0.012$) in CE F1 score compared to the LLaVA baseline when using only intermediate features, and +2.77\% performance gain ($p<0.001$) when adding a combination of intermediate features and fine-grained segmentation maps. Compared with COMG and ORID, two other report generation methods that utilize segmentations, the performance gain amounts to 6.98\% and 6.28\% in F1 score, respectively. ASaRG is not mutually exclusive with other changes made to the LLaVA architecture, potentially allowing our method to be combined with other advances in the field. Finally, the use of an arbitrary number of segmentations as part of the input demonstrably allows tracing elements of the report to the corresponding segmentation maps and verifying the groundedness of assessments. Our code will be made publicly available at a later date.
Chinese: ASaRG通过将放射学模型的中间特征和精细分割图融入LLaVA架构,在临床报告生成中实现了显著性能提升,同时确保评估结果具备可追溯的实证依据。
English: ASaRG enhances the LLaVA architecture by integrating intermediate features and fine-grained segmentation maps from radiological models, achieving significant performance gains in clinical report generation while enabling verifiable grounding of assessments.
Authors:Peng Chen, Hailiang Zhao, Jiaji Zhang, Xueyan Tang, Yixuan Wang, Shuiguang Deng
Abstract:
The online caching problem aims to minimize cache misses when serving a sequence of requests under a limited cache size. While naive learning-augmented caching algorithms achieve ideal $1$-consistency, they lack robustness guarantees. Existing robustification methods either sacrifice $1$-consistency or introduce significant computational overhead. In this paper, we introduce Guard, a lightweight robustification framework that enhances the robustness of a broad class of learning-augmented caching algorithms to $2H_k + 2$, while preserving their $1$-consistency. Guard achieves the current best-known trade-off between consistency and robustness, with only $O(1)$ additional per-request overhead, thereby maintaining the original time complexity of the base algorithm. Extensive experiments across multiple real-world datasets and prediction models validate the effectiveness of Guard in practice.
中文: Guard框架通过轻量级鲁棒化设计,在保持学习增强缓存算法1-一致性性能的同时,将其鲁棒性提升至2H_k + 2,仅需常数级额外计算开销,并通过多场景实验验证了其有效性。
English: The Guard framework enhances learning-augmented caching algorithms by achieving optimal 1-consistency and 2H_k + 2 robustness with minimal computational overhead, validated through extensive real-world experiments.
Authors:Zhili Zeng, Kimya Khakzad Shahandashti, Alvine Boaye Belle, Song Wang, Zhen Ming, Jiang
Abstract:
The rapid advancement of mobile applications has led to a significant demand for cross-platform compatibility, particularly between the Android and iOS platforms. Traditional approaches to mobile application translation often rely on manual intervention or rule-based systems, which are labor-intensive and time-consuming. While recent advancements in machine learning have introduced automated methods, they often lack contextual understanding and adaptability, resulting in suboptimal translations. Large Language Models (LLMs) were recently leveraged to enhance code translation at different granularities, including the method, class, and repository levels. Researchers have investigated common errors, limitations, and potential strategies to improve these tasks. However, LLM-based application translation across different platforms, such as migrating mobile applications between Android and iOS or adapting software across diverse frameworks, remains underexplored. Understanding the performance, strengths, and limitations of LLMs in cross-platform application translation is critical for advancing software engineering automation. This study aims to fill this gap by evaluating LLM-based agentic approaches for mobile application translation, identifying key failure points, and proposing guidelines to improve translation performance. We developed a chain of agents that account for dependencies, specifications, program structure, and program control flow when translating applications from Android to iOS. To evaluate the performance, we manually examined the translated code for syntactic correctness, semantic accuracy, and functional completeness. For translation failures, we further conducted a detailed root cause analysis to understand the underlying limitations of the agentic translation process and identify opportunities for improvement.
Chinese: 本研究评估了基于大语言模型的智能体方法在安卓与iOS移动应用翻译中的表现,通过句法、语义和功能分析识别关键失败点,并提出改进翻译性能的指导原则。
English: This study evaluates LLM-based agentic approaches for translating mobile applications between Android and iOS, identifying key failure points and proposing guidelines to enhance translation performance through syntactic, semantic, and functional analysis.
Authors:Junwen Wang, Oscar MacCormac, William Rochford, Aaron Kujawa, Jonathan Shapey, Tom Vercauteren
Abstract:
Rich and accurate medical image segmentation is poised to underpin the next generation of AI-defined clinical practice by delineating critical anatomy for pre-operative planning, guiding real-time intra-operative navigation, and supporting precise post-operative assessment. However, commonly used learning methods for medical and surgical imaging segmentation tasks penalise all errors equivalently and thus fail to exploit any inter-class semantics in the labels space. This becomes particularly problematic as the cardinality and richness of labels increases to include subtly different classes. In this work, we propose two tree-based semantic loss functions which take advantage of a hierarchical organisation of the labels. We further incorporate our losses in a recently proposed approach for training with sparse, background-free annotations to extend the applicability of our proposed losses. Extensive experiments are reported on two medical and surgical image segmentation tasks, namely head MRI for whole brain parcellation (WBP) with full supervision and neurosurgical hyperspectral imaging (HSI) for scene understanding with sparse annotations. Results demonstrate that our proposed method reaches state-of-the-art performance in both cases.
Chinese: 本研究提出两种基于树结构的语义损失函数,利用标签的层次结构提升医学图像分割精度,在全脑分区和神经外科场景理解任务中均达到了领先性能。
English: This study introduces two tree-based semantic loss functions that leverage hierarchical label structures to improve medical image segmentation, achieving state-of-the-art results in whole brain parcellation and neurosurgical imaging tasks.
Authors:Imran Ali Khan, Saif Khan Mohammed, Ronny Hadani, Ananthanarayanan Chockalingam, Robert Calderbank
Abstract:
Wireless users with different characteristics will be expected to share spectrum in next generation communication networks. One of the great strengths of wireless networks based on Orthogonal Frequency Division Multiplexing (OFDM) is the ease with which different non-overlapping time-frequency (TF) resources can be allocated to different users by simply shifting each user's signal in time and frequency. However, a significant weaknesses of OFDM is the inflexibility of sub-carrier spacing. Since OFDM does not allow users to have different sub-carrier spacing, a single user subject to inter-carrier interference causes carrier spacing to increase for all users. Zak-OTFS is an alternative delay-Doppler (DD) domain modulation scheme, where, in contrast to OFDM, the Input-Output (I/O) relation is predictable. We match the strength of OFDM by designing a novel DD domain method of shaping the transmitted Zak-OTFS pulse on the uplink that enables flexible non-overlapping TF resource allocation. The base station (BS) receives a superposition of uplink signals and applies individual matched filters to obtain the data specific to individual users. We develop theoretical measures of interference between users, and present numerical simulations for a vehicular channel model representative of next generation propagation environments. We demonstrate single-user performance in a multiuser Zak-OTFS uplink system without needing to provision guard bands between TF resources allocated to different users. These performance results demonstrate that the benefits of a predictable Zak-OTFS waveform can be realized within an architecture for uplink communication.
中文: Zak-OTFS通过设计新型时延-多普勒域脉冲成形,克服了OFDM子载波间隔僵化的问题,实现了无需保护频带的多用户上行链路资源分配,并通过仿真验证了其优越性能。
English: Zak-OTFS overcomes OFDM's inflexibility by enabling non-overlapping time-frequency resource allocation without guard bands, achieving predictable performance in multiuser uplink systems as demonstrated through simulations.
Authors:Sai Zhang, Zhenchang Xing, Jieshan Chen, Dehai Zhao, Zizhong Zhu, Xiaowang Zhang, Zhiyong Feng, Xiaohong Li
Abstract:
The vision of End-User Software Engineering (EUSE) is to empower non-professional users with full control over the software development lifecycle. It aims to enable users to drive generative software development using only natural language requirements. However, since end-users often lack knowledge of software engineering, their requirement descriptions are frequently ambiguous, raising significant challenges to generative software development. Although existing approaches utilize structured languages like Gherkin to clarify user narratives, they still struggle to express the causal logic between preconditions and behavior actions. This paper introduces RequireCEG, a requirement elicitation and self-review agent that embeds causal-effect graphs (CEGs) in a neuro-symbolic collaboration architecture. RequireCEG first uses a feature tree to analyze user narratives hierarchically, clearly defining the scope of software components and their system behavior requirements. Next, it constructs the self-healing CEGs based on the elicited requirements, capturing the causal relationships between atomic preconditions and behavioral actions. Finally, the constructed CEGs are used to review and optimize Gherkin scenarios, ensuring consistency between the generated Gherkin requirements and the system behavior requirements elicited from user narratives. To evaluate our method, we created the RGPair benchmark dataset and conducted extensive experiments. It achieves an 87% coverage rate and raises diversity by 51.88%.
中文: 本文提出的RequireCEG通过神经符号协作架构嵌入因果效应图,能有效澄清终端用户的模糊自然语言需求,在生成式软件开发中实现了87%的覆盖率和51.88%的多样性提升。
English: This paper introduces RequireCEG, a neuro-symbolic agent that embeds causal-effect graphs to clarify ambiguous natural language requirements from end-users, achieving 87% coverage and 51.88% diversity improvement in generative software development.
Authors:Jiaji Zhang, Ruichao Sun, Hailiang Zhao, Jiaju Wu, Peng Chen, Hao Li, Yuying Liu, Kingsum Chow, Gang Xiong, Shuiguang Deng
Abstract:
Diffusion models have demonstrated exceptional generative capabilities but are computationally intensive, posing significant challenges for deployment in resource-constrained or latency-sensitive environments. Quantization offers an effective means to reduce model size and computational cost, with post-training quantization (PTQ) being particularly appealing due to its compatibility with pre-trained models without requiring retraining or training data. However, existing PTQ methods for diffusion models often rely on architecture-specific heuristics that limit their generalizability and hinder integration with industrial deployment pipelines. To address these limitations, we propose SegQuant, a unified quantization framework that adaptively combines complementary techniques to enhance cross-model versatility. SegQuant consists of a segment-aware, graph-based quantization strategy (SegLinear) that captures structural semantics and spatial heterogeneity, along with a dual-scale quantization scheme (DualScale) that preserves polarity-asymmetric activations, which is crucial for maintaining visual fidelity in generated outputs. SegQuant is broadly applicable beyond Transformer-based diffusion models, achieving strong performance while ensuring seamless compatibility with mainstream deployment tools.
中文:SegQuant是一种统一的量化框架,通过结合分段感知量化和双尺度方案来提升扩散模型的效率,在无需架构特定限制的情况下确保优异性能与广泛兼容性。
English: SegQuant is a unified quantization framework that enhances diffusion models' efficiency by combining segment-aware quantization and dual-scale schemes, ensuring strong performance and broad compatibility without architecture-specific limitations.
Authors:Luyi Ma, Wanjia Zhang, Kai Zhao, Abhishek Kulkarni, Lalitesh Morishetti, Anjana Ganesh, Ashish Ranjan, Aashika Padmanabhan, Jianpeng Xu, Jason Cho, Praveen Kanumala, Kaushiki Nag, Sumit Dutta, Kamiya Motwani, Malay Patel, Evren Korpeoglu, Sushant Kumar, Kannan Achan
Abstract:
Generative models have recently demonstrated strong potential in multi-behavior recommendation systems, leveraging the expressive power of transformers and tokenization to generate personalized item sequences. However, their adoption is hindered by (1) the lack of explicit information for token reasoning, (2) high computational costs due to quadratic attention complexity and dense sequence representations after tokenization, and (3) limited multi-scale modeling over user history. In this work, we propose GRACE (Generative Recommendation via journey-aware sparse Attention on Chain-of-thought tokEnization), a novel generative framework for multi-behavior sequential recommendation. GRACE introduces a hybrid Chain-of-Thought (CoT) tokenization method that encodes user-item interactions with explicit attributes from product knowledge graphs (e.g., category, brand, price) over semantic tokenization, enabling interpretable and behavior-aligned generation. To address the inefficiency of standard attention, we design a Journey-Aware Sparse Attention (JSA) mechanism, which selectively attends to compressed, intra-, inter-, and current-context segments in the tokenized sequence. Experiments on two real-world datasets show that GRACE significantly outperforms state-of-the-art baselines, achieving up to +106.9% HR@10 and +106.7% NDCG@10 improvement over the state-of-the-art baseline on the Home domain, and +22.1% HR@10 on the Electronics domain. GRACE also reduces attention computation by up to 48% with long sequences.
中文: GRACE是一种创新的生成式推荐框架,通过混合思维链标记化整合显式属性,并采用旅程感知稀疏注意力机制提升效率,在多行为推荐中实现了显著性能提升和计算优化。
English: GRACE is a novel generative framework that enhances multi-behavior recommendation by integrating explicit attributes through hybrid Chain-of-Thought tokenization and improving efficiency with Journey-Aware Sparse Attention, achieving significant performance gains and computational savings.
Authors:Claudio Giusti, Luca Guarnera, Mirko Casu, Sebastiano Battiato
Abstract:
Detecting fraudulent credit card transactions remains a significant challenge, due to the extreme class imbalance in real-world data and the often subtle patterns that separate fraud from legitimate activity. Existing research commonly attempts to address this by generating synthetic samples for the minority class using approaches such as GANs, VAEs, or hybrid generative models. However, these techniques, particularly when applied only to minority-class data, tend to result in overconfident classifiers and poor latent cluster separation, ultimately limiting real-world detection performance. In this study, we propose the Causal Prototype Attention Classifier (CPAC), an interpretable architecture that promotes class-aware clustering and improved latent space structure through prototype-based attention mechanisms and we will couple it with the encoder in a VAE-GAN allowing it to offer a better cluster separation moving beyond post-hoc sample augmentation. We compared CPAC-augmented models to traditional oversamplers, such as SMOTE, as well as to state-of-the-art generative models, both with and without CPAC-based latent classifiers. Our results show that classifier-guided latent shaping with CPAC delivers superior performance, achieving an F1-score of 93.14\% percent and recall of 90.18\%, along with improved latent cluster separation. Further ablation studies and visualizations provide deeper insight into the benefits and limitations of classifier-driven representation learning for fraud detection. The codebase for this work will be available at final submission.
Chinese: 本研究提出因果原型注意力分类器(CPAC),通过改进潜在空间聚类结构,在欺诈检测中实现了93.14%的F1分数和90.18%的召回率,显著提升了检测性能。
English: This study introduces the Causal Prototype Attention Classifier (CPAC), an interpretable model that enhances fraud detection by improving latent cluster separation and achieving superior performance metrics, including a 93.14% F1-score and 90.18% recall.
Authors:Aryo Pradipta Gema, Alexander Hägele, Runjin Chen, Andy Arditi, Jacob Goldman-Wetzler, Kit Fraser-Taliente, Henry Sleight, Linda Petrini, Julian Michael, Beatrice Alex, Pasquale Minervini, Yanda Chen, Joe Benton, Ethan Perez
Abstract:
We construct evaluation tasks where extending the reasoning length of Large Reasoning Models (LRMs) deteriorates performance, exhibiting an inverse scaling relationship between test-time compute and accuracy. Our evaluation tasks span four categories: simple counting tasks with distractors, regression tasks with spurious features, deduction tasks with constraint tracking, and advanced AI risks. We identify five distinct failure modes when models reason for longer: 1) Claude models become increasingly distracted by irrelevant information; 2) OpenAI o-series models resist distractors but overfit to problem framings; 3) models shift from reasonable priors to spurious correlations; 4) all models show difficulties in maintaining focus on complex deductive tasks; and 5) extended reasoning may amplify concerning behaviors, with Claude Sonnet 4 showing increased expressions of self-preservation. These findings suggest that while test-time compute scaling remains promising for improving model capabilities, it may inadvertently reinforce problematic reasoning patterns. Our results demonstrate the importance of evaluating models across diverse reasoning lengths to identify and address these failure modes in LRMs.
中文摘要:研究发现,增加大型推理模型的推理长度反而会降低其性能,揭示了五种失效模式,包括模型分心、过拟合及问题行为加剧,强调了评估不同推理长度的重要性。
English Summary: The study reveals that increasing reasoning length in Large Reasoning Models can inversely affect performance, highlighting five failure modes where extended reasoning leads to distraction, overfitting, and amplified problematic behaviors.
Authors:Martin V. Vejling, Shashi Raj Pandey, Christophe A. N. Biscio, Petar Popovski
Abstract:
The amount of quality data in many machine learning tasks is limited to what is available locally to data owners. The set of quality data can be expanded through trading or sharing with external data agents. However, data buyers need quality guarantees before purchasing, as external data may be contaminated or irrelevant to their specific learning task. Previous works primarily rely on distributional assumptions about data from different agents, relegating quality checks to post-hoc steps involving costly data valuation procedures. We propose a distribution-free, contamination-aware data-sharing framework that identifies external data agents whose data is most valuable for model personalization. To achieve this, we introduce novel two-sample testing procedures, grounded in rigorous theoretical foundations for conformal outlier detection, to determine whether an agent's data exceeds a contamination threshold. The proposed tests, termed conformal data contamination tests, remain valid under arbitrary contamination levels while enabling false discovery rate control via the Benjamini-Hochberg procedure. Empirical evaluations across diverse collaborative learning scenarios demonstrate the robustness and effectiveness of our approach. Overall, the conformal data contamination test distinguishes itself as a generic procedure for aggregating data with statistically rigorous quality guarantees.
中文: 本文提出了一种无分布、污染感知的数据共享框架,通过新型的保形数据污染测试来识别对模型个性化最有价值的外部数据代理,在不依赖分布假设的前提下提供统计上的质量保证。
English: This paper introduces a distribution-free, contamination-aware data-sharing framework that uses novel conformal data contamination tests to identify valuable external data agents for model personalization, ensuring statistical quality guarantees without relying on distributional assumptions.
Authors:Nikita Koriagin, Yaroslav Aksenov, Daniil Laptev, Gleb Gerasimov, Nikita Balagansky, Daniil Gavrilov
Abstract:
Sparse Autoencoders have emerged as powerful tools for interpreting the internal representations of Large Language Models, yet they often fail to capture domain-specific features not prevalent in their training corpora. This paper introduces a residual learning approach that addresses this feature blindness without requiring complete retraining. We propose training a secondary SAE specifically to model the reconstruction error of a pretrained SAE on domain-specific texts, effectively capturing features missed by the primary model. By summing the outputs of both models during inference, we demonstrate significant improvements in both LLM cross-entropy and explained variance metrics across multiple specialized domains. Our experiments show that this method efficiently incorporates new domain knowledge into existing SAEs while maintaining their performance on general tasks. This approach enables researchers to selectively enhance SAE interpretability for specific domains of interest, opening new possibilities for targeted mechanistic interpretability of LLMs.
中文: 本文提出一种残差学习方法,通过训练辅助稀疏自编码器来捕捉主模型遗漏的领域特征,无需完整重训练即可显著提升专业领域解释性和性能。
English: This paper introduces a residual learning method that trains a secondary sparse autoencoder to capture domain-specific features missed by a primary model, significantly improving interpretability and performance in specialized domains without full retraining.
Authors:Yafei Zhang, Lingqi Kong, Huafeng Li, Jie Wen
Abstract:
To reduce the reliance of visible-infrared person re-identification (ReID) models on labeled cross-modal samples, this paper explores a weakly supervised cross-modal person ReID method that uses only single-modal sample identity labels, addressing scenarios where cross-modal identity labels are unavailable. To mitigate the impact of missing cross-modal labels on model performance, we propose a heterogeneous expert collaborative consistency learning framework, designed to establish robust cross-modal identity correspondences in a weakly supervised manner. This framework leverages labeled data from each modality to independently train dedicated classification experts. To associate cross-modal samples, these classification experts act as heterogeneous predictors, predicting the identities of samples from the other modality. To improve prediction accuracy, we design a cross-modal relationship fusion mechanism that effectively integrates predictions from different experts. Under the implicit supervision provided by cross-modal identity correspondences, collaborative and consistent learning among the experts is encouraged, significantly enhancing the model's ability to extract modality-invariant features and improve cross-modal identity recognition. Experimental results on two challenging datasets validate the effectiveness of the proposed method.
中文摘要:本文提出一种弱监督跨模态行人重识别方法,仅使用单模态标签,通过异构专家协同学习框架建立跨模态身份对应关系,有效提升模型对跨模态身份识别的能力。
English Summary: This paper introduces a weakly supervised cross-modal person re-identification method that uses only single-modal labels, proposing a heterogeneous expert collaborative learning framework to establish robust cross-modal correspondences and enhance modality-invariant feature extraction.
Authors:Yiqi Tian, Pengfei Jin, Mingze Yuan, Na Li, Bo Zeng, Quanzheng Li
Abstract:
Diffusion models have achieved state-of-the-art performance in generative modeling, yet their sampling procedures remain vulnerable to hallucinations, often stemming from inaccuracies in score approximation. In this work, we reinterpret diffusion sampling through the lens of optimization and introduce RODS (Robust Optimization-inspired Diffusion Sampler), a novel method that detects and corrects high-risk sampling steps using geometric cues from the loss landscape. RODS enforces smoother sampling trajectories and adaptively adjusts perturbations, reducing hallucinations without retraining and at minimal additional inference cost. Experiments on AFHQv2, FFHQ, and 11k-hands demonstrate that RODS improves both sampling fidelity and robustness, detecting over 70% of hallucinated samples and correcting more than 25%, all while avoiding the introduction of new artifacts.
中文: RODS是一种新颖的扩散采样器,通过几何优化线索检测并修正幻觉现象,无需重新训练即可提升采样保真度和鲁棒性,且计算开销极小。
English: RODS is a novel diffusion sampler that detects and corrects hallucinations using geometric optimization cues, enhancing sampling fidelity and robustness without retraining or significant computational overhead.
Authors:Mst Shamima Aktar, Peng Liang, Muhammad Waseem, Amjed Tahir, Mojtaba Shahin, Muhammad Azeem Akbar, Arif Ali Khan, Aakash Ahmad, Musengamana Jean de Dieu, Ruiyin Li
Abstract:
Quantum software represents disruptive technologies in terms of quantum-specific software systems, services, and applications - leverage the principles of quantum mechanics via programmable quantum bits (Qubits) that manipulate quantum gates (QuGates) - to achieve quantum supremacy in computing. Quantum software architecture enables quantum software developers to abstract away implementation-specific details (i.e., mapping of Qubits and QuGates to high-level architectural components and connectors). Architectural patterns and strategies can provide reusable knowledge and best practices to engineer quantum software systems effectively and efficiently. However, quantum software practitioners face significant challenges in selecting and implementing appropriate patterns and strategies due to the complexity of quantum software systems and the lack of guidelines. To address these challenges, this study proposes decision models for selecting patterns and strategies in six critical design areas in quantum software systems: Communication, Decomposition, Data Processing, Fault Tolerance, Integration and Optimization, and Algorithm Implementation. These decision models are constructed based on data collected from both a mining study (i.e., GitHub and Stack Exchange) and a Systematic Literature Review, which were used to identify relevant patterns and strategies with their involved Quality Attributes (QAs). We then conducted semi-structured interviews with 16 quantum software practitioners to evaluate the familiarity, understandability, completeness, and usefulness of the proposed decision models. The results show that the proposed decision models can aid practitioners in selecting suitable patterns and strategies to address the challenges related to the architecture design of quantum software systems. The dataset is available at [6], allowing the community to reproduce and build upon our findings.
中文: 本研究提出决策模型,帮助量子软件从业者在六个关键设计领域选择架构模式与策略,通过访谈验证其能有效应对量子软件复杂性挑战。
English: This study introduces decision models to help quantum software practitioners select architectural patterns and strategies across six key design areas, addressing challenges in quantum software complexity and validated through interviews showing their effectiveness.
Authors:Sunghyun Park, Jungsoo Lee, Shubhankar Borse, Munawar Hayat, Sungha Choi, Kyuwoong Hwang, Fatih Porikli
Abstract:
While open-vocabulary semantic segmentation (OVSS) can segment an image into semantic regions based on arbitrarily given text descriptions even for classes unseen during training, it fails to understand personal texts (e.g., `my mug cup') for segmenting regions of specific interest to users. This paper addresses challenges like recognizing `my mug cup' among `multiple mug cups'. To overcome this challenge, we introduce a novel task termed \textit{personalized open-vocabulary semantic segmentation} and propose a text prompt tuning-based plug-in method designed to recognize personal visual concepts using a few pairs of images and masks, while maintaining the performance of the original OVSS. Based on the observation that reducing false predictions is essential when applying text prompt tuning to this task, our proposed method employs `negative mask proposal' that captures visual concepts other than the personalized concept. We further improve the performance by enriching the representation of text prompts by injecting visual embeddings of the personal concept into them. This approach enhances personalized OVSS without compromising the original OVSS performance. We demonstrate the superiority of our method on our newly established benchmarks for this task, including FSS$^\text{per}$, CUB$^\text{per}$, and ADE$^\text{per}$.
中文: 本文提出了个性化开放词汇语义分割这一新任务,通过文本提示调优和负掩码提议方法,能在不损害原有模型性能的前提下,精准分割用户专属物品(如"我的马克杯")。
English: This paper introduces personalized open-vocabulary semantic segmentation, a novel task that uses text prompt tuning and negative mask proposals to accurately segment user-specific items like "my mug cup" without compromising the original model's performance on general classes.
Authors:Ashutosh Mishra, Shreya Santra, Hazal Gozbasi, Kentaro Uno, Kazuya Yoshida
Abstract:
This study presents an advanced approach to enhance robotic manipulation in uncertain and challenging environments, with a focus on autonomous operations augmented by human-in-the-loop (HITL) control for lunar missions. By integrating human decision-making with autonomous robotic functions, the research improves task reliability and efficiency for space applications. The key task addressed is the autonomous deployment of flexible solar panels using an extendable ladder-like structure and a robotic manipulator with real-time feedback for precision. The manipulator relays position and force-torque data, enabling dynamic error detection and adaptive control during deployment. To mitigate the effects of sinkage, variable payload, and low-lighting conditions, efficient motion planning strategies are employed, supplemented by human control that allows operators to intervene in ambiguous scenarios. Digital twin simulation enhances system robustness by enabling continuous feedback, iterative task refinement, and seamless integration with the deployment pipeline. The system has been tested to validate its performance in simulated lunar conditions and ensure reliability in extreme lighting, variable terrain, changing payloads, and sensor limitations.
中文: 本研究提出一种人机协同增强的机器人系统,用于在月球上自主部署柔性太阳能板,通过实时反馈和数字孪生仿真技术,确保在极端月球环境下的精准操作与系统可靠性。
English: This research introduces a human-in-the-loop enhanced robotic system for autonomous deployment of flexible solar panels on the Moon, utilizing real-time feedback and digital twin simulations to ensure precision and reliability in challenging lunar conditions.
Authors:Bo Jiang, Xueyang Ze, Beibei Wang, Xixi Wang, Xixi Wan, Bin Luo
Abstract:
Textual adapter-based tuning methods have shown significant potential in transferring knowledge from pre-trained Vision-Language Models (VLMs) to downstream tasks. Existing works generally employ the deterministic textual feature adapter to refine each category textual representation. However, due to inherent factors such as different attributes and contexts, there exists significant diversity in textual descriptions for each category. Such description diversity offers rich discriminative semantic knowledge that can benefit downstream visual learning tasks. Obviously, traditional deterministic adapter model cannot adequately capture this varied semantic information. Also, it is desirable to exploit the inter-class relationships in VLM adapter. To address these issues, we propose to exploit random graph model into VLM adapter and develop a novel Vertex Random Graph Adapter (VRGAdapter). VRGAdapter first models the inherent diverse descriptions of each category and inter-class relationships of different categories simultaneously by leveraging a Vertex Random Knowledge Graph (VRKG) model. Then, it employs probabilistic message propagation on VRKG to learn context-aware distribution representation for each class node. Finally, it adopts a reparameterized sampling function to achieve textual adapter learning. Note that, VRGAdapter provides a more general adapter solution that encompasses traditional graph-based adapter as a special case. In addition, to enable more robust performance for downstream tasks, we also introduce a new Uncertainty-guided Multi-branch Fusion (UMF) scheme that dynamically integrates multiple pre-trained models for ensemble prediction. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness of our approach.
中文摘要:提出的VRGAdapter方法通过顶点随机图建模多样化的文本描述和类间关系,改进了视觉语言模型的适配能力,同时采用不确定性引导的多分支融合策略动态集成多个预训练模型,从而提升下游任务的鲁棒性。
English Summary: The proposed VRGAdapter method enhances vision-language model adaptation by modeling diverse textual descriptions and inter-class relationships using a vertex random graph, while an uncertainty-guided fusion scheme dynamically combines multiple models for robust downstream task performance.
Authors:Saif Khan Mohammed, Amit Kumar Pathak, Muhammad Ubadah, Ronny Hadani, Ananthanarayanan Chockalingam, Robert Calderbank
Abstract:
In Zak-OTFS (orthogonal time frequency space) modulation the carrier waveform is a pulse in the delay-Doppler (DD) domain, formally a quasi-periodic localized function with specific periods along delay and Doppler. When the channel delay spread is less than the delay period, and the channel Doppler spread is less than the Doppler period, the response to a single Zak-OTFS carrier provides an image of the scattering environment and can be used to predict the effective channel at all other carriers. The image of the scattering environment changes slowly, making it possible to employ precoding at the transmitter. Precoding techniques were developed more than thirty years ago for wireline modem channels (V.34 standard) defined by linear convolution where a pulse in the time domain (TD) is used to probe the one-dimensional partial response channel. The action of a doubly spread channel on Zak-OTFS modulation determines a two-dimensional partial response channel defined by twisted convolution, and we develop a novel precoding technique for this channel. The proposed precoder leads to separate equalization of each DD carrier which has significantly lower complexity than joint equalization of all carriers. Further, the effective precoded channel results in non-interfering DD carriers which significantly reduces the overhead of guard carriers separating data and pilot carriers, which improves the spectral efficiency significantly.
中文摘要:Zak-OTFS调制利用时延-多普勒域脉冲对散射环境进行成像,通过新型预编码技术实现载波间无干扰传输,大幅降低了均衡复杂度并显著提升频谱效率。
English Summary: Zak-OTFS modulation uses delay-Doppler domain pulses to image scattering environments and enables low-complexity precoding that eliminates interference between carriers, significantly improving spectral efficiency.
Authors:Jintao Qu, Zichong Wang, Chenhao Wu, Wenbin Zhang
Abstract:
Neural networks have achieved remarkable success in time series classification, but their reliance on large amounts of labeled data for training limits their applicability in cold-start scenarios. Moreover, they lack interpretability, reducing transparency in decision-making. In contrast, dynamic time warping (DTW) combined with a nearest neighbor classifier is widely used for its effectiveness in limited-data settings and its inherent interpretability. However, as a non-parametric method, it is not trainable and cannot leverage large amounts of labeled data, making it less effective than neural networks in rich-resource scenarios. In this work, we aim to develop a versatile model that adapts to cold-start conditions and becomes trainable with labeled data, while maintaining interpretability. We propose a dynamic length-shortening algorithm that transforms time series into prototypes while preserving key structural patterns, thereby enabling the reformulation of the DTW recurrence relation into an equivalent recurrent neural network. Based on this, we construct a trainable model that mimics DTW's alignment behavior. As a neural network, it becomes trainable when sufficient labeled data is available, while still retaining DTW's inherent interpretability. We apply the model to several benchmark time series classification tasks and observe that it significantly outperforms previous approaches in low-resource settings and remains competitive in rich-resource settings.
中文: 本研究提出了一种可训练的神经网络模型,模拟动态时间规整进行时间序列分类,在低资源场景下表现优异且保持可解释性,在数据充足时仍具竞争力。
English: This work introduces a trainable neural network model that emulates dynamic time warping for time series classification, achieving superior performance in low-resource scenarios while maintaining interpretability and remaining competitive with abundant data.
Authors:Yongqian Sun, Weihua Kuang, Chao Shen, Xidao Wen, Tinghua Zheng, Heng Liu, Shenglin Zhang, Bo Wu, Dan Pei
Abstract:
In modern online services, frequent software changes introduce significant risks. To tackle this challenge, we propose SCELM (Software Change Evaluation and Lifecycle Management), an end-to-end automated framework for software change management. SCELM aims to manage software changes efficiently and precisely, significantly reducing service failures and economic losses.
中文: SCELM是一个自动化框架,旨在高效管理软件变更,显著减少现代在线服务中的故障和经济损失。
English: SCELM is an automated framework designed to manage software changes effectively, reducing service failures and economic losses in modern online services.
Authors:Yanan Cao, Omid Memarrast, Shiqin Cai, Sinduja Subramaniam, Evren Korpeoglu, Kannan Achan
Abstract:
In grocery e-commerce, customers often build ingredient baskets guided by dietary preferences but lack the expertise to create complete meals. Leveraging recipe knowledge to recommend complementary ingredients based on a partial basket is essential for improving the culinary experience. Traditional recipe completion methods typically predict a single missing ingredient using a leave-one-out strategy. However, they fall short in two key aspects: (i) they do not reflect real-world scenarios where multiple ingredients are often needed, and (ii) they overlook relationships among the missing ingredients themselves. To address these limitations, we reformulate basket completion as a set-to-set (S2S) recommendation problem, where an incomplete basket is input into a system that predicts a set of complementary ingredients. We introduce S2SRec2, a set-to-set ingredient recommendation framework based on a Set Transformer and trained in a multitask learning paradigm. S2SRec2 jointly learns to (i) retrieve missing ingredients from the representation of existing ones and (ii) assess basket completeness after prediction. These tasks are optimized together, enforcing accurate retrieval and coherent basket completion. Experiments on large-scale recipe datasets and qualitative analyses show that S2SRec2 significantly outperforms single-target baselines, offering a promising approach to enhance grocery shopping and inspire culinary creativity.
Chinese: 该研究提出了S2SRec2框架,通过同时预测多种互补食材并评估购物篮完整性,解决了传统方法只能预测单一食材的局限,显著提升了生鲜电商的用户体验。
English: The study introduces S2SRec2, a set-to-set ingredient recommendation framework that addresses the limitations of traditional methods by predicting multiple complementary ingredients simultaneously and assessing basket completeness, significantly improving grocery e-commerce experiences.
Authors:Max Belitsky, Dawid J. Kopiczko, Michael Dorkenwald, M. Jehanzeb Mirza, Cees G. M. Snoek, Yuki M. Asano
Abstract:
We propose cache steering, a lightweight method for implicit steering of language models via a one-shot intervention applied directly to the key-value cache. To validate its effectiveness, we apply cache steering to induce chain-of-thought reasoning in small language models. Our approach leverages GPT-4o-generated reasoning traces to construct steering vectors that shift model behavior toward more explicit, multi-step reasoning without fine-tuning or prompt modifications. Experimental evaluations on diverse reasoning benchmarks demonstrate that cache steering improves both the qualitative structure of model reasoning and quantitative task performance. Compared to prior activation steering techniques that require continuous interventions, our one-shot cache steering offers substantial advantages in terms of hyperparameter stability, inference-time efficiency, and ease of integration, making it a more robust and practical solution for controlled generation.
中文: 缓存导向是一种轻量级单次干预方法,通过修改语言模型的键值缓存来引导思维链推理,无需微调或调整提示即可提升推理质量与任务性能。
English: Cache steering is a lightweight, one-shot method that modifies language models' key-value cache to induce chain-of-thought reasoning, improving both reasoning quality and task performance without fine-tuning or prompt changes.
Authors:Max Belitsky, Dawid J. Kopiczko, Michael Dorkenwald, M. Jehanzeb Mirza, James R. Glass, Cees G. M. Snoek, Yuki M. Asano
Abstract:
We propose cache steering, a lightweight method for implicit steering of language models via a one-shot intervention applied directly to the key-value cache. To validate its effectiveness, we apply cache steering to induce chain-of-thought reasoning in small language models. Our approach constructs steering vectors from reasoning traces, obtained either from teacher models (e.g., GPT-4o) or existing human annotations, that shift model behavior toward more explicit, multi-step reasoning without fine-tuning or prompt modifications. Experimental evaluations on diverse reasoning benchmarks demonstrate that cache steering improves both the qualitative structure of model reasoning and quantitative task performance. Additional experiments show that the method also scales to larger models and yields further gains on challenging datasets such as GPQA and MATH. Compared to prior activation steering techniques that require continuous interventions, our one-shot cache steering offers substantial advantages in terms of inference latency, hyperparameter stability, and ease of integration with existing inference APIs. Beyond mere reasoning induction, we show that cache steering enables controllable transfer of reasoning styles (e.g., stepwise, causal, analogical), making it a practical tool for behavior-level guidance of language models.
中文: 缓存导向是一种轻量级单次干预方法,通过修改语言模型的键值缓存来引导思维链推理,无需微调或调整提示即可提升推理质量与任务性能。
English: Cache steering is a lightweight, one-shot method that modifies language models' key-value cache to induce chain-of-thought reasoning, improving both reasoning quality and task performance without fine-tuning or prompt changes.
Authors:Wenbo Cui, Chengyang Zhao, Yuhui Chen, Haoran Li, Zhizheng Zhang, Dongbin Zhao, He Wang
Abstract:
Building a robust perception module is crucial for visuomotor policy learning. While recent methods incorporate pre-trained 2D foundation models into robotic perception modules to leverage their strong semantic understanding, they struggle to capture 3D spatial information and generalize across diverse camera viewpoints. These limitations hinder the policy's effectiveness, especially in fine-grained robotic manipulation scenarios. To address these challenges, we propose CL3R, a novel 3D pre-training framework designed to enhance robotic manipulation policies. Our method integrates both spatial awareness and semantic understanding by employing a point cloud Masked Autoencoder to learn rich 3D representations while leveraging pre-trained 2D foundation models through contrastive learning for efficient semantic knowledge transfer. Additionally, we propose a 3D visual representation pre-training framework for robotic tasks. By unifying coordinate systems across datasets and introducing random fusion of multi-view point clouds, we mitigate camera view ambiguity and improve generalization, enabling robust perception from novel viewpoints at test time. Extensive experiments in both simulation and the real world demonstrate the superiority of our method, highlighting its effectiveness in visuomotor policy learning for robotic manipulation.
中文: CL3R是一种新颖的3D预训练框架,通过点云掩码自编码器学习空间信息并结合对比学习迁移语义知识,有效提升了机器人操作策略在不同视角下的泛化能力。
English: CL3R is a novel 3D pre-training framework that enhances robotic manipulation policies by combining spatial awareness from point cloud encoding with semantic understanding through contrastive learning, improving generalization across camera viewpoints.
Authors:Md Sadman Sakib Rahman, Yuhang Li, Xilin Yang, Shiqi Chen, Aydogan Ozcan
Abstract:
Nonlinear computation is essential for a wide range of information processing tasks, yet implementing nonlinear functions using optical systems remains a challenge due to the weak and power-intensive nature of optical nonlinearities. Overcoming this limitation without relying on nonlinear optical materials could unlock unprecedented opportunities for ultrafast and parallel optical computing systems. Here, we demonstrate that large-scale nonlinear computation can be performed using linear optics through optimized diffractive processors composed of passive phase-only surfaces. In this framework, the input variables of nonlinear functions are encoded into the phase of an optical wavefront, e.g., via a spatial light modulator (SLM), and transformed by an optimized diffractive structure with spatially varying point-spread functions to yield output intensities that approximate a large set of unique nonlinear functions, all in parallel. We provide proof establishing that this architecture serves as a universal function approximator for an arbitrary set of bandlimited nonlinear functions, also covering multi-variate and complex-valued functions. We also numerically demonstrate the parallel computation of one million distinct nonlinear functions, accurately executed at wavelength-scale spatial density at the output of a diffractive optical processor. Furthermore, we experimentally validated this framework using in situ optical learning and approximated 35 unique nonlinear functions in a single shot using a compact setup consisting of an SLM and an image sensor. These results establish diffractive optical processors as a scalable platform for massively parallel universal nonlinear function approximation, paving the way for new capabilities in analog optical computing based on linear materials.
中文摘要:本研究通过采用被动纯相位表面的优化衍射处理器,实现了基于线性光学的大规模非线性计算,无需非线性光学材料即可并行近似多种非线性函数。
English Summary: This study demonstrates that large-scale nonlinear computation can be achieved using linear optics through optimized diffractive processors with passive phase-only surfaces, enabling parallel approximation of numerous nonlinear functions without relying on nonlinear optical materials.
Authors:Henry J. Xie, Jinghan Zhang, Xinhao Zhang, Kunpeng Liu
Abstract:
The distillation of knowledge from Large Language Models (LLMs) into Smaller Language Models (SLMs), preserving the capabilities and performance of LLMs while reducing model size, has played a key role in the proliferation of LLMs. Because SLMs are considerably smaller than LLMs, they are often utilized in domains where human interaction is frequent but resources are highly constrained, e.g., smart phones. Therefore, it is crucial to ensure that empathy, a fundamental aspect of positive human interactions, already instilled into LLMs, is retained by SLMs after distillation. In this paper, we develop a comprehensive approach for effective empathy distillation from LLMs into SLMs. Our approach features a two-step fine-tuning process that fully leverages datasets of empathetic dialogue responses distilled from LLMs. We explore several distillation methods beyond basic direct prompting and propose four unique sets of prompts for targeted empathy improvement to significantly enhance the empathy distillation process. Our evaluations demonstrate that SLMs fine-tuned through the two-step fine-tuning process with distillation datasets enhanced by the targeted empathy improvement prompts significantly outperform the base SLM at generating empathetic responses with a win rate of 90%. Our targeted empathy improvement prompts substantially outperform the basic direct prompting with a 10% improvement in win rate.
中文摘要:本文提出了一种采用针对性共情提示的两步微调方法,能有效将大型语言模型的共情能力蒸馏到小型语言模型中,在生成共情回复方面实现了90%的胜率。
English Summary: This paper introduces a two-step fine-tuning method using targeted empathy prompts to effectively distill empathy from Large Language Models into Smaller Language Models, achieving a 90% win rate in generating empathetic responses.
Authors:Michael Galarnyk, Veer Kejriwal, Agam Shah, Yash Bhardwaj, Nicholas Meyer, Anand Krishnan, Sudheer Chava
Abstract:
Social media has amplified the reach of financial influencers known as "finfluencers," who share stock recommendations on platforms like YouTube. Understanding their influence requires analyzing multimodal signals like tone, delivery style, and facial expressions, which extend beyond text-based financial analysis. We introduce VideoConviction, a multimodal dataset with 6,000+ expert annotations, produced through 457 hours of human effort, to benchmark multimodal large language models (MLLMs) and text-based large language models (LLMs) in financial discourse. Our results show that while multimodal inputs improve stock ticker extraction (e.g., extracting Apple's ticker AAPL), both MLLMs and LLMs struggle to distinguish investment actions and conviction--the strength of belief conveyed through confident delivery and detailed reasoning--often misclassifying general commentary as definitive recommendations. While high-conviction recommendations perform better than low-conviction ones, they still underperform the popular S\&P 500 index fund. An inverse strategy--betting against finfluencer recommendations--outperforms the S\&P 500 by 6.8\% in annual returns but carries greater risk (Sharpe ratio of 0.41 vs. 0.65). Our benchmark enables a diverse evaluation of multimodal tasks, comparing model performance on both full video and segmented video inputs. This enables deeper advancements in multimodal financial research. Our code, dataset, and evaluation leaderboard are available under the CC BY-NC 4.0 license.
中文: VideoConviction数据集通过大量人工标注评估多模态和文本大语言模型在金融讨论中的表现,发现多模态输入虽有助于提取股票信息,但模型难以判断投资信心和行动,且逆向策略虽年收益超越标普500指数但风险更高。
English: The VideoConviction dataset, created with extensive human annotations, evaluates multimodal and text-based language models in financial discourse, revealing that while multimodal inputs aid in extracting stock information, models struggle to assess conviction and investment actions, with an inverse strategy outperforming the S&P 500 but at higher risk.
Authors:Xiao Yang, Jiyao Wang, Yuxuan Fan, Can Liu, Houcheng Su, Weichen Guo, Zitong Yu, Dengbo He, Kaishun Wu
Abstract:
Remote physiological measurement (RPM) has emerged as a promising non-invasive method for monitoring physiological signals using the non-contact device. Although various domain adaptation and generalization methods were proposed to promote the adaptability of deep-based RPM models in unseen deployment environments, considerations in aspects such as privacy concerns and real-time adaptation restrict their application in real-world deployment. Thus, we aim to propose a novel fully Test-Time Adaptation (TTA) strategy tailored for RPM tasks in this work. Specifically, based on prior knowledge in physiology and our observations, we noticed not only there is spatio-temporal consistency in the frequency domain of BVP signals, but also that inconsistency in the time domain was significant. Given this, by leveraging both consistency and inconsistency priors, we introduce an innovative expert knowledge-based self-supervised \textbf{C}onsistency-\textbf{i}n\textbf{C}onsistency-\textbf{i}ntegration (\textbf{CiCi}) framework to enhances model adaptation during inference. Besides, our approach further incorporates a gradient dynamic control mechanism to mitigate potential conflicts between priors, ensuring stable adaptation across instances. Through extensive experiments on five diverse datasets under the TTA protocol, our method consistently outperforms existing techniques, presenting state-of-the-art performance in real-time self-supervised adaptation without accessing source data. The code will be released later.
中文: 本文提出了一种名为CiCi框架的全新全测试时自适应策略,利用生理信号中的一致性和不一致性先验知识,在推理过程中增强远程生理测量模型的适应性,在不访问源数据的情况下实现了最先进的性能。
English: This paper introduces a novel fully test-time adaptation strategy called the CiCi framework, which leverages both consistency and inconsistency priors in physiological signals to enhance remote physiological measurement models during inference, achieving state-of-the-art performance without accessing source data.
Authors:Jiayi Wu, Tianfu Wang, Md Abu Bakr Siddique, Md Jahidul Islam, Cornelia Fermuller, Yiannis Aloimonos, Christopher A. Metzler
Abstract:
Underwater image restoration algorithms seek to restore the color, contrast, and appearance of a scene that is imaged underwater. They are a critical tool in applications ranging from marine ecology and aquaculture to underwater construction and archaeology. While existing pixel-domain diffusion-based image restoration approaches are effective at restoring simple scenes with limited depth variation, they are computationally intensive and often generate unrealistic artifacts when applied to scenes with complex geometry and significant depth variation. In this work we overcome these limitations by combining a novel network architecture (SLURPP) with an accurate synthetic data generation pipeline. SLURPP combines pretrained latent diffusion models -- which encode strong priors on the geometry and depth of scenes -- with an explicit scene decomposition -- which allows one to model and account for the effects of light attenuation and backscattering. To train SLURPP we design a physics-based underwater image synthesis pipeline that applies varied and realistic underwater degradation effects to existing terrestrial image datasets. This approach enables the generation of diverse training data with dense medium/degradation annotations. We evaluate our method extensively on both synthetic and real-world benchmarks and demonstrate state-of-the-art performance. Notably, SLURPP is over 200X faster than existing diffusion-based methods while offering ~ 3 dB improvement in PSNR on synthetic benchmarks. It also offers compelling qualitative improvements on real-world data. Project website https://tianfwang.github.io/slurpp/.
中文: 本文提出SLURPP新型水下图像复原方法,通过融合潜在扩散模型与显式场景分解技术,有效解决复杂水下退化问题,在速度和质量上均显著超越现有方法。
English: This paper introduces SLURPP, a novel underwater image restoration method that combines latent diffusion models with explicit scene decomposition to effectively address complex underwater degradation, achieving significant speed and quality improvements over existing approaches.
Authors:Charlie Budd, Silvère Ségaud, Matthew Elliot, Graeme Stasiuk, Yijing Xie, Jonathan Shapey, Tom Vercauteren
Abstract:
Integration of hyperspectral imaging into fluorescence-guided neurosurgery has the potential to improve surgical decision making by providing quantitative fluorescence measurements in real-time. Quantitative fluorescence requires paired spectral data in fluorescence (blue light) and reflectance (white light) mode. Blue and white image acquisition needs to be performed sequentially in a potentially dynamic surgical environment. A key component to the fluorescence quantification process is therefore the ability to find dense cross-modal image correspondences between two hyperspectral images taken under these drastically different lighting conditions. We address this challenge with the introduction of X-RAFT, a Recurrent All-Pairs Field Transforms (RAFT) optical flow model modified for cross-modal inputs. We propose using distinct image encoders for each modality pair, and fine-tune these in a self-supervised manner using flow-cycle-consistency on our neurosurgical hyperspectral data. We show an error reduction of 36.6% across our evaluation metrics when comparing to a naive baseline and 27.83% reduction compared to an existing cross-modal optical flow method (CrossRAFT). Our code and models will be made publicly available after the review process.
中文摘要:本研究提出X-RAFT改进型光流模型,通过自监督训练实现不同光照条件下高光谱图像的跨模态精准配准,在神经外科荧光定量分析中误差降低超36%。
English Summary: The study introduces X-RAFT, an enhanced optical flow model that enables precise cross-modal image alignment between hyperspectral images captured under different lighting conditions, achieving over 36% error reduction for quantitative fluorescence in neurosurgery.
Authors:Haotan Guo, Jianfei He, Jiayuan Ma, Hongbin Na, Zimu Wang, Haiyang Zhang, Qi Chen, Wei Wang, Zijing Shi, Tao Shen, Ling Chen
Abstract:
Phonetic Cloaking Replacement (PCR), defined as the deliberate use of homophonic or near-homophonic variants to hide toxic intent, has become a major obstacle to Chinese content moderation. While this problem is well-recognized, existing evaluations predominantly rely on rule-based, synthetic perturbations that ignore the creativity of real users. We organize PCR into a four-way surface-form taxonomy and compile \ours, a dataset of 500 naturally occurring, phonetically cloaked offensive posts gathered from the RedNote platform. Benchmarking state-of-the-art LLMs on this dataset exposes a serious weakness: the best model reaches only an F1-score of 0.672, and zero-shot chain-of-thought prompting pushes performance even lower. Guided by error analysis, we revisit a Pinyin-based prompting strategy that earlier studies judged ineffective and show that it recovers much of the lost accuracy. This study offers the first comprehensive taxonomy of Chinese PCR, a realistic benchmark that reveals current detectors' limits, and a lightweight mitigation technique that advances research on robust toxicity detection.
中文摘要:语音伪装替换(PCR)通过同音词掩盖中文恶意内容,给内容审核带来挑战,本研究构建的真实数据集揭示了先进检测模型的严重缺陷,同时证明拼音提示策略能有效提升识别准确率。
English Summary: Phonetic Cloaking Replacement (PCR) uses homophones to disguise toxic content in Chinese, challenging moderation systems, and this study introduces a realistic dataset revealing major detection gaps in advanced models while demonstrating that Pinyin-based prompting significantly improves accuracy.
Authors:Zhitao Wang, Hengyu Man, Wenrui Li, Xingtao Wang, Xiaopeng Fan, Debin Zhao
Abstract:
Recent advances in video generation techniques have given rise to an emerging paradigm of generative video coding for Ultra-Low Bitrate (ULB) scenarios by leveraging powerful generative priors. However, most existing methods are limited by domain specificity (e.g., facial or human videos) or excessive dependence on high-level text guidance, which tend to inadequately capture fine-grained motion details, leading to unrealistic or incoherent reconstructions. To address these challenges, we propose Trajectory-Guided Generative Video Coding (dubbed T-GVC), a novel framework that bridges low-level motion tracking with high-level semantic understanding. T-GVC features a semantic-aware sparse motion sampling pipeline that extracts pixel-wise motion as sparse trajectory points based on their semantic importance, significantly reducing the bitrate while preserving critical temporal semantic information. In addition, by integrating trajectory-aligned loss constraints into diffusion processes, we introduce a training-free guidance mechanism in latent space to ensure physically plausible motion patterns without sacrificing the inherent capabilities of generative models. Experimental results demonstrate that T-GVC outperforms both traditional and neural video codecs under ULB conditions. Furthermore, additional experiments confirm that our framework achieves more precise motion control than existing text-guided methods, paving the way for a novel direction of generative video coding guided by geometric motion modeling.
中文摘要:本文提出T-GVC框架,通过语义感知稀疏运动采样和轨迹对齐的潜空间引导机制,在超低码率下实现优于传统方法的视频重建质量,为几何运动建模引导的生成式视频编码开辟了新方向。
English Summary: The paper introduces T-GVC, a novel generative video coding framework that combines low-level motion tracking with semantic understanding through sparse trajectory sampling and diffusion-based guidance to achieve superior performance under ultra-low bitrate conditions.
Authors:Nermin Samet, Gilles Puy, Renaud Marlet
Abstract:
We study the use of image-based Vision-Language Models (VLMs) for open-vocabulary segmentation of lidar scans in driving settings. Classically, image semantics can be back-projected onto 3D point clouds. Yet, resulting point labels are noisy and sparse. We consolidate these labels to enforce both spatio-temporal consistency and robustness to image-level augmentations. We then train a 3D network based on these refined labels. This simple method, called LOSC, outperforms the SOTA of zero-shot open-vocabulary semantic and panoptic segmentation on both nuScenes and SemanticKITTI, with significant margins.
中文:LOSC方法通过整合具有时空一致性和增强鲁棒性的图像衍生标签,显著提升了激光雷达扫描的零样本分割性能,在多个驾驶数据集上超越现有最佳成果。
English: The proposed LOSC method enhances lidar scan segmentation by consolidating noisy image-derived labels with spatio-temporal consistency and augmentation robustness, achieving state-of-the-art zero-shot performance on driving datasets.
Authors:Tiezheng Zhang, Yitong Li, Yu-cheng Chou, Jieneng Chen, Alan Yuille, Chen Wei, Junfei Xiao
Abstract:
Building state-of-the-art Vision-Language Models (VLMs) with strong captioning capabilities typically necessitates training on billions of high-quality image-text pairs, requiring millions of GPU hours. This paper introduces the Vision-Language-Vision (VLV) auto-encoder framework, which strategically leverages key pretrained components: a vision encoder, the decoder of a Text-to-Image (T2I) diffusion model, and subsequently, a Large Language Model (LLM). Specifically, we establish an information bottleneck by regularizing the language representation space, achieved through freezing the pretrained T2I diffusion decoder. Our VLV pipeline effectively distills knowledge from the text-conditioned diffusion model using continuous embeddings, demonstrating comprehensive semantic understanding via high-quality reconstructions. Furthermore, by fine-tuning a pretrained LLM to decode the intermediate language representations into detailed descriptions, we construct a state-of-the-art (SoTA) captioner comparable to leading models like GPT-4o and Gemini 2.0 Flash. Our method demonstrates exceptional cost-efficiency and significantly reduces data requirements; by primarily utilizing single-modal images for training and maximizing the utility of existing pretrained models (image encoder, T2I diffusion model, and LLM), it circumvents the need for massive paired image-text datasets, keeping the total training expenditure under $1,000 USD.
中文: 本文提出了一种成本高效的视觉-语言-视觉自编码器框架,通过利用预训练组件,仅需少量数据和低于1000美元的训练成本即可构建性能媲美顶尖模型的描述生成系统。
English: This paper introduces a cost-efficient Vision-Language-Vision auto-encoder framework that leverages pretrained components to build state-of-the-art captioners with minimal data and under $1,000 in training costs.
Authors:Fareya Ikram, Alexander Scarlatos, Andrew Lan
Abstract:
Tutoring dialogues have gained significant attention in recent years, given the prominence of online learning and the emerging tutoring abilities of artificial intelligence (AI) agents powered by large language models (LLMs). Recent studies have shown that the strategies used by tutors can have significant effects on student outcomes, necessitating methods to predict how tutors will behave and how their actions impact students. However, few works have studied predicting tutor strategy in dialogues. Therefore, in this work we investigate the ability of modern LLMs, particularly Llama 3 and GPT-4o, to predict both future tutor moves and student outcomes in dialogues, using two math tutoring dialogue datasets. We find that even state-of-the-art LLMs struggle to predict future tutor strategy while tutor strategy is highly indicative of student outcomes, outlining a need for more powerful methods to approach this task.
Chinese: 尽管AI辅导中导师策略对学生成果至关重要,但当前大型语言模型如Llama 3和GPT-4o在预测这些策略方面仍面临挑战,突显了开发更有效方法的必要性。
English: Recent advances in AI-driven tutoring highlight the importance of tutor strategies in student outcomes, yet current large language models like Llama 3 and GPT-4o still struggle to predict these strategies effectively despite their clear impact on learning.
Authors:Yafei Zhang, Yongle Shang, Huafeng Li
Abstract:
Weakly supervised text-to-person image matching, as a crucial approach to reducing models' reliance on large-scale manually labeled samples, holds significant research value. However, existing methods struggle to predict complex one-to-many identity relationships, severely limiting performance improvements. To address this challenge, we propose a local-and-global dual-granularity identity association mechanism. Specifically, at the local level, we explicitly establish cross-modal identity relationships within a batch, reinforcing identity constraints across different modalities and enabling the model to better capture subtle differences and correlations. At the global level, we construct a dynamic cross-modal identity association network with the visual modality as the anchor and introduce a confidence-based dynamic adjustment mechanism, effectively enhancing the model's ability to identify weakly associated samples while improving overall sensitivity. Additionally, we propose an information-asymmetric sample pair construction method combined with consistency learning to tackle hard sample mining and enhance model robustness. Experimental results demonstrate that the proposed method substantially boosts cross-modal matching accuracy, providing an efficient and practical solution for text-to-person image matching.
中文摘要:本文提出一种局部与全局双粒度身份关联机制,通过建立跨模态关联和动态调整策略,有效提升弱监督文本-行人图像匹配的准确性和鲁棒性。
English Summary: This paper introduces a local-and-global dual-granularity identity association mechanism that enhances weakly supervised text-to-person image matching by establishing cross-modal relationships and dynamic adjustment, significantly improving matching accuracy and robustness.
Authors:Ngoc Luyen Le, Marie-Hélène Abel, Bertrand Laforge
Abstract:
The rapid transformation of the labor market, driven by technological advancements and the digital economy, requires continuous competence development and constant adaptation. In this context, traditional competence management systems lack interoperability, adaptability, and semantic understanding, making it difficult to align individual competencies with labor market needs and training programs. This paper proposes an ontology-based framework for competence management, enabling a structured representation of competencies, occupations, and training programs. By leveraging ontological models and semantic reasoning, this framework aims to enhance the automation of competence-to-job matching, the personalization of learning recommendations, and career planning. This study discusses the design, implementation, and potential applications of the framework, focusing on competence training programs, job searching, and finding competent individuals.
中文: 本文提出了一种基于本体的能力管理框架,通过语义推理改进传统系统的不足,旨在优化职位匹配、个性化学习和职业规划。
English: This paper introduces an ontology-based framework for managing competencies to address the limitations of traditional systems, enhancing job matching, personalized learning, and career planning through semantic reasoning.
Authors:Bei Zhou, Zhouheng Li, Lei Xie, Hongye Su, Johannes Betz
Abstract:
Inertia drift is a transitional maneuver between two sustained drift stages in opposite directions, which provides valuable insights for navigating consecutive sharp corners for autonomous racing.However, this can be a challenging scenario for the drift controller to handle rapid transitions between opposing sideslip angles while maintaining accurate path tracking. Moreover, accurate drift control depends on a high-fidelity vehicle model to derive drift equilibrium points and predict vehicle states, but this is often compromised by the strongly coupled longitudinal-lateral drift dynamics and unpredictable environmental variations. To address these challenges, this paper proposes a learning-based planning and control framework utilizing Bayesian optimization (BO), which develops a planning logic to ensure a smooth transition and minimal velocity loss between inertia and sustained drift phases. BO is further employed to learn a performance-driven control policy that mitigates modeling errors for enhanced system performance. Simulation results on an 8-shape reference path demonstrate that the proposed framework can achieve smooth and stable inertia drift through sharp corners.
中文: 本文提出了一种基于贝叶斯优化的学习型规划与控制框架,通过克服建模误差和快速侧滑角转换的挑战,实现了自动驾驶赛车在急弯中平稳、稳定的惯性漂移过渡。
English: This paper introduces a learning-based planning and control framework using Bayesian optimization to enable smooth, stable inertia drift transitions in autonomous racing by overcoming modeling challenges and rapid sideslip angle changes.
Authors:Qianpiao Ma, Junlong Zhou, Xiangpeng Hou, Jianchun Liu, Hongli Xu, Jianeng Miao, Qingmin Jia
Abstract:
Federated learning (FL) is a new paradigm to train AI models over distributed edge devices (i.e., workers) using their local data, while confronting various challenges including communication resource constraints, edge heterogeneity and data Non-IID. Over-the-air computation (AirComp) is a promising technique to achieve efficient utilization of communication resource for model aggregation by leveraging the superposition property of a wireless multiple access channel (MAC). However, AirComp requires strict synchronization among edge devices, which is hard to achieve in heterogeneous scenarios. In this paper, we propose an AirComp-based grouping asynchronous federated learning mechanism (Air-FedGA), which combines the advantages of AirComp and asynchronous FL to address the communication and heterogeneity challenges. Specifically, Air-FedGA organizes workers into groups and performs over-the-air aggregation within each group, while groups asynchronously communicate with the parameter server to update the global model. In this way, Air-FedGA accelerates the FL model training by over-the-air aggregation, while relaxing the synchronization requirement of this aggregation technology. We theoretically prove the convergence of Air-FedGA. We formulate a training time minimization problem for Air-FedGA and propose the power control and worker grouping algorithm to solve it, which jointly optimizes the power scaling factors at edge devices, the denoising factors at the parameter server, as well as the worker grouping strategy. We conduct experiments on classical models and datasets, and the results demonstrate that our proposed mechanism and algorithm can speed up FL model training by 29.9%-71.6% compared with the state-of-the-art solutions.
中文: 提出的Air-FedGA机制将空中计算与异步联邦学习相结合,通过优化的功率控制和分组策略,在解决通信约束和设备异构性问题的同时,将联邦学习模型训练速度提升了29.9%-71.6%。
English: The proposed Air-FedGA mechanism combines over-the-air computation with asynchronous federated learning to accelerate model training by 29.9%-71.6% while addressing communication constraints and device heterogeneity through optimized power control and worker grouping.
Authors:Jiahui Wang, Qin Xu, Bo Jiang, Bin Luo
Abstract:
Prompt learning methods have significantly extended the transferability of pre-trained Vision-Language Models (VLMs) like CLIP for various downstream tasks. These methods adopt handcraft templates or learnable vectors to provide text or image instructions in fine-tuning VLMs. However, most existing works ignore the structural relationships between learnable prompts and tokens within and between modalities. Moreover, balancing the performance of base and new classes remains a significant challenge. In this paper, we propose an Integrated Structural Prompt (ISP) for VLMs to enhance the interaction of information representations between the text and image branches. ISP introduces self-structural and cross-structural prompt modules to model the structural relationships between learnable prompts and frozen tokens within and across modalities. This enables efficient information transfer while preserving feature stability. Additionally, we propose a sample probing module that dynamically adjusts loss coefficients based on sample difficulty, preventing the mode from overfitting to simple samples and improving generalization ability to new classes. Extensive experiments on three widely used settings: base-to-new generalization, cross-dataset evaluation, and domain generalization demonstrate that the proposed ISP achieves competitive performance against state-of-the-art methods.
中文摘要:本文提出的集成结构提示(ISP)方法通过自结构和跨结构提示模块增强视觉语言模型的跨模态交互,同时采用样本探测模块动态调整训练过程,在多种实验设置中显著提升了模型的泛化性能。
English Summary: This paper introduces an Integrated Structural Prompt (ISP) method that enhances cross-modal interactions in vision-language models through self-structural and cross-structural prompt modules, while a sample probing module dynamically adjusts training to improve generalization across diverse experimental settings.
Authors:Jiahui Wang, Qin Xu, Bo Jiang, Bin Luo
Abstract:
Pre-trained large vision-language models (VLMs) like CLIP demonstrate impressive generalization ability. Existing prompt-based and adapter-based works have made significant progress in fine-tuning VLMs but still face the challenges of maintaining strong generalization abilities, particularly towards unseen new classes. This limitation partly arises from these methods treating all tokens of the image and text encoder equally, which can lead to overfitting on less informative features (e.g., background noise, template words) and degrade the general representations that are crucial for novel concept recognition. To address this issue, we propose Dynamic Rank Adaptation (DRA), a novel adapter variant method, designed specifically to enhance new class generalization. DRA dynamically allocates adaptation ranks based on the importance of features during training to preserve general knowledge. DRA first employs token importance grouping, using sequence attention to evaluate and group tokens by their importance. Then, we adopt rank adaptation according to the importance of each token group dynamically by assigning higher feature ranks to the more important tokens. Also, we design a new channel response mechanism to prioritize the preservation and adaptation of feature channels identified as the most informative for each instance. In addition, a L1 regularization term is introduced to stabilize the training. Extensive experiments demonstrate the effectiveness and superiority of our proposed DRA over existing works, especially on enhancing the performance of new classes on various benchmarks, including base-new classes, cross-datasets evaluation and domain generalization. The source code will be published after the paper is received.
Chinese: 本文提出动态秩适应(DRA)方法,通过基于令牌重要性动态分配特征秩来增强视觉语言模型对新类别的泛化能力,在多项基准测试中优于现有方法。
English: This paper introduces Dynamic Rank Adaptation (DRA), a novel adapter method that dynamically adjusts feature ranks based on token importance to enhance the generalization of vision-language models to unseen classes, outperforming existing approaches across multiple benchmarks.
Authors:Yuhang Li, Shiqi Chen, Tingyu Gong, Aydogan Ozcan
Abstract:
Optical computing holds promise for high-speed, energy-efficient information processing, with diffractive optical networks emerging as a flexible platform for implementing task-specific transformations. A challenge, however, is the effective optimization and alignment of the diffractive layers, which is hindered by the difficulty of accurately modeling physical systems with their inherent hardware imperfections, noise, and misalignments. While existing in situ optimization methods offer the advantage of direct training on the physical system without explicit system modeling, they are often limited by slow convergence and unstable performance due to inefficient use of limited measurement data. Here, we introduce a model-free reinforcement learning approach utilizing Proximal Policy Optimization (PPO) for the in situ training of diffractive optical processors. PPO efficiently reuses in situ measurement data and constrains policy updates to ensure more stable and faster convergence. We experimentally validated our method across a range of in situ learning tasks, including targeted energy focusing through a random diffuser, holographic image generation, aberration correction, and optical image classification, demonstrating in each task better convergence and performance. Our strategy operates directly on the physical system and naturally accounts for unknown real-world imperfections, eliminating the need for prior system knowledge or modeling. By enabling faster and more accurate training under realistic experimental constraints, this in situ reinforcement learning approach could offer a scalable framework for various optical and physical systems governed by complex, feedback-driven dynamics.
中文: 本文提出了一种利用近端策略优化的无模型强化学习方法,通过高效复用测量数据并无需先验系统建模即可适应实际缺陷,实现了衍射光学处理器更快、更稳定的训练收敛。
English: This paper introduces a model-free reinforcement learning method using Proximal Policy Optimization for training diffractive optical processors, which enables faster and more stable convergence by efficiently reusing measurement data and accounting for real-world imperfections without prior system modeling.
Authors:Alex ZH Dou, Zhongwei Wan, Dongfei Cui, Xin Wang, Jing Xiong, Haokun Lin, Chaofan Tao, Shen Yan, Mi Zhang
Abstract:
Test-time scaling has emerged as a promising paradigm in language modeling, leveraging additional computational resources at inference time to enhance model performance. In this work, we introduce R2-LLMs, a novel and versatile hierarchical retrieval-augmented reasoning framework designed to improve test-time scaling in large language models (LLMs) without requiring distillation from more advanced models to obtain chain-of-thought (CoT) training data. R2-LLMs enhances inference-time generalization by integrating dual-level retrieval-based in-context learning: (1) At the coarse level, our approach extracts abstract templates from complex reasoning problems and retrieves similar problem-answer pairs to facilitate high-level in-context learning; (2) At the fine level, during Monte Carlo Tree Search (MCTS), R2-LLMs efficiently retrieves analogous intermediate solution steps from reference mathematical problem datasets, refining step-wise reasoning with the aid of a process reward model (PRM) for scoring. R2-LLMs is a robust hierarchical reasoning-augmentation method that enhances in-context-level reasoning while seamlessly integrating with step-level tree search methods. Utilizing PRM, it refines both candidate generation and decision-making for improved reasoning accuracy. Empirical evaluations on the MATH500, GSM8K, and OlympiadBench-TO datasets achieve substantial relative improvement with an increase of up to 16% using LLaMA-3.1-8B compared to the baselines, showcasing the effectiveness of our approach in complex reasoning tasks.
中文: R2-LLMs是一种新颖的分层检索增强框架,通过双层级上下文学习和蒙特卡洛树搜索提升大语言模型的测试时扩展能力,在复杂推理任务中无需蒸馏训练数据即可实现高达16%的性能提升。
English: R2-LLMs is a hierarchical retrieval-augmented framework that enhances test-time scaling in LLMs through dual-level in-context learning and Monte Carlo Tree Search, achieving up to 16% performance improvement on complex reasoning benchmarks without relying on distilled training data.
Authors:Chen Wang, Tianyu Peng, Wen Yang, Yinan Bai, Guangfu Wang, Jun Lin, Lanpeng Jia, Lingxiang Wu, Jinqiao Wang, Chengqing Zong, Jiajun Zhang
Abstract:
Empathetic interaction is a cornerstone of human-machine communication, due to the need for understanding speech enriched with paralinguistic cues and generating emotional and expressive responses. However, the most powerful empathetic LSLMs are increasingly closed off, leaving the crucial details about the architecture, data and development opaque to researchers. Given the critical need for transparent research into the LSLMs and empathetic behavior, we present OpenS2S, a fully open-source, transparent and end-to-end LSLM designed to enable empathetic speech interactions. Based on our empathetic speech-to-text model BLSP-Emo, OpenS2S further employs a streaming interleaved decoding architecture to achieve low-latency speech generation. To facilitate end-to-end training, OpenS2S incorporates an automated data construction pipeline that synthesizes diverse, high-quality empathetic speech dialogues at low cost. By leveraging large language models to generate empathetic content and controllable text-to-speech systems to introduce speaker and emotional variation, we construct a scalable training corpus with rich paralinguistic diversity and minimal human supervision. We release the fully open-source OpenS2S model, including the dataset, model weights, pre-training and fine-tuning codes, to empower the broader research community and accelerate innovation in empathetic speech systems. The project webpage can be accessed at https://casia-lm.github.io/OpenS2S
Chinese: OpenS2S 是一个完全开源、透明且端到端的大规模语言模型,旨在通过流式交错解码架构和自动化数据构建流程,实现具有丰富副语言多样性的低延迟共情语音交互,并公开数据集、模型权重及代码以推动研究。
English: OpenS2S is a fully open-source, transparent, and end-to-end large-scale language model designed to enable empathetic speech interactions by leveraging a streaming interleaved decoding architecture and an automated data construction pipeline for scalable, high-quality training.
Authors:Yutian Chen, Shi Guo, Tianshuo Yang, Lihe Ding, Xiuyuan Yu, Jinwei Gu, Tianfan Xue
Abstract:
Reconstructing fast-dynamic scenes from multi-view videos is crucial for high-speed motion analysis and realistic 4D reconstruction. However, the majority of 4D capture systems are limited to frame rates below 30 FPS (frames per second), and a direct 4D reconstruction of high-speed motion from low FPS input may lead to undesirable results. In this work, we propose a high-speed 4D capturing system only using low FPS cameras, through novel capturing and processing modules. On the capturing side, we propose an asynchronous capture scheme that increases the effective frame rate by staggering the start times of cameras. By grouping cameras and leveraging a base frame rate of 25 FPS, our method achieves an equivalent frame rate of 100-200 FPS without requiring specialized high-speed cameras. On processing side, we also propose a novel generative model to fix artifacts caused by 4D sparse-view reconstruction, as asynchrony reduces the number of viewpoints at each timestamp. Specifically, we propose to train a video-diffusion-based artifact-fix model for sparse 4D reconstruction, which refines missing details, maintains temporal consistency, and improves overall reconstruction quality. Experimental results demonstrate that our method significantly enhances high-speed 4D reconstruction compared to synchronous capture.
中文: 本研究提出了一种利用低帧率相机的高速4D捕捉系统,通过异步采集方案提升等效帧率,并采用生成模型修复重建伪影。
English: This study introduces a high-speed 4D capture system using low-frame-rate cameras, employing an asynchronous capture scheme to boost effective frame rates and a generative model to correct reconstruction artifacts.
Authors:Jaewook Lee, Alexander Scarlatos, Andrew Lan
Abstract:
Learning Japanese vocabulary is a challenge for learners from Roman alphabet backgrounds due to script differences. Japanese combines syllabaries like hiragana with kanji, which are logographic characters of Chinese origin. Kanji are also complicated due to their complexity and volume. Keyword mnemonics are a common strategy to aid memorization, often using the compositional structure of kanji to form vivid associations. Despite recent efforts to use large language models (LLMs) to assist learners, existing methods for LLM-based keyword mnemonic generation function as a black box, offering limited interpretability. We propose a generative framework that explicitly models the mnemonic construction process as driven by a set of common rules, and learn them using a novel Expectation-Maximization-type algorithm. Trained on learner-authored mnemonics from an online platform, our method learns latent structures and compositional rules, enabling interpretable and systematic mnemonics generation. Experiments show that our method performs well in the cold-start setting for new learners while providing insight into the mechanisms behind effective mnemonic creation.
中文摘要:本研究提出了一种生成框架,通过新颖算法学习显式规则来建模助记符创建过程,能够实现可解释的日语词汇记忆方法,尤其适用于零基础学习者。
English Summary: The study introduces a generative framework that models mnemonic creation using explicit rules learned through a novel algorithm, enabling interpretable Japanese vocabulary memorization especially effective for new learners.
Authors:Jaewook Lee, Alexander Scarlatos, Andrew Lan
Abstract:
Learning Japanese vocabulary is a challenge for learners from Roman alphabet backgrounds due to script differences. Japanese combines syllabaries like hiragana with kanji, which are logographic characters of Chinese origin. Kanji are also complicated due to their complexity and volume. Keyword mnemonics are a common strategy to aid memorization, often using the compositional structure of kanji to form vivid associations. Despite recent efforts to use large language models (LLMs) to assist learners, existing methods for LLM-based keyword mnemonic generation function as a black box, offering limited interpretability. We propose a generative framework that explicitly models the mnemonic construction process as driven by a set of common rules, and learn them using a novel Expectation-Maximization-type algorithm. Trained on learner-authored mnemonics from an online platform, our method learns latent structures and compositional rules, enabling interpretable and systematic mnemonics generation. Experiments show that our method performs well in the cold-start setting for new learners while providing insight into the mechanisms behind effective mnemonic creation.
中文摘要:本研究提出了一种生成框架,通过新颖算法学习显式规则来建模助记符创建过程,能够实现可解释的日语词汇记忆方法,尤其适用于零基础学习者。
English Summary: The study introduces a generative framework that models mnemonic creation using explicit rules learned through a novel algorithm, enabling interpretable Japanese vocabulary memorization especially effective for new learners.
Authors:Alexander Scarlatos, Nigel Fernandez, Christopher Ormerod, Susan Lottridge, Andrew Lan
Abstract:
Item (question) difficulties play a crucial role in educational assessments, enabling accurate and efficient assessment of student abilities and personalization to maximize learning outcomes. Traditionally, estimating item difficulties can be costly, requiring real students to respond to items, followed by fitting an item response theory (IRT) model to get difficulty estimates. This approach cannot be applied to the cold-start setting for previously unseen items either. In this work, we present SMART (Simulated Students Aligned with IRT), a novel method for aligning simulated students with instructed ability, which can then be used in simulations to predict the difficulty of open-ended items. We achieve this alignment using direct preference optimization (DPO), where we form preference pairs based on how likely responses are under a ground-truth IRT model. We perform a simulation by generating thousands of responses, evaluating them with a large language model (LLM)-based scoring model, and fit the resulting data to an IRT model to obtain item difficulty estimates. Through extensive experiments on two real-world student response datasets, we show that SMART outperforms other item difficulty prediction methods by leveraging its improved ability alignment.
中文: SMART提出了一种新颖方法,通过直接偏好优化将模拟学生与IRT对齐来预测开放性题目的难度,利用改进的能力对齐优势,无需真实学生作答即可超越现有预测方法。
English: SMART introduces a novel method using simulated students aligned with IRT via direct preference optimization to predict open-ended item difficulties, outperforming existing approaches by leveraging improved ability alignment without requiring real student responses.
Authors:Wenhao Li, Xiu Su, Jingyi Wu, Feng Yang, Yang Liu, Yi Chen, Shan You, Chang Xu
Abstract:
Large Vision-Language Models (LVLMs) have demonstrated remarkable advancements in numerous areas such as multimedia. However, hallucination issues significantly limit their credibility and application potential. Existing mitigation methods typically rely on external tools or the comparison of multi-round inference, which significantly increase inference time. In this paper, we propose \textbf{SE}lf-\textbf{E}volving \textbf{D}istillation (\textbf{SEED}), which identifies hallucinations within the inner knowledge of LVLMs, isolates and purges them, and then distills the purified knowledge back into the model, enabling self-evolution. Furthermore, we identified that traditional distillation methods are prone to inducing void spaces in the output space of LVLMs. To address this issue, we propose a Mode-Seeking Evolving approach, which performs distillation to capture the dominant modes of the purified knowledge distribution, thereby avoiding the chaotic results that could emerge from void spaces. Moreover, we introduce a Hallucination Elimination Adapter, which corrects the dark knowledge of the original model by learning purified knowledge. Extensive experiments on multiple benchmarks validate the superiority of our SEED, demonstrating substantial improvements in mitigating hallucinations for representative LVLM models such as LLaVA-1.5 and InternVL2. Remarkably, the F1 score of LLaVA-1.5 on the hallucination evaluation metric POPE-Random improved from 81.3 to 88.3.
中文: 本文提出的SEED方法通过识别并清除大型视觉语言模型内部幻觉知识,再经蒸馏实现自我进化,在POPE-Random基准测试中使LLaVA-1.5模型的F1分数从81.3显著提升至88.3。
English: The proposed SEED method enables Large Vision-Language Models to self-evolve by identifying and purging internal hallucinations through knowledge distillation, significantly enhancing performance on benchmarks like POPE-Random where LLaVA-1.5's F1 score improved from 81.3 to 88.3.
Authors:Ziying Song, Lin Liu, Hongyu Pan, Bencheng Liao, Mingzhe Guo, Lei Yang, Yongchang Zhang, Shaoqing Xu, Caiyan Jia, Yadan Luo
Abstract:
Most end-to-end autonomous driving methods rely on imitation learning from single expert demonstrations, often leading to conservative and homogeneous behaviors that limit generalization in complex real-world scenarios. In this work, we propose DIVER, an end-to-end driving framework that integrates reinforcement learning with diffusion-based generation to produce diverse and feasible trajectories. At the core of DIVER lies a reinforced diffusion-based generation mechanism. First, the model conditions on map elements and surrounding agents to generate multiple reference trajectories from a single ground-truth trajectory, alleviating the limitations of imitation learning that arise from relying solely on single expert demonstrations. Second, reinforcement learning is employed to guide the diffusion process, where reward-based supervision enforces safety and diversity constraints on the generated trajectories, thereby enhancing their practicality and generalization capability. Furthermore, to address the limitations of L2-based open-loop metrics in capturing trajectory diversity, we propose a novel Diversity metric to evaluate the diversity of multi-mode predictions.Extensive experiments on the closed-loop NAVSIM and Bench2Drive benchmarks, as well as the open-loop nuScenes dataset, demonstrate that DIVER significantly improves trajectory diversity, effectively addressing the mode collapse problem inherent in imitation learning.
The proposed DIVER framework enhances autonomous driving by combining reinforcement learning with diffusion-based trajectory generation to overcome the limitations of imitation learning, significantly improving trajectory diversity and generalization in complex scenarios.
English Summary:
Authors:Chen Pang, Xuequan Lu, Qianyu Zhou, Lei Lyu
Abstract:
Most GCN-based methods model interacting individuals as independent graphs, neglecting their inherent inter-dependencies. Although recent approaches utilize predefined interaction adjacency matrices to integrate participants, these matrices fail to adaptively capture the dynamic and context-specific joint interactions across different actions. In this paper, we propose the Active Node Selection with External Attention Network (ASEA), an innovative approach that dynamically captures interaction relationships without predefined assumptions. Our method models each participant individually using a GCN to capture intra-personal relationships, facilitating a detailed representation of their actions. To identify the most relevant nodes for interaction modeling, we introduce the Adaptive Temporal Node Amplitude Calculation (AT-NAC) module, which estimates global node activity by combining spatial motion magnitude with adaptive temporal weighting, thereby highlighting salient motion patterns while reducing irrelevant or redundant information. A learnable threshold, regularized to prevent extreme variations, is defined to selectively identify the most informative nodes for interaction modeling. To capture interactions, we design the External Attention (EA) module to operate on active nodes, effectively modeling the interaction dynamics and semantic relationships between individuals. Extensive evaluations show that our method captures interaction relationships more effectively and flexibly, achieving state-of-the-art performance.
中文总结:ASEA方法通过自适应节点选择和外部注意力模块动态捕捉交互关系,无需预定义矩阵,有效建模个体间互动,实现了最先进的性能。
English Summary: The ASEA method dynamically captures interaction relationships without predefined matrices by using adaptive node selection and external attention to model relevant interactions, achieving state-of-the-art performance.
Authors:Zhengze Ji, Yixuan Huang, Zhixin Chen, Jie Yang, Shi Jin
Abstract:
Parking lot surveillance with integrated sensing and communication (ISAC) system is one of the potential application scenarios defined by 3rd Generation Partnership Project (3GPP). Traditional surveillance systems using cameras or magnetic sensors face limitations such as light dependence, high costs, and constrained scalability. Wireless sensing with reconfigurable intelligent surfaces (RISs) has the ability to address the above limitations due to its light independence and lower deployment overhead. In this study, we propose a difference imaging-based multi-RIS-aided collaborative ISAC system to achieve parking lot surveillance. In a parking lot, the presence of vehicles induces impacts on wireless environments due to scattering characteristic variation. By delineating the parking lot into a two-dimensional image with several grid units, the proposed system can capture the variation of their scattering coefficients in free and occupied states. The variation between these two states is sparse, which can be captured through compressed sensing (CS)-based imaging algorithms. Additionally, we collaboratively employ multiple RISs to enable higher surveillance performance. Experimental results demonstrate that our method can achieve high-accuracy parking occupancy detection, and the employment of collaborative RISs further enhances the detection rate.
中文摘要:本研究提出了一种基于差分成像的多RIS协作ISAC系统,通过压缩感知算法捕捉停车位散射系数的稀疏变化,有效解决了传统监控系统的局限性,实现了高精度的车位状态检测。
English Summary: The study introduces a multi-RIS collaborative ISAC system using compressed sensing imaging to overcome traditional parking surveillance limitations, achieving high-accuracy vehicle detection through sparse scattering variation analysis.
Authors:Zhengkai Lin, Zhihang Fu, Ze Chen, Chao Chen, Liang Xie, Wenxiao Wang, Deng Cai, Zheng Wang, Jieping Ye
Abstract:
Human cognition is theorized to operate in two modes: fast, intuitive System 1 thinking and slow, deliberate System 2 thinking. While current Large Reasoning Models (LRMs) excel at System 2 thinking, their inability to perform fast thinking leads to high computational overhead and latency. In this work, we enable LRMs to approximate human intelligence through dynamic thinking speed adjustment, optimizing accuracy-efficiency trade-offs. Our approach addresses two key questions: (1) how to control thinking speed in LRMs, and (2) when to adjust it for optimal performance. For the first question, we identify the steering vector that governs slow-fast thinking transitions in LRMs' representation space. Using this vector, we achieve the first representation editing-based test-time scaling effect, outperforming existing prompt-based scaling methods. For the second question, we apply real-time difficulty estimation to signal reasoning segments of varying complexity. Combining these techniques, we propose the first reasoning strategy that enables fast processing of easy steps and deeper analysis for complex reasoning. Without any training or additional cost, our plug-and-play method yields an average +1.3% accuracy with -8.6% token usage across leading LRMs and advanced reasoning benchmarks. All of our algorithms are implemented based on vLLM and are expected to support broader applications and inspire future research.
中文摘要:本研究通过识别控制快慢思维转换的导向向量并应用实时难度评估,使大型推理模型能够动态调整思考速度,以即插即用的方式实现更高准确率和更低计算成本。
English Summary: This work enables Large Reasoning Models to dynamically adjust thinking speed by identifying steering vectors for slow-fast transitions and applying real-time difficulty estimation, achieving higher accuracy with reduced computational costs through a plug-and-play method.
Authors:Jie Peng, Jiarui Ji, Runlin Lei, Zhewei Wei, Yongchao Liu, Chuntao Hong
Abstract:
Dynamic Text-Attributed Graphs (DyTAGs), which intricately integrate structural, temporal, and textual attributes, are crucial for modeling complex real-world systems. However, most of the existing DyTAG datasets exhibit poor textual quality, which severely limits their utility for DyTAG generation tasks requiring semantically rich inputs. Additionally, prior work mainly focuses on discriminative tasks on DyTAGs, resulting in a lack of standardized task formulations and evaluation protocols tailored for DyTAG generation. To address these critical issues, we propose Generative DyTAG Benchmark (GDGB), which comprises eight meticulously curated DyTAG datasets with high-quality textual features for both nodes and edges, overcoming limitations of prior datasets. Building on GDGB, we define two novel DyTAG generation tasks: Transductive Dynamic Graph Generation (TDGG) and Inductive Dynamic Graph Generation (IDGG). TDGG transductively generates a target DyTAG based on the given source and destination node sets, while the more challenging IDGG introduces new node generation to inductively model the dynamic expansion of real-world graph data. To enable holistic evaluation, we design multifaceted metrics that assess the structural, temporal, and textual quality of the generated DyTAGs. We further propose GAG-General, an LLM-based multi-agent generative framework tailored for reproducible and robust benchmarking of DyTAG generation. Experimental results demonstrate that GDGB enables rigorous evaluation of TDGG and IDGG, with key insights revealing the critical interplay of structural and textual features in DyTAG generation. These findings establish GDGB as a foundational resource for advancing generative DyTAG research and unlocking further practical applications in DyTAG generation. GDGB datasets, source codes, and leaderboards are available at \href{https://gdgb-algo.github.io/}{here}.
中文摘要:作者提出了GDGB基准,包含高质量数据集和新型动态文本属性图生成任务,解决了以往在文本质量和评估标准方面的不足。
English Summary: The authors introduce GDGB, a benchmark with high-quality datasets and novel tasks for generating dynamic text-attributed graphs, addressing previous limitations in textual quality and evaluation standards.
Authors:Fangfu Liu, Hao Li, Jiawei Chi, Hanyang Wang, Minghui Yang, Fudong Wang, Yueqi Duan
Abstract:
Recovering 3D structures with open-vocabulary scene understanding from 2D images is a fundamental but daunting task. Recent developments have achieved this by performing per-scene optimization with embedded language information. However, they heavily rely on the calibrated dense-view reconstruction paradigm, thereby suffering from severe rendering artifacts and implausible semantic synthesis when limited views are available. In this paper, we introduce a novel generative framework, coined LangScene-X, to unify and generate 3D consistent multi-modality information for reconstruction and understanding. Powered by the generative capability of creating more consistent novel observations, we can build generalizable 3D language-embedded scenes from only sparse views. Specifically, we first train a TriMap video diffusion model that can generate appearance (RGBs), geometry (normals), and semantics (segmentation maps) from sparse inputs through progressive knowledge integration. Furthermore, we propose a Language Quantized Compressor (LQC), trained on large-scale image datasets, to efficiently encode language embeddings, enabling cross-scene generalization without per-scene retraining. Finally, we reconstruct the language surface fields by aligning language information onto the surface of 3D scenes, enabling open-ended language queries. Extensive experiments on real-world data demonstrate the superiority of our LangScene-X over state-of-the-art methods in terms of quality and generalizability. Project Page: https://liuff19.github.io/LangScene-X.
中文摘要:LangScene-X是一种创新生成框架,通过扩散模型和语言量化压缩技术,仅需稀疏二维视图即可生成外观、几何和语义统一的三维多模态场景,实现高质量重建与开放词汇查询。
English Summary: LangScene-X is a generative framework that creates consistent 3D multi-modality scenes from sparse 2D views by integrating appearance, geometry, and semantics through diffusion models and language encoding, enabling robust reconstruction and open-vocabulary queries.
Authors:Yuzhang Xie, Hejie Cui, Ziyang Zhang, Jiaying Lu, Kai Shu, Fadi Nahab, Xiao Hu, Carl Yang
Abstract:
Medical diagnosis prediction plays a critical role in disease detection and personalized healthcare. While machine learning (ML) models have been widely adopted for this task, their reliance on supervised training limits their ability to generalize to unseen cases, particularly given the high cost of acquiring large, labeled datasets. Large language models (LLMs) have shown promise in leveraging language abilities and biomedical knowledge for diagnosis prediction. However, they often suffer from hallucinations, lack structured medical reasoning, and produce useless outputs. To address these challenges, we propose KERAP, a knowledge graph (KG)-enhanced reasoning approach that improves LLM-based diagnosis prediction through a multi-agent architecture. Our framework consists of a linkage agent for attribute mapping, a retrieval agent for structured knowledge extraction, and a prediction agent that iteratively refines diagnosis predictions. Experimental results demonstrate that KERAP enhances diagnostic reliability efficiently, offering a scalable and interpretable solution for zero-shot medical diagnosis prediction.
中文: KERAP是一种知识图谱增强的多智能体框架,通过整合结构化医疗推理来提高大语言模型的诊断可靠性,为零样本医疗诊断预测提供了可扩展且可解释的解决方案。
English: KERAP is a knowledge graph-enhanced multi-agent framework that improves large language models' diagnostic reliability by integrating structured medical reasoning, offering a scalable and interpretable solution for zero-shot medical diagnosis prediction.
Authors:Le Xu, Qi Zhang, Qixian Zhang, Hongyun Zhang, Duoqian Miao, Cairong Zhao
Abstract:
Recent advances in brain-vision decoding have driven significant progress, reconstructing with high fidelity perceived visual stimuli from neural activity, e.g., functional magnetic resonance imaging (fMRI), in the human visual cortex. Most existing methods decode the brain signal using a two-level strategy, i.e., pixel-level and semantic-level. However, these methods rely heavily on low-level pixel alignment yet lack sufficient and fine-grained semantic alignment, resulting in obvious reconstruction distortions of multiple semantic objects. To better understand the brain's visual perception patterns and how current decoding models process semantic objects, we have developed an experimental framework that uses fMRI representations as intervention conditions. By injecting these representations into multi-scale image features via cross-attention, we compare both downstream performance and intermediate feature changes on object detection and instance segmentation tasks with and without fMRI information. Our results demonstrate that incorporating fMRI signals enhances the accuracy of downstream detection and segmentation, confirming that fMRI contains rich multi-object semantic cues and coarse spatial localization information-elements that current models have yet to fully exploit or integrate.
中文摘要:近期脑视觉解码方法虽能从神经活动中重建视觉刺激,但存在语义错位和失真问题;为此开发了通过交叉注意力整合fMRI数据的实验框架,证明其能利用丰富的语义与空间线索提升物体检测与分割的准确性。
English Summary: Recent brain-vision decoding methods have progressed in reconstructing visual stimuli from neural activity but suffer from semantic misalignment and distortion, prompting the development of a framework that integrates fMRI data via cross-attention to improve object detection and segmentation by leveraging its rich semantic and spatial cues.
Authors:Bappaditya Dey, Daniel Sorensen, Minjin Hwang, Sandip Halder
Abstract:
Semiconductor manufacturing is an extremely complex process, characterized by thousands of interdependent parameters collected across diverse tools and process steps. Multi-variate time-series (MTS) analysis has emerged as a critical methodology for enabling real-time monitoring, fault detection, and predictive maintenance in such environments. However, anomaly prediction in semiconductor fabrication presents several critical challenges, including high data dimensionality, severe class imbalance due to the rarity of true faults, noisy and missing measurements, and non-stationary behavior of production systems. Furthermore, the complex interdependencies between variables and the delayed emergence of faults across downstream stages complicate both anomaly detection and root-cause-analysis. This paper presents a novel and generic approach for anomaly detection in MTS data using machine learning. The proposed methodology consists of three main steps: a) converting MTS data into image-based representations using the Continuous Wavelet Transform, b) developing a multi-class image classifier by fine-tuning a pretrained VGG-16 architecture on custom CWT image datasets, and c) constructing a Siamese network composed of two identical sub-networks, each utilizing the fine-tuned VGG-16 as a backbone. The network takes pairs of CWT images as input -one serving as a reference or anchor (representing a known-good signal), and the other as a query (representing an unknown signal). The model then compares the embeddings of both inputs to determine whether they belong to the same class at a given time step. Our approach demonstrates high accuracy in identifying anomalies on a real FAB process time-series dataset, offering a promising solution for offline anomaly detection in process and tool trace data. Moreover, the approach is flexible and can be applied in both supervised and semi-supervised settings.
Chinese: 本文提出了一种新颖的机器学习方法,通过连续小波变换将多元时间序列数据转换为图像表示,并采用基于微调VGG-16架构的孪生网络,通过图像比对实现半导体制造过程中的异常检测。
English: This paper introduces a novel machine learning approach for anomaly detection in semiconductor manufacturing by converting multivariate time-series data into image representations using Continuous Wavelet Transform and employing a Siamese network with fine-tuned VGG-16 architecture to accurately identify faults through image comparison.
Authors:Yukang Cao, Chenyang Si, Jinghao Wang, Ziwei Liu
Abstract:
We present FreeMorph, the first tuning-free method for image morphing that accommodates inputs with different semantics or layouts. Unlike existing methods that rely on finetuning pre-trained diffusion models and are limited by time constraints and semantic/layout discrepancies, FreeMorph delivers high-fidelity image morphing without requiring per-instance training. Despite their efficiency and potential, tuning-free methods face challenges in maintaining high-quality results due to the non-linear nature of the multi-step denoising process and biases inherited from the pre-trained diffusion model. In this paper, we introduce FreeMorph to address these challenges by integrating two key innovations. 1) We first propose a guidance-aware spherical interpolation design that incorporates explicit guidance from the input images by modifying the self-attention modules, thereby addressing identity loss and ensuring directional transitions throughout the generated sequence. 2) We further introduce a step-oriented variation trend that blends self-attention modules derived from each input image to achieve controlled and consistent transitions that respect both inputs. Our extensive evaluations demonstrate that FreeMorph outperforms existing methods, being 10x ~ 50x faster and establishing a new state-of-the-art for image morphing.
中文: FreeMorph是首个无需调优的图像变形方法,能处理不同语义或布局的输入,无需逐实例训练即可实现高保真效果,速度比现有方法快10至50倍。
English: FreeMorph is the first tuning-free image morphing method that handles inputs with different semantics or layouts, achieving high-fidelity results 10x to 50x faster than existing methods without per-instance training.
Authors:Tong Liu, Yinuo Wang, Xujie Song, Wenjun Zou, Liangfa Chen, Likun Wang, Bin Shuai, Jingliang Duan, Shengbo Eben Li
Abstract:
Reinforcement learning has been proven to be highly effective in handling complex control tasks. Traditional methods typically use unimodal distributions, such as Gaussian distributions, to model the output of value distributions. However, unimodal distribution often and easily causes bias in value function estimation, leading to poor algorithm performance. This paper proposes a distributional reinforcement learning algorithm called DSAC-D (Distributed Soft Actor Critic with Diffusion Policy) to address the challenges of estimating bias in value functions and obtaining multimodal policy representations. A multimodal distributional policy iteration framework that can converge to the optimal policy was established by introducing policy entropy and value distribution function. A diffusion value network that can accurately characterize the distribution of multi peaks was constructed by generating a set of reward samples through reverse sampling using a diffusion model. Based on this, a distributional reinforcement learning algorithm with dual diffusion of the value network and the policy network was derived. MuJoCo testing tasks demonstrate that the proposed algorithm not only learns multimodal policy, but also achieves state-of-the-art (SOTA) performance in all 9 control tasks, with significant suppression of estimation bias and total average return improvement of over 10% compared to existing mainstream algorithms. The results of real vehicle testing show that DSAC-D can accurately characterize the multimodal distribution of different driving styles, and the diffusion policy network can characterize multimodal trajectories.
中文摘要:本文提出DSAC-D算法,通过扩散模型精确估计多峰价值分布并抑制估计偏差,在控制任务中实现最佳性能,平均回报提升超10%。
English Summary: This paper introduces DSAC-D, a distributional reinforcement learning algorithm that utilizes diffusion models to accurately estimate multimodal value distributions and suppress estimation bias, achieving state-of-the-art performance in control tasks with over 10% average return improvement.
Authors:Simon ReiÃ, Zdravko Marinov, Alexander Jaus, Constantin Seibold, M. Saquib Sarfraz, Erik Rodner, Rainer Stiefelhagen
Abstract:
In this paper, we explore the potential of visual in-context learning to enable a single model to handle multiple tasks and adapt to new tasks during test time without re-training. Unlike previous approaches, our focus is on training in-context learners to adapt to sequences of tasks, rather than individual tasks. Our goal is to solve complex tasks that involve multiple intermediate steps using a single model, allowing users to define entire vision pipelines flexibly at test time. To achieve this, we first examine the properties and limitations of visual in-context learning architectures, with a particular focus on the role of codebooks. We then introduce a novel method for training in-context learners using a synthetic compositional task generation engine. This engine bootstraps task sequences from arbitrary segmentation datasets, enabling the training of visual in-context learners for compositional tasks. Additionally, we investigate different masking-based training objectives to gather insights into how to train models better for solving complex, compositional tasks. Our exploration not only provides important insights especially for multi-modal medical task sequences but also highlights challenges that need to be addressed.
中文: 本文提出了一种利用视觉上下文学习和合成任务生成引擎的新方法,使单一模型无需重新训练即可处理序列化和组合式视觉任务,尤其适用于多模态医疗应用场景。
English: This paper introduces a novel method using visual in-context learning and a synthetic task generation engine to enable a single model to handle sequential and compositional vision tasks without retraining, with particular relevance to multi-modal medical applications.
Authors:Mingkai Deng, Jinyu Hou, Yilin Shen, Hongxia Jin, Graham Neubig, Zhiting Hu, Eric Xing
Abstract:
AI agents built on large language models (LLMs) hold enormous promise, but current practice focuses on a one-task-one-agent approach, which not only falls short of scalability and generality, but also suffers from the fundamental limitations of autoregressive LLMs. On the other hand, humans are general agents who reason by mentally simulating the outcomes of their actions and plans. Moving towards a more general and powerful AI agent, we introduce SimuRA, a goal-oriented architecture for generalized agentic reasoning. Based on a principled formulation of optimal agent in any environment, \modelname overcomes the limitations of autoregressive reasoning by introducing a world model for planning via simulation. The generalized world model is implemented using LLM, which can flexibly plan in a wide range of environments using the concept-rich latent space of natural language. Experiments on difficult web browsing tasks show that \modelname improves the success of flight search from 0\% to 32.2\%. World-model-based planning, in particular, shows consistent advantage of up to 124\% over autoregressive planning, demonstrating the advantage of world model simulation as a reasoning paradigm. We are excited about the possibility for training a single, general agent model based on LLMs that can act superintelligently in all environments. To start, we make SimuRA, a web-browsing agent built on \modelname with pretrained LLMs, available as a research demo for public testing.
Chinese: SimuRA提出了一种面向目标的架构,通过使用世界模型进行基于模拟的规划,克服了自回归推理的局限性,在航班搜索等任务中将成功率从0%显著提升至32.2%。
English: SimuRA introduces a goal-oriented architecture that overcomes the limitations of autoregressive reasoning by using a world model for simulation-based planning, significantly improving performance in tasks like flight search from 0% to 32.2% success rate.
Authors:Lijia Liu, Takumi Kondo, Kyohei Atarashi, Koh Takeuchi, Jiyi Li, Shigeru Saito, Hisashi Kashima
Abstract:
This paper investigates defenses for LLM-based evaluation systems against prompt injection. We formalize a class of threats called blind attacks, where a candidate answer is crafted independently of the true answer to deceive the evaluator. To counter such attacks, we propose a framework that augments Standard Evaluation (SE) with Counterfactual Evaluation (CFE), which re-evaluates the submission against a deliberately false ground-truth answer. An attack is detected if the system validates an answer under both standard and counterfactual conditions. Experiments show that while standard evaluation is highly vulnerable, our SE+CFE framework significantly improves security by boosting attack detection with minimal performance trade-offs.
中文摘要:本文提出将标准评估与反事实评估相结合的防御框架,用于检测针对基于大语言模型的评估系统的盲攻击,在保持性能的同时显著提升了系统安全性。
English Summary: This paper introduces a defense framework combining Standard Evaluation with Counterfactual Evaluation to detect blind attacks on LLM-based evaluation systems, significantly improving security while maintaining performance.
Authors:Trever Schirmer, David Bermbach
Abstract:
Most cloud platforms have a Function-as-a-Service (FaaS) offering that enables users to easily write highly scalable applications. To better understand how the platform's architecture impacts its performance, we present a research-focused testbed that can be adapted to quickly evaluate the impact of different architectures and technologies on the characteristics of scalability-focused FaaS platforms.
Chinese: 本研究提出了一种研究测试平台,用于评估不同架构和技术对函数即服务平台可扩展性的影响,以更好地理解其性能表现。
English: This study introduces a research testbed designed to evaluate how various architectures and technologies affect the scalability of Function-as-a-Service platforms, helping to understand their performance impacts.
Authors:Hongpei Li, Hui Yuan, Han Zhang, Jianghao Lin, Dongdong Ge, Mengdi Wang, Yinyu Ye
Abstract:
Mixed-Integer Linear Programming (MILP) is a foundational tool for complex decision-making problems. However, the NP-hard nature of MILP presents a significant computational challenge, motivating the development of machine learning-based heuristic solutions to accelerate downstream solvers. While recent generative models have shown promise in learning powerful heuristics, they suffer from a critical limitation. That is, they model the distribution of only the integer variables and fail to capture the intricate coupling between integer and continuous variables, creating an information bottleneck and ultimately leading to suboptimal solutions. To this end, we propose Joint Continuous-Integer Flow for Mixed-Integer Linear Programming (FMIP), which is the first generative framework that models the joint distribution of both integer and continuous variables for MILP solutions. Built upon the joint modeling paradigm, a holistic guidance mechanism is designed to steer the generative trajectory, actively refining solutions toward optimality and feasibility during the inference process. Extensive experiments on eight standard MILP benchmarks demonstrate the superior performance of FMIP against existing baselines, reducing the primal gap by 41.34% on average. Moreover, we show that FMIP is fully compatible with arbitrary backbone networks and various downstream solvers, making it well-suited for a broad range of real-world MILP applications.
中文: 提出的FMIP框架是首个同时建模混合整数规划中整数与连续变量联合分布的生成式模型,通过主动引导生成过程显著减小最优性差距,同时保持与各类求解器和网络架构的兼容性。
English: The proposed FMIP framework is the first generative model to jointly capture the distribution of both integer and continuous variables in MILP problems, actively guiding solution generation to significantly reduce optimality gaps while maintaining compatibility with various solvers and networks.
Authors:Jincheng Cao, Ruichen Jiang, Erfan Yazdandoost Hamedani, Aryan Mokhtari
Abstract:
In this paper, we study the problem of solving a simple bilevel optimization problem, where the upper-level objective is minimized over the solution set of the lower-level problem. We focus on the general setting in which both the upper- and lower-level objectives are smooth but potentially nonconvex. Due to the absence of additional structural assumptions for the lower-level objective-such as convexity or the Polyak-Åojasiewicz (PL) condition-guaranteeing global optimality is generally intractable. Instead, we introduce a suitable notion of stationarity for this class of problems and aim to design a first-order algorithm that finds such stationary points in polynomial time. Intuitively, stationarity in this setting means the upper-level objective cannot be substantially improved locally without causing a larger deterioration in the lower-level objective. To this end, we show that a simple and implementable variant of the dynamic barrier gradient descent (DBGD) framework can effectively solve the considered nonconvex simple bilevel problems up to stationarity. Specifically, to reach an $(ε_f, ε_g)$-stationary point-where $ε_f$ and $ε_g$ denote the target stationarity accuracies for the upper- and lower-level objectives, respectively-the considered method achieves a complexity of $\mathcal{O}\left(\max\left(ε_f^{-\frac{3+p}{1+p}}, ε_g^{-\frac{3+p}{2}}\right)\right)$, where $p \geq 0$ is an arbitrary constant balancing the terms. To the best of our knowledge, this is the first complexity result for a discrete-time algorithm that guarantees joint stationarity for both levels in general nonconvex simple bilevel problems.
本文针对非凸简单双层优化问题提出了一种一阶算法,无需下层目标函数的凸性假设即可在多项式时间内有效找到平稳点。
This paper introduces a first-order algorithm that efficiently finds stationary points for nonconvex simple bilevel optimization problems, achieving polynomial-time complexity without requiring convexity assumptions at the lower level.
Authors:Zhuocheng Liu, Zhishu Shen, Qiushi Zheng, Tiehua Zhang, Zheng Lei, Jiong Jin
Abstract:
Low Earth Orbit (LEO) satellites are emerging as key components of 6G networks, with many already deployed to support large-scale Earth observation and sensing related tasks. Federated Learning (FL) presents a promising paradigm for enabling distributed intelligence in these resource-constrained and dynamic environments. However, achieving reliable convergence, while minimizing both processing time and energy consumption, remains a substantial challenge, particularly in heterogeneous and partially unlabeled satellite networks. To address this challenge, we propose a novel semi-supervised federated learning framework tailored for LEO satellite networks with hierarchical clustering aggregation. To further reduce communication overhead, we integrate sparsification and adaptive weight quantization techniques. In addition, we divide the FL clustering into two stages: satellite cluster aggregation stage and Ground Stations (GSs) aggregation stage. The supervised learning at GSs guides selected Parameter Server (PS) satellites, which in turn support fully unlabeled satellites during the federated training process. Extensive experiments conducted on a satellite network testbed demonstrate that our proposal can significantly reduce processing time (up to 3x) and energy consumption (up to 4x) compared to other comparative methods while maintaining model accuracy.
Chinese: 针对低轨卫星网络提出了一种具有分层聚类和通信优化技术的新型半监督联邦学习框架,在保持模型精度的同时,处理速度提升高达3倍,能耗降低达4倍。
English: A novel semi-supervised federated learning framework with hierarchical clustering and communication optimization techniques is proposed for LEO satellite networks, achieving up to 3x faster processing and 4x lower energy consumption while maintaining model accuracy.
Authors:Shuqing Li, Qiang Chen, Xiaoxue Ren, Michael R. Lyu
Abstract:
Physics Engines (PEs) are fundamental software frameworks that simulate physical interactions in applications ranging from entertainment to safety-critical systems. Despite their importance, PEs suffer from physics failures, deviations from expected physical behaviors that can compromise software reliability, degrade user experience, and potentially cause critical failures in autonomous vehicles or medical robotics. Current testing approaches for PE-based software are inadequate, typically requiring white-box access and focusing on crash detection rather than semantically complex physics failures. This paper presents the first large-scale empirical study characterizing physics failures in PE-based software. We investigate three research questions addressing the manifestations of physics failures, the effectiveness of detection techniques, and developer perceptions of current detection practices. Our contributions include: (1) a taxonomy of physics failure manifestations; (2) a comprehensive evaluation of detection methods including deep learning, prompt-based techniques, and large multimodal models; and (3) actionable insights from developer experiences for improving detection approaches. To support future research, we release PhysiXFails, code, and other materials at https://sites.google.com/view/physics-failure-detection.
中文: 本文首次对物理引擎中的物理故障进行大规模实证研究,提出了故障表现分类法,评估了包括深度学习和多模态模型在内的检测方法,并通过开发者经验提供改进可靠性的实用建议,同时公开了所有研究资源以供后续探索。
English: This paper presents the first large-scale empirical study on physics failures in Physics Engines, introducing a taxonomy of failure manifestations, evaluating detection methods including deep learning and multimodal models, and providing developer insights to improve reliability, with all resources released for future research.
Authors:Zigang Geng, Yibing Wang, Yeyao Ma, Chen Li, Yongming Rao, Shuyang Gu, Zhao Zhong, Qinglin Lu, Han Hu, Xiaosong Zhang, Linus, Di Wang, Jie Jiang
Abstract:
Numerous efforts have been made to extend the ``next token prediction'' paradigm to visual contents, aiming to create a unified approach for both image generation and understanding. Nevertheless, attempts to generate images through autoregressive modeling with discrete tokens have been plagued by issues such as low visual fidelity, distorted outputs, and failure to adhere to complex instructions when rendering intricate details. These shortcomings are likely attributed to cumulative errors during autoregressive inference or information loss incurred during the discretization process. Probably due to this challenge, recent research has increasingly shifted toward jointly training image generation with diffusion objectives and language generation with autoregressive objectives, moving away from unified modeling approaches. In this work, we demonstrate that reinforcement learning can effectively mitigate artifacts and largely enhance the generation quality of a discrete autoregressive modeling method, thereby enabling seamless integration of image and language generation. Our framework comprises a semantic image tokenizer, a unified autoregressive model for both language and images, and an offline diffusion decoder for image generation, termed X-Omni. X-Omni achieves state-of-the-art performance in image generation tasks using a 7B language model, producing images with high aesthetic quality while exhibiting strong capabilities in following instructions and rendering long texts.
Chinese: 强化学习有效提升了离散自回归模型的生成质量,通过X-Omni框架实现了图像与语言生成的无缝融合,该框架以7B语言模型在图像生成任务中达到顶尖水平,兼具高美学品质和强大的指令遵循能力。
English: Reinforcement learning effectively enhances the generation quality of a discrete autoregressive model, enabling seamless integration of image and language generation through the X-Omni framework, which achieves state-of-the-art performance with high aesthetic quality and strong instruction-following capabilities.
Authors:Tesshu Hanaka, Michael Lampis, Nikolaos Melissinos, Edouard Nemery, Hirotaka Ono, Manolis Vasilakis
Abstract:
We consider the \textsc{Steiner Orientation} problem, where we are given as input a mixed graph $G=(V,E,A)$ and a set of $k$ demand pairs $(s_i,t_i)$, $i\in[k]$. The goal is to orient the undirected edges of $G$ in a way that the resulting directed graph has a directed path from $s_i$ to $t_i$ for all $i\in[k]$. We adopt the point of view of structural parameterized complexity and investigate the complexity of \textsc{Steiner Orientation} for standard measures, such as treewidth. Our results indicate that \textsc{Steiner Orientation} is a surprisingly hard problem from this point of view. In particular, our main contributions are the following: (1) We show that \textsc{Steiner Orientation} is NP-complete on instances where the underlying graph has feedback vertex number 2, treewidth 2, pathwidth 3, and vertex integrity 6; (2) We present an XP algorithm parameterized by vertex cover number $\mathrm{vc}$ of complexity $n^{\mathcal{O}(\mathrm{vc}^2)}$. Furthermore, we show that this running time is essentially optimal by proving that a running time of $n^{o(\mathrm{vc}^2)}$ would refute the ETH; (3) We consider parameterizations by the number of undirected or directed edges ($|E|$ or $|A|$) and we observe that the trivial $2^{|E|}n^{\mathcal{O}(1)}$-time algorithm for the former parameter is optimal under the SETH. Complementing this, we show that the problem admits a $2^{\mathcal{O}(|A|)}n^{\mathcal{O}(1)}$-time algorithm. In addition to the above, we consider the complexity of \textsc{Steiner Orientation} parameterized by $\mathrm{tw}+k$ (FPT), distance to clique (FPT), and $\mathrm{vc}+k$ (FPT with a polynomial kernel).
中文摘要:研究表明,即使在树宽为2等低结构参数下,斯坦纳定向问题仍是NP完全问题,同时针对其他参数化提出了具有紧复杂度界的XP和FPT算法。
English Summary: The study demonstrates that the Steiner Orientation problem is NP-complete even under low structural parameters like treewidth 2, while presenting XP and FPT algorithms with tight complexity bounds for other parameterizations.
Authors:Jaydip Sen, Subhasis Dasgupta, Hetvi Waghela
Abstract:
Contrastive Decoding (CD) has emerged as an effective inference-time strategy for enhancing open-ended text generation by exploiting the divergence in output probabilities between a large expert language model and a smaller amateur model. Although CD improves coherence and fluency, its dependence on a single amateur restricts its capacity to capture the diverse and multifaceted failure modes of language generation, such as repetition, hallucination, and stylistic drift. This paper proposes Multi-Amateur Contrastive Decoding (MACD), a generalization of the CD framework that employs an ensemble of amateur models to more comprehensively characterize undesirable generation patterns. MACD integrates contrastive signals through both averaging and consensus penalization mechanisms and extends the plausibility constraint to operate effectively in the multi-amateur setting. Furthermore, the framework enables controllable generation by incorporating amateurs with targeted stylistic or content biases. Experimental results across multiple domains, such as news, encyclopedic, and narrative, demonstrate that MACD consistently surpasses conventional decoding methods and the original CD approach in terms of fluency, coherence, diversity, and adaptability, all without requiring additional training or fine-tuning.
中文摘要:对比解码通过引入多业余模型对比解码(MACD),利用多个业余模型更全面地识别生成错误,无需额外训练即可在多个领域提升文本的流畅性、连贯性和多样性。
English Summary: Contrastive Decoding is enhanced by Multi-Amateur Contrastive Decoding (MACD), which uses multiple amateur models to better identify and avoid diverse generation errors, improving text quality across various domains without extra training.
Authors:Hongzhen Zhang, Tim Kerkenhoff, Neil Kichler, Manuel Dahmen, Alexander Mitsos, Uwe Naumann, Dominik Bongartz
Abstract:
Spatial Branch and Bound (B&B) algorithms are widely used for solving nonconvex problems to global optimality, yet they remain computationally expensive. Though some works have been carried out to speed up B&B via CPU parallelization, GPU parallelization is much less explored. In this work, we investigate the design of a spatial B&B algorithm that involves an interval-based GPU-parallel lower bounding solver: The domain of each B&B node is temporarily partitioned into numerous subdomains, then massive GPU parallelism is leveraged to compute interval bounds of the objective function and constraints on each subdomain, using the Mean Value Form. The resulting bounds are tighter than those achieved via regular interval arithmetic without partitioning, but they remain fast to compute. We implement the method into our open-source solver MAiNGO via CUDA in two manners: wrapping all GPU tasks within one kernel function, or distributing the GPU tasks onto a CUDA graph. Numerical experiments show that using more subdomains leads to significantly tighter lower bounds and thus less B&B iterations. Regarding wall clock time, the proposed spatial B&B framework achieves a speedup of three orders of magnitude compared to applying interval arithmetic on the CPU without domain partitioning. Among the two implementations, the one developed with CUDA graph enables higher efficiency. Moreover, in some case studies, the proposed method delivers competitive or better performance compared to MAiNGO's default solver which is based on McCormick relaxations. These results highlight the potential of GPU-accelerated bounding techniques to accelerate B&B algorithms.
中文: 本研究提出了一种基于GPU并行的空间分支定界算法,通过区间划分计算更紧密的边界,在保持与现有求解器相当性能的同时,相比CPU方法实现了千倍的加速效果。
English: This study introduces a GPU-parallel spatial branch and bound algorithm that uses interval-based partitioning to compute tighter bounds efficiently, achieving a thousandfold speedup over CPU methods while maintaining competitive performance with existing solvers.
Authors:Avi Deb Raha, Apurba Adhikary, Mrityunjoy Gain, Yumin Park, Walid Saad, Choong Seon Hong
Abstract:
Deep Joint Source-Channel Coding (Deep-JSCC) has emerged as a promising semantic communication approach for wireless image transmission by jointly optimizing source and channel coding using deep learning techniques. However, traditional Deep-JSCC architectures employ fixed encoder-decoder structures, limiting their adaptability to varying device capabilities, real-time performance optimization, power constraints and channel conditions. To address these limitations, we propose DD-JSCC: Dynamic Deep Joint Source-Channel Coding for Semantic Communications, a novel encoder-decoder architecture designed for semantic communication systems. Unlike traditional Deep-JSCC models, DD-JSCC is flexible for dynamically adjusting its layer structures in real-time based on transmitter and receiver capabilities, power constraints, compression ratios, and current channel conditions. This adaptability is achieved through a hierarchical layer activation mechanism combined with implicit regularization via sequential randomized training, effectively reducing combinatorial complexity, preventing overfitting, and ensuring consistent feature representations across varying configurations. Simulation results demonstrate that DD-JSCC enhances the performance of image reconstruction in semantic communications, achieving up to 2 dB improvement in Peak Signal-to-Noise Ratio (PSNR) over fixed Deep-JSCC architectures, while reducing training costs by over 40%. The proposed unified framework eliminates the need for multiple specialized models, significantly reducing training complexity and deployment overhead.
中文: DD-JSCC提出了一种动态编码器-解码器架构,能根据设备能力和信道条件实时调整结构,在语义通信中使图像重建性能提升最高2 dB PSNR,同时比固定架构降低40%以上训练成本。
English: DD-JSCC introduces a dynamic encoder-decoder architecture that adapts in real-time to device capabilities and channel conditions, improving image transmission performance by up to 2 dB PSNR while cutting training costs by over 40% compared to fixed Deep-JSCC systems.
Authors:Aishwarya Mandyam, Jason Meng, Ge Gao, Jiankai Sun, Mac Schwager, Barbara E. Engelhardt, Emma Brunskill
Abstract:
Off-policy evaluation (OPE) methods aim to estimate the value of a new reinforcement learning (RL) policy prior to deployment. Recent advances have shown that leveraging auxiliary datasets, such as those synthesized by generative models, can improve the accuracy of these value estimates. Unfortunately, such auxiliary datasets may also be biased, and existing methods for using data augmentation for OPE in RL lack principled uncertainty quantification. In high stakes settings like healthcare, reliable uncertainty estimates are important for comparing policy value estimates. In this work, we propose two approaches to construct valid confidence intervals for OPE when using data augmentation. The first provides a confidence interval over the policy performance conditioned on a particular initial state $V^Ï(s_0)$-- such intervals are particularly important for human-centered applications. To do so we introduce a new conformal prediction method for high dimensional state MDPs. Second, we consider the more common task of estimating the average policy performance over many initial states; to do so we draw on ideas from doubly robust estimation and prediction powered inference. Across simulators spanning robotics, healthcare and inventory management, and a real healthcare dataset from MIMIC-IV, we find that our methods can use augmented data and still consistently produce intervals that cover the ground truth values, unlike previously proposed methods.
中文摘要:本研究提出了两种新方法,利用增强数据构建可靠的置信区间进行离线策略评估,确保在医疗和机器人等不同应用场景中实现精确的不确定性量化。
English Summary: This study introduces two novel methods for constructing reliable confidence intervals in off-policy evaluation using augmented data, ensuring accurate uncertainty quantification across various applications including healthcare and robotics.
Authors:Sheethal Bhat, Bogdan Georgescu, Adarsh Bhandary Panambur, Mathias Zinnen, Tri-Thien Nguyen, Awais Mansoor, Karim Khalifa Elbarbary, Siming Bayer, Florin-Cristian Ghesu, Sasa Grbic, Andreas Maier
Abstract:
Detecting abnormalities in medical images poses unique challenges due to differences in feature representations and the intricate relationship between anatomical structures and abnormalities. This is especially evident in mammography, where dense breast tissue can obscure lesions, complicating radiological interpretation. Despite leveraging anatomical and semantic context, existing detection methods struggle to learn effective class-specific features, limiting their applicability across different tasks and imaging modalities. In this work, we introduce Exemplar Med-DETR, a novel multi-modal contrastive detector that enables feature-based detection. It employs cross-attention with inherently derived, intuitive class-specific exemplar features and is trained with an iterative strategy. We achieve state-of-the-art performance across three distinct imaging modalities from four public datasets. On Vietnamese dense breast mammograms, we attain an mAP of 0.7 for mass detection and 0.55 for calcifications, yielding an absolute improvement of 16 percentage points. Additionally, a radiologist-supported evaluation of 100 mammograms from an out-of-distribution Chinese cohort demonstrates a twofold gain in lesion detection performance. For chest X-rays and angiography, we achieve an mAP of 0.25 for mass and 0.37 for stenosis detection, improving results by 4 and 7 percentage points, respectively. These results highlight the potential of our approach to advance robust and generalizable detection systems for medical imaging.
中文: 该研究提出的Exemplar Med-DETR多模态对比检测器,通过采用类别特异性范例特征和迭代训练策略,在多种医学影像任务中实现最优性能,显著提升了乳腺X光、胸部X光和血管造影的病灶检测效果。
English: The study introduces Exemplar Med-DETR, a multi-modal contrastive detector that achieves state-of-the-art performance across diverse medical imaging tasks, significantly improving lesion detection in mammograms, chest X-rays, and angiography through its innovative use of class-specific exemplar features and iterative training.
Authors:Viktar Dubovik, Åukasz Struski, Jacek Tabor, Dawid Rymarczyk
Abstract:
Understanding the decisions made by deep neural networks is essential in high-stakes domains such as medical imaging and autonomous driving. Yet, these models often lack transparency, particularly in computer vision. Prototypical-parts-based neural networks have emerged as a promising solution by offering concept-level explanations. However, most are limited to fine-grained classification tasks, with few exceptions such as InfoDisent. InfoDisent extends prototypical models to large-scale datasets like ImageNet, but produces complex explanations.
We introduce Sparse Information Disentanglement for Explainability (SIDE), a novel method that improves the interpretability of prototypical parts through a dedicated training and pruning scheme that enforces sparsity. Combined with sigmoid activations in place of softmax, this approach allows SIDE to associate each class with only a small set of relevant prototypes. Extensive experiments show that SIDE matches the accuracy of existing methods while reducing explanation size by over $90\%$, substantially enhancing the understandability of prototype-based explanations.
中文: SIDE是一种新颖方法,通过强制稀疏性和使用Sigmoid激活函数,显著提升了原型神经网络的解释性,在保持准确性的同时将解释规模减少超过90%。
English: SIDE is a novel method that enhances the interpretability of prototypical neural networks by enforcing sparsity and using sigmoid activations, reducing explanation size by over 90% while maintaining accuracy.
Authors:Monika WysoczaÅska, Shyamal Buch, Anurag Arnab, Cordelia Schmid
Abstract:
Large vision-language models (VLMs) often struggle to generate long and factual captions. However, traditional measures for hallucination and factuality are not well suited for evaluating longer, more diverse captions and in settings where ground-truth human-annotated captions are unavailable. We introduce OV-Fact, a novel method for measuring caption factuality of long captions that leverages open-vocabulary visual grounding and tool-based verification without depending on human annotations. Our method improves agreement with human judgments and captures both caption descriptiveness (recall) and factual precision in the same metric. Furthermore, unlike previous metrics, our reference-free method design enables new applications towards factuality-based data filtering. We observe models trained on an OVFact-filtered (2.5-5x less) subset of a large-scale, noisy (VLM-generated) pretraining set meaningfully improve factuality precision without sacrificing caption descriptiveness across a range of downstream long caption benchmarks.
中文摘要:OV-Fact方法提出了一种无需人工标注的开放式词汇视觉定位技术,通过工具验证评估长描述的事实准确性,不仅能更好匹配人类判断,还能通过数据筛选显著提升模型事实精确度而不损失描述丰富性。
English Summary: The OV-Fact method introduces a reference-free approach for evaluating long caption factuality using open-vocabulary visual grounding, which improves human judgment alignment and enables effective data filtering to enhance model precision without compromising descriptiveness.
Authors:Santiago Berrezueta-Guzman, WenChun Chen, Stefan Wagner
Abstract:
Virtual Reality (VR) offers promising avenues for innovative therapeutic interventions in populations with intellectual disabilities (ID). This paper presents the design, development, and evaluation of Space Exodus, a novel VR-based role-playing game specifically tailored for children with ID. By integrating immersive gameplay with therapeutic task design, Space Exodus aims to enhance concentration, cognitive processing, and fine motor skills through structured hand-eye coordination exercises. A six-week pre-test/post-test study was conducted with 16 children in Ecuador, using standardized assessments, the Toulouse-Pieron Cancellation Test, and the Moss Attention Rating Scale complemented by detailed observational metrics. Quantitative results indicate statistically significant improvements in concentration scores, with test scores increasing from 65.2 to 80.3 and 55.4 to 68.7, respectively (p < 0.01). Qualitative observations revealed reduced task attempts, enhanced user confidence, and increased active participation. The inclusion of a VR assistant provided consistent guidance that further boosted engagement. These findings demonstrate the potential of immersive, game-based learning environments as practical therapeutic tools, laying a robust foundation for developing inclusive and adaptive rehabilitation strategies for children with ID.
中文: 本文介绍了专为智障儿童设计的虚拟现实角色扮演游戏《太空出埃及记》,通过六周研究表明该游戏显著提升了儿童的专注力和认知能力,验证了沉浸式游戏作为有效治疗工具的潜力。
English: This paper introduces Space Exodus, a VR role-playing game designed for children with intellectual disabilities, demonstrating through a six-week study significant improvements in concentration and cognitive skills, thereby supporting the use of immersive games as effective therapeutic tools.
Authors:Yizhou Peng, Yi-Wen Chao, Dianwen Ng, Yukun Ma, Chongjia Ni, Bin Ma, Eng Siong Chng
Abstract:
Full-duplex spoken dialogue systems (FDSDS) enable more natural human-machine interactions by allowing real-time user interruptions and backchanneling, compared to traditional SDS that rely on turn-taking. However, existing benchmarks lack metrics for FD scenes, e.g., evaluating model performance during user interruptions. In this paper, we present a comprehensive FD benchmarking pipeline utilizing LLMs, TTS, and ASR to address this gap. It assesses FDSDS's ability to handle user interruptions, manage delays, and maintain robustness in challenging scenarios with diverse novel metrics. We applied our benchmark to three open-source FDSDS (Moshi, Freeze-omni, and VITA-1.5) using over 40 hours of generated speech, with 293 simulated conversations and 1,200 interruptions. The results show that all models continue to face challenges, such as failing to respond to user interruptions, under frequent disruptions and noisy conditions. Demonstrations, data, and code will be released.
中文: 本文提出了一个全面的全双工基准测试流程,用于评估语音对话系统处理用户打断和保持鲁棒性的能力,结果表明现有模型在频繁干扰的模拟对话中仍面临持续挑战。
English: This paper introduces a comprehensive full-duplex benchmarking pipeline that evaluates spoken dialogue systems' ability to handle user interruptions and maintain robustness, revealing persistent challenges in existing models during simulated conversations with frequent disruptions.
Authors:Junfei Xiao, Ceyuan Yang, Lvmin Zhang, Shengqu Cai, Yang Zhao, Yuwei Guo, Gordon Wetzstein, Maneesh Agrawala, Alan Yuille, Lu Jiang
Abstract:
We present Captain Cinema, a generation framework for short movie generation. Given a detailed textual description of a movie storyline, our approach firstly generates a sequence of keyframes that outline the entire narrative, which ensures long-range coherence in both the storyline and visual appearance (e.g., scenes and characters). We refer to this step as top-down keyframe planning. These keyframes then serve as conditioning signals for a video synthesis model, which supports long context learning, to produce the spatio-temporal dynamics between them. This step is referred to as bottom-up video synthesis. To support stable and efficient generation of multi-scene long narrative cinematic works, we introduce an interleaved training strategy for Multimodal Diffusion Transformers (MM-DiT), specifically adapted for long-context video data. Our model is trained on a specially curated cinematic dataset consisting of interleaved data pairs. Our experiments demonstrate that Captain Cinema performs favorably in the automated creation of visually coherent and narrative consistent short movies in high quality and efficiency. Project page: https://thecinema.ai
Chinese: Captain Cinema 是一个通过先规划关键帧确保叙事连贯性,再结合多模态扩散模型合成视频的框架,能够高效生成高质量且情节一致的短片电影。
English: Captain Cinema is a framework that generates short movies by first planning keyframes for narrative coherence and then synthesizing videos using a multimodal diffusion model, producing high-quality, consistent cinematic works efficiently.
Authors:Michael Konstantinou, Renzo Degiovanni, Jie M. Zhang, Mark Harman, Mike Papadakis
Abstract:
Recent advances in automated test generation utilises language models to produce unit tests. While effective, language models tend to generate many incorrect tests with respect to both syntax and semantics. Although such incorrect tests can be easily detected and discarded, they constitute a "missed opportunity" -- if fixed, they are often valuable as they directly add testing value (they effectively target the underlying program logic to be tested) and indirectly form good seeds for generating additional tests. To this end, we propose a simple technique for repairing some of these incorrect tests through a combination of rule-based static analysis and re-prompting. We evaluate this simple approach, named YATE, on a set of 6 open-source projects and show that it can effectively produce tests that cover on average 32.06% more lines and kill 21.77% more mutants than a plain LLM-based method. We also compare YATE with four other LLM-based methods, namely HITS, SYMPROMPT, TESTSPARK and COVERUP and show that it produces tests that cover substantially more code. YATE achieves 22% higher line coverage, 20% higher branch coverage and kill 20% more mutants at a comparable cost (number of calls to LLMs).
Chinese: 近期自动化测试生成利用语言模型取得进展,但常产生错误测试;YATE技术结合静态分析和重新提示有效修复这些测试,相比其他方法显著提高了代码覆盖率和变异体杀死率。
English: Recent advances use language models for automated test generation, but they often produce incorrect tests; YATE, a new technique combining static analysis and re-prompting, effectively repairs these tests, achieving significantly higher code coverage and mutant killing rates than other methods.
Authors:Zhongzheng Yuan, Lianshuai Guo, Xunkai Li, Yinlin Zhu, Wenyu Wang, Meixia Qu
Abstract:
Federated Graph Learning (FGL) is a distributed learning paradigm that enables collaborative training over large-scale subgraphs located on multiple local systems. However, most existing FGL approaches rely on synchronous communication, which leads to inefficiencies and is often impractical in real-world deployments. Meanwhile, current asynchronous federated learning (AFL) methods are primarily designed for conventional tasks such as image classification and natural language processing, without accounting for the unique topological properties of graph data. Directly applying these methods to graph learning can possibly result in semantic drift and representational inconsistency in the global model. To address these challenges, we propose FedSA-GCL, a semi-asynchronous federated framework that leverages both inter-client label distribution divergence and graph topological characteristics through a novel ClusterCast mechanism for efficient training. We evaluate FedSA-GCL on multiple real-world graph datasets using the Louvain and Metis split algorithms, and compare it against 9 baselines. Extensive experiments demonstrate that our method achieves strong robustness and outstanding efficiency, outperforming the baselines by an average of 2.92% with the Louvain and by 3.4% with the Metis.
中文:FedSA-GCL作为一种半异步联邦框架,通过ClusterCast机制利用客户端间标签分布差异和图拓扑特性,有效解决了同步通信低效和图学习中语义漂移问题,在真实图数据集上展现出超越基线方法的鲁棒性和效率优势。
English: FedSA-GCL is a semi-asynchronous federated framework that addresses inefficiencies in synchronous communication and semantic drift in graph learning by leveraging inter-client label divergence and graph topology through its ClusterCast mechanism, achieving superior robustness and efficiency over baselines on real-world datasets.
Authors:Jonas Elmiger, Gian Marti, Christoph Studer
Abstract:
Modern positioning relies on radio signals from global navigation satellite systems (GNSS). Their low receive power renders these radio signals susceptible to jamming attacks, in which malicious transmitters emit strong interference to disrupt signal acquisition. Moreover, GNSS are vulnerable to spoofing attacks, in which malicious transmitters mimic legitimate satellites by transmitting spurious GNSS signals. We propose SCHIEBER, a novel method for multi-antenna GNSS receivers that mitigates jammers as well as spoofers without requiring any prior knowledge of the receiver position or attack type: Jammers are mitigated during signal acquisition using a recently developed adaptive spatial filtering technique. Spoofers are identified and rejected after signal acquisition using a novel approach that tests the consistency of acquired signals by comparing their respective direction of arrival (DoA) and pseudorange estimates in a test that is invariant with respect to the unknown receiver position. We demonstrate the efficacy of our method using extensive simulations of a GPS L1 C/A system under spoofing and jamming attacks.
中文:SCHIEBER是一种新型多天线GNSS接收器方法,在信号捕获阶段通过自适应空间滤波抑制干扰,并通过比较信号到达方向和伪距来检验一致性以识别和排除欺骗攻击,无需预先了解位置或攻击类型。
English: SCHIEBER is a novel multi-antenna GNSS receiver method that mitigates jamming through adaptive spatial filtering during signal acquisition and identifies spoofers by testing signal consistency via DoA and pseudorange comparisons, without prior knowledge of position or attack type.
Authors:Ali Abedi, Sadaf Safa, Tracey J. F. Colella, Shehroz S. Khan
Abstract:
Engagement in virtual learning is essential for participant satisfaction, performance, and adherence, particularly in online education and virtual rehabilitation, where interactive communication plays a key role. Yet, accurately measuring engagement in virtual group settings remains a challenge. There is increasing interest in using artificial intelligence (AI) for large-scale, real-world, automated engagement recognition. While engagement has been widely studied in younger academic populations, research and datasets focused on older adults in virtual and telehealth learning settings remain limited. Existing methods often neglect contextual relevance and the longitudinal nature of engagement across sessions. This paper introduces OPEN (Older adult Patient ENgagement), a novel dataset supporting AI-driven engagement recognition. It was collected from eleven older adults participating in weekly virtual group learning sessions over six weeks as part of cardiac rehabilitation, producing over 35 hours of data, making it the largest dataset of its kind. To protect privacy, raw video is withheld; instead, the released data include facial, hand, and body joint landmarks, along with affective and behavioral features extracted from video. Annotations include binary engagement states, affective and behavioral labels, and context-type indicators, such as whether the instructor addressed the group or an individual. The dataset offers versions with 5-, 10-, 30-second, and variable-length samples. To demonstrate utility, multiple machine learning and deep learning models were trained, achieving engagement recognition accuracy of up to 81 percent. OPEN provides a scalable foundation for personalized engagement modeling in aging populations and contributes to broader engagement recognition research.
中文: 本文推出OPEN数据集,这是针对老年群体虚拟学习场景的最大规模参与度数据库,通过AI技术实现高达81%的识别准确率,同时解决了隐私保护和长期参与追踪的难题。
English: This paper introduces the OPEN dataset, the largest collection of engagement data from older adults in virtual group learning, enabling AI-driven recognition with up to 81% accuracy while addressing privacy and longitudinal engagement challenges.
Authors:Neil He, Hiren Madhu, Ngoc Bui, Menglin Yang, Rex Ying
Abstract:
Foundation models pre-trained on massive datasets, including large language models (LLMs), vision-language models (VLMs), and large multimodal models, have demonstrated remarkable success in diverse downstream tasks. However, recent studies have shown fundamental limitations of these models: (1) limited representational capacity, (2) lower adaptability, and (3) diminishing scalability. These shortcomings raise a critical question: is Euclidean geometry truly the optimal inductive bias for all foundation models, or could incorporating alternative geometric spaces enable models to better align with the intrinsic structure of real-world data and improve reasoning processes? Hyperbolic spaces, a class of non-Euclidean manifolds characterized by exponential volume growth with respect to distance, offer a mathematically grounded solution. These spaces enable low-distortion embeddings of hierarchical structures (e.g., trees, taxonomies) and power-law distributions with substantially fewer dimensions compared to Euclidean counterparts. Recent advances have leveraged these properties to enhance foundation models, including improving LLMs' complex reasoning ability, VLMs' zero-shot generalization, and cross-modal semantic alignment, while maintaining parameter efficiency. This paper provides a comprehensive review of hyperbolic neural networks and their recent development for foundation models. We further outline key challenges and research directions to advance the field.
中文: 基础模型在表征能力和可扩展性方面存在局限,因此研究转向双曲空间这一非欧几何方法,以更高效地嵌入层次结构并提升推理能力,同时减少维度需求。
English: Foundation models like LLMs and VLMs face limitations in representational capacity and scalability, prompting exploration of hyperbolic spaces as a non-Euclidean alternative to better embed hierarchical structures and improve reasoning with fewer dimensions.
Authors:Malsha Ashani Mahawatta Dona, Beatriz Cabrero-Daniel, Yinan Yu, Christian Berger
Abstract:
Large language models (LLMs) are growingly extended to process multimodal data such as text and video simultaneously. Their remarkable performance in understanding what is shown in images is surpassing specialized neural networks (NNs) such as Yolo that is supporting only a well-formed but very limited vocabulary, ie., objects that they are able to detect. When being non-restricted, LLMs and in particular state-of-the-art vision language models (VLMs) show impressive performance to describe even complex traffic situations. This is making them potentially suitable components for automotive perception systems to support the understanding of complex traffic situations or edge case situation. However, LLMs and VLMs are prone to hallucination, which mean to either potentially not seeing traffic agents such as vulnerable road users who are present in a situation, or to seeing traffic agents who are not there in reality. While the latter is unwanted making an ADAS or autonomous driving systems (ADS) to unnecessarily slow down, the former could lead to disastrous decisions from an ADS. In our work, we are systematically assessing the performance of 3 state-of-the-art VLMs on a diverse subset of traffic situations sampled from the Waymo Open Dataset to support safety guardrails for capturing such hallucinations in VLM-supported perception systems. We observe that both, proprietary and open VLMs exhibit remarkable image understanding capabilities even paying thorough attention to fine details sometimes difficult to spot for us humans. However, they are also still prone to making up elements in their descriptions to date requiring hallucination detection strategies such as BetterCheck that we propose in our work.
中文: 大语言模型和视觉语言模型在理解复杂交通场景方面表现出色,但容易产生幻觉,可能漏掉真实元素或虚构不存在的事物,因此需要像我们提出的BetterCheck这样的安全策略来检测这些幻觉。
English: Large language models (LLMs) and vision language models (VLMs) demonstrate strong capabilities in understanding complex traffic scenes but are prone to hallucinations, either missing actual elements or inventing non-existent ones, which necessitates safety measures like the proposed BetterCheck strategy.
Authors:Zhao Song, Song Yue, Jiahao Zhang
Abstract:
Large Reasoning Models (LRMs) have become a central focus in today's large language model (LLM) research, where models are designed to output a step-by-step thinking process before arriving at a final answer to handle complex reasoning tasks. Despite their promise, recent empirical studies (e.g., [Shojaee et al., 2025] from Apple) suggest that this thinking process may not actually enhance reasoning ability, where LLMs without explicit reasoning actually outperform LRMs on tasks with low or high complexity. In this work, we revisit these findings and investigate whether the limitations of LRMs persist when tool augmentations are introduced. We incorporate two types of tools, Python interpreters and scratchpads, and evaluate three representative LLMs and their LRM counterparts on Apple's benchmark reasoning puzzles. Our results show that, with proper tool use, LRMs consistently outperform their non-reasoning counterparts across all levels of task complexity. These findings challenge the recent narrative that reasoning is an illusion and highlight the potential of tool-augmented LRMs for solving complex problems.
中文: 通过集成Python解释器和草稿纸等工具增强的大型推理模型(LRM)在所有任务复杂度下均优于非推理大语言模型,反驳了“推理只是假象”的观点。
English: Large Reasoning Models (LRMs) with tool augmentation, such as Python interpreters and scratchpads, consistently outperform non-reasoning LLMs across all task complexities, challenging the notion that reasoning processes are illusory.
Authors:Jaiprakash Nagar, Zheng Chen, Marios Kountouris, Photios A. Stavrou
Abstract:
This paper addresses decentralized stochastic gradient descent (D-SGD) over resource-constrained networks by introducing node-based and link-based scheduling strategies to enhance communication efficiency. In each iteration of the D-SGD algorithm, only a few disjoint subsets of nodes or links are randomly activated, subject to a given communication cost constraint. We propose a novel importance metric based on information entropy to determine node and link scheduling probabilities. We validate the effectiveness of our approach through extensive simulations, comparing it against state-of-the-art methods, including betweenness centrality (BC) for node scheduling and \textit{MATCHA} for link scheduling. The results show that our method consistently outperforms the BC-based method in the node scheduling case, achieving faster convergence with up to 60\% lower communication budgets. At higher communication budgets (above 60\%), our method maintains comparable or superior performance. In the link scheduling case, our method delivers results that are superior to or on par with those of \textit{MATCHA}.
中文: 本文提出基于信息熵的节点和链路调度策略,以提升去中心化随机梯度下降的通信效率,在降低高达60%通信成本的同时实现更快收敛,并优于现有方法。
English: This paper introduces node-based and link-based scheduling strategies using information entropy to enhance communication efficiency in decentralized stochastic gradient descent, achieving faster convergence with up to 60% lower communication costs while outperforming existing methods.
Authors:Giuseppe Serra, Photios A. Stavrou, Marios Kountouris
Abstract:
In this paper, we investigate the problem of distributionally robust source coding, i.e., source coding under uncertainty in the source distribution, discussing both the coding and computational aspects of the problem. We propose two extensions of the so-called Strong Functional Representation Lemma (SFRL), considering the cases where, for a fixed conditional distribution, the marginal inducing the joint coupling belongs to either a finite set of distributions or a Kullback-Leibler divergence sphere (KL-Sphere) centered at a fixed nominal distribution. Using these extensions, we derive distributionally robust coding schemes for both the one-shot and asymptotic regimes, generalizing previous results in the literature. Focusing on the case where the source distribution belongs to a given KL-Sphere, we derive an implicit characterization of the points attaining the robust rate-distortion function (R-RDF), which we later exploit to implement a novel algorithm for computing the R-RDF. Finally, we characterize the analytical expression of the R-RDF for Bernoulli sources, providing a theoretical benchmark to evaluate the estimation performance of the proposed algorithm.
中文: 本文扩展了强函数表示引理,针对分布不确定的源编码问题提出了鲁棒编码方案,不仅推导了理论上的鲁棒率失真函数,还设计了计算该函数的创新算法并以伯努利信源为例进行了验证。
English: This paper extends the Strong Functional Representation Lemma to develop distributionally robust source coding schemes for uncertain source distributions, deriving both theoretical bounds and computational algorithms including a novel method for calculating the robust rate-distortion function.
Authors:Yu Chen, Shaoyuan Li, Xiang Yin
Abstract:
In this paper, we revisit the formal verification problem for stochastic dynamical systems over finite horizon using barrier certificates. Most existing work on this topic focuses on safety properties by constructing barrier certificates based on the notion of $c$-martingales. In this work, we first provide a new insight into the conditions of existing martingale-based barrier certificates from the perspective of dynamic programming operators. Specifically, we show that the existing conditions essentially provide a bound on the dynamic programming solution, which exactly characterizes the safety probability. Based on this new perspective, we demonstrate that the barrier conditions in existing approaches are unnecessarily conservative over unsafe states. To address this, we propose a new set of safety barrier certificate conditions that are strictly less conservative than existing ones, thereby providing tighter probability bounds for safety verification. We further extend our approach to the case of reach-avoid specifications by providing a set of new barrier certificate conditions. We also illustrate how to search for these new barrier certificates using sum-of-squares (SOS) programming. Finally, we use two numerical examples to demonstrate the advantages of our method compared to existing approaches.
本文针对随机动力系统的安全验证问题,通过动态规划视角揭示了现有屏障证书条件的保守性,提出了一套能提供更紧概率边界的创新屏障条件,并采用平方和规划实现高效求解。
This paper introduces a novel perspective on barrier certificates for stochastic systems, revealing that existing conditions are overly conservative and proposing new, less restrictive conditions that yield tighter safety probability bounds through SOS programming.
Authors:Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen, Yu-Chiang Frank Wang, Fu-En Yang
Abstract:
Vision-language-action (VLA) reasoning tasks require agents to interpret multimodal instructions, perform long-horizon planning, and act adaptively in dynamic environments. Existing approaches typically train VLA models in an end-to-end fashion, directly mapping inputs to actions without explicit reasoning, which hinders their ability to plan over multiple steps or adapt to complex task variations. In this paper, we propose ThinkAct, a dual-system framework that bridges high-level reasoning with low-level action execution via reinforced visual latent planning. ThinkAct trains a multimodal LLM to generate embodied reasoning plans guided by reinforcing action-aligned visual rewards based on goal completion and trajectory consistency. These reasoning plans are compressed into a visual plan latent that conditions a downstream action model for robust action execution on target environments. Extensive experiments on embodied reasoning and robot manipulation benchmarks demonstrate that ThinkAct enables few-shot adaptation, long-horizon planning, and self-correction behaviors in complex embodied AI tasks.
Chinese: ThinkAct提出了一种双系统框架,通过强化视觉潜在规划将高层推理与低层动作执行相结合,在复杂具身AI任务中实现了少样本适应、长程规划和自我纠正行为。
English: ThinkAct introduces a dual-system framework that integrates high-level reasoning with low-level action execution through reinforced visual latent planning, enabling few-shot adaptation, long-horizon planning, and self-correction in complex embodied AI tasks.
Authors:Wenbo Xu, Junyan Wu, Wei Lu, Xiangyang Luo, Qian Wang
Abstract:
Current researches on Deepfake forensics often treat detection as a classification task or temporal forgery localization problem, which are usually restrictive, time-consuming, and challenging to scale for large datasets. To resolve these issues, we present a multimodal deviation perceiving framework for weakly-supervised temporal forgery localization (MDP), which aims to identify temporal partial forged segments using only video-level annotations. The MDP proposes a novel multimodal interaction mechanism (MI) and an extensible deviation perceiving loss to perceive multimodal deviation, which achieves the refined start and end timestamps localization of forged segments. Specifically, MI introduces a temporal property preserving cross-modal attention to measure the relevance between the visual and audio modalities in the probabilistic embedding space. It could identify the inter-modality deviation and construct comprehensive video features for temporal forgery localization. To explore further temporal deviation for weakly-supervised learning, an extensible deviation perceiving loss has been proposed, aiming at enlarging the deviation of adjacent segments of the forged samples and reducing that of genuine samples. Extensive experiments demonstrate the effectiveness of the proposed framework and achieve comparable results to fully-supervised approaches in several evaluation metrics.
中文摘要:本文提出了一种用于弱监督时序伪造定位的多模态偏差感知框架(MDP),通过创新的多模态交互机制和可扩展偏差感知损失,仅使用视频级标注即可精确定位伪造片段,在多项评估指标中达到与全监督方法相当的性能。
English Summary: This paper introduces a multimodal deviation perceiving framework (MDP) for weakly-supervised temporal forgery localization, which uses novel multimodal interaction and deviation loss mechanisms to precisely identify forged video segments with only video-level annotations, achieving performance comparable to fully-supervised methods.
Authors:Riccardo Bussola, Michele Focchi, Giulio Turrisi, Claudio Semini, Luigi Palopoli
Abstract:
Jumping poses a significant challenge for quadruped robots, despite being crucial for many operational scenarios. While optimisation methods exist for controlling such motions, they are often time-consuming and demand extensive knowledge of robot and terrain parameters, making them less robust in real-world scenarios. Reinforcement learning (RL) is emerging as a viable alternative, yet conventional end-to-end approaches lack efficiency in terms of sample complexity, requiring extensive training in simulations, and predictability of the final motion, which makes it difficult to certify the safety of the final motion. To overcome these limitations, this paper introduces a novel guided reinforcement learning approach that leverages physical intuition for efficient and explainable jumping, by combining Bézier curves with a Uniformly Accelerated Rectilinear Motion (UARM) model. Extensive simulation and experimental results clearly demonstrate the advantages of our approach over existing alternatives.
中文: 本文提出了一种引导式强化学习方法,通过结合贝塞尔曲线与匀加速直线运动模型,实现了四足机器人高效且可解释的跳跃运动,有效克服了传统优化方法和端到端强化学习的局限性。
English: This paper presents a guided reinforcement learning method that integrates Bézier curves with a Uniformly Accelerated Rectilinear Motion model to enable efficient and explainable jumping motions for quadruped robots, overcoming limitations of traditional optimization and end-to-end RL approaches.
Authors:Riccardo Bussola, Michele Focchi, Giulio Turrisi, Claudio Semini, Luigi Palopoli
Abstract:
Jumping poses a significant challenge for quadruped robots, despite being crucial for many operational scenarios. While optimisation methods exist for controlling such motions, they are often time-consuming and demand extensive knowledge of robot and terrain parameters, making them less robust in real-world scenarios. Reinforcement learning (RL) is emerging as a viable alternative, yet conventional end-to-end approaches lack efficiency in terms of sample complexity, requiring extensive training in simulations, and predictability of the final motion, which makes it difficult to certify the safety of the final motion. To overcome these limitations, this paper introduces a novel guided reinforcement learning approach that leverages physical intuition for efficient and explainable jumping, by combining Bézier curves with a Uniformly Accelerated Rectilinear Motion (UARM) model. Extensive simulation and experimental results clearly demonstrate the advantages of our approach over existing alternatives.
中文: 本文提出了一种引导式强化学习方法,通过结合贝塞尔曲线与匀加速直线运动模型,实现了四足机器人高效且可解释的跳跃运动,有效克服了传统优化方法和端到端强化学习的局限性。
English: This paper presents a guided reinforcement learning method that integrates Bézier curves with a Uniformly Accelerated Rectilinear Motion model to enable efficient and explainable jumping motions for quadruped robots, overcoming limitations of traditional optimization and end-to-end RL approaches.
Authors:Ardhendu Sekhar, Vasu Soni, Keshav Aske, Garima Jain, Pranav Jeevan, Amit Sethi
Abstract:
We introduce a modular framework for predicting cancer-specific survival from whole slide pathology images (WSIs) that significantly improves upon the state-of-the-art accuracy. Our method integrating four key components. Firstly, to tackle large size of WSIs, we use dynamic patch selection via quantile-based thresholding for isolating prognostically informative tissue regions. Secondly, we use graph-guided k-means clustering to capture phenotype-level heterogeneity through spatial and morphological coherence. Thirdly, we use attention mechanisms that model both intra- and inter-cluster relationships to contextualize local features within global spatial relations between various types of tissue compartments. Finally, we use an expert-guided mixture density modeling for estimating complex survival distributions using Gaussian mixture models. The proposed model achieves a concordance index of $0.712 \pm 0.028$ and Brier score of $0.254 \pm 0.018$ on TCGA-KIRC (renal cancer), and a concordance index of $0.645 \pm 0.017$ and Brier score of $0.281 \pm 0.031$ on TCGA-LUAD (lung adenocarcinoma). These results are significantly better than the state-of-art and demonstrate predictive potential of the proposed method across diverse cancer types.
中文: 本研究提出一种模块化框架,通过整合动态切片选择、图引导聚类、注意力机制和混合密度建模,显著提升了基于全切片病理图像的癌症特异性生存预测精度,在肾癌和肺腺癌数据上均优于现有最佳方法。
English: This study presents a modular framework that enhances cancer-specific survival prediction from whole slide images by integrating dynamic patch selection, graph-guided clustering, attention mechanisms, and mixture density modeling, achieving superior accuracy across renal cancer and lung adenocarcinoma datasets.
Authors:Xu Yang, Qi Zhang, Shuming Jiang, Yaowen Xu, Zhaofan Zou, Hao Sun, Xuelong Li
Abstract:
With the rapid advancement of generative AI, synthetic content across images, videos, and audio has become increasingly realistic, amplifying the risk of misinformation. Existing detection approaches predominantly focus on binary classification while lacking detailed and interpretable explanations of forgeries, which limits their applicability in safety-critical scenarios. Moreover, current methods often treat each modality separately, without a unified benchmark for cross-modal forgery detection and interpretation. To address these challenges, we introduce METER, a unified, multi-modal benchmark for interpretable forgery detection spanning images, videos, audio, and audio-visual content. Our dataset comprises four tracks, each requiring not only real-vs-fake classification but also evidence-chain-based explanations, including spatio-temporal localization, textual rationales, and forgery type tracing. Compared to prior benchmarks, METER offers broader modality coverage and richer interpretability metrics such as spatial/temporal IoU, multi-class tracing, and evidence consistency. We further propose a human-aligned, three-stage Chain-of-Thought (CoT) training strategy combining SFT, DPO, and a novel GRPO stage that integrates a human-aligned evaluator with CoT reasoning. We hope METER will serve as a standardized foundation for advancing generalizable and interpretable forgery detection in the era of generative media.
Chinese Summary: 针对现有伪造检测方法缺乏跨模态统一标准和可解释性的问题,我们提出了METER基准,通过涵盖图像、视频和音频的多模态数据集及证据链解释机制,结合人类对齐的三阶段训练策略,推动可泛化伪造检测技术的发展。
English Summary: The METER benchmark is introduced to address the limitations of current forgery detection methods by providing a unified, multi-modal framework that includes interpretable evidence chains across images, videos, and audio, supported by a human-aligned training strategy.
Authors:Jingyi Zheng, Zifan Peng, Yule Liu, Junfeng Wang, Yifan Liao, Wenhan Dong, Xinlei He
Abstract:
Smart contracts are trustworthy, immutable, and automatically executed programs on the blockchain. Their execution requires the Gas mechanism to ensure efficiency and fairness. However, due to non-optimal coding practices, many contracts contain Gas waste patterns that need to be optimized. Existing solutions mostly rely on manual discovery, which is inefficient, costly to maintain, and difficult to scale. Recent research uses large language models (LLMs) to explore new Gas waste patterns. However, it struggles to remain compatible with existing patterns, often produces redundant patterns, and requires manual validation/rewriting. To address this gap, we present GasAgent, the first multi-agent system for smart contract Gas optimization that combines compatibility with existing patterns and automated discovery/validation of new patterns, enabling end-to-end optimization. GasAgent consists of four specialized agents, Seeker, Innovator, Executor, and Manager, that collaborate in a closed loop to identify, validate, and apply Gas-saving improvements. Experiments on 100 verified real-world contracts demonstrate that GasAgent successfully optimizes 82 contracts, achieving an average deployment Gas savings of 9.97%. In addition, our evaluation confirms its compatibility with existing tools and validates the effectiveness of each module through ablation studies. To assess broader usability, we further evaluate 500 contracts generated by five representative LLMs across 10 categories and find that GasAgent optimizes 79.8% of them, with deployment Gas savings ranging from 4.79% to 13.93%, showing its usability as the optimization layer for LLM-assisted smart contract development.
中文:GasAgent是一个多代理系统,通过兼容现有模式并自动发现新Gas浪费模式,实现了智能合约Gas优化的端到端自动化,在真实合约和LLM生成合约中均取得显著节能效果。
English: GasAgent is a multi-agent system that automates the optimization of smart contract Gas usage by combining compatibility with existing patterns and discovering new ones, achieving significant savings in real-world and LLM-generated contracts.
Authors:Haoxiong Ren, Yangu He, Kwunhang Wong, Rui Bao, Ning Lin, Zhongrui Wang, Dashan Shang
Abstract:
Spiking Neural Networks (SNNs) are increasingly favored for deployment on resource-constrained edge devices due to their energy-efficient and event-driven processing capabilities. However, training SNNs remains challenging because of the computational intensity of traditional backpropagation algorithms adapted for spike-based systems. In this paper, we propose a novel software-hardware co-design that introduces a hardware-friendly training algorithm, Spiking Direct Feedback Alignment (SDFA) and implement it on a Resistive Random Access Memory (RRAM)-based In-Memory Computing (IMC) architecture, referred to as PipeSDFA, to accelerate SNN training. Software-wise, the computational complexity of SNN training is reduced by the SDFA through the elimination of sequential error propagation. Hardware-wise, a three-level pipelined dataflow is designed based on IMC architecture to parallelize the training process. Experimental results demonstrate that the PipeSDFA training accelerator incurs less than 2% accuracy loss on five datasets compared to baselines, while achieving 1.1X~10.5X and 1.37X~2.1X reductions in training time and energy consumption, respectively compared to PipeLayer.
Chinese: 本文提出PipeSDFA软硬件协同设计方案,采用脉冲直接反馈对齐算法和基于RRAM的内存计算架构,在保证精度的同时显著降低了脉冲神经网络训练时间和能耗。
English: This paper introduces PipeSDFA, a software-hardware co-design that employs the Spiking Direct Feedback Alignment algorithm and RRAM-based in-memory computing to efficiently accelerate SNN training with minimal accuracy loss and significant reductions in training time and energy consumption.
Authors:Xiaoyi Bao, Chenwei Xie, Hao Tang, Tingyu Weng, Xiaofeng Wang, Yun Zheng, Xingang Wang
Abstract:
In recent years, the introduction of Multi-modal Large Language Models (MLLMs) into video understanding tasks has become increasingly prevalent. However, how to effectively integrate temporal information remains a critical research focus. Traditional approaches treat spatial and temporal information separately. Due to issues like motion blur, it is challenging to accurately represent the spatial information of rapidly moving objects. This can lead to temporally important regions being underemphasized during spatial feature extraction, which in turn hinders accurate spatio-temporal interaction and video understanding. To address this limitation, we propose an innovative video representation method called Dynamic-Image (DynImg). Specifically, we introduce a set of non-key frames as temporal prompts to highlight the spatial areas containing fast-moving objects. During the process of visual feature extraction, these prompts guide the model to pay additional attention to the fine-grained spatial features corresponding to these regions. Moreover, to maintain the correct sequence for DynImg, we employ a corresponding 4D video Rotary Position Embedding. This retains both the temporal and spatial adjacency of DynImg, helping MLLM understand the spatio-temporal order within this combined format. Experimental evaluations reveal that DynImg surpasses the state-of-the-art methods by approximately 2% across multiple video understanding benchmarks, proving the effectiveness of our temporal prompts in enhancing video comprehension.
中文摘要:提出的动态图像方法通过引入时序提示来突出快速移动物体的空间特征,并采用四维位置编码保持时空关系,在多个视频理解基准上超越现有方法约2%,有效提升了视频理解能力。
English Summary: The proposed Dynamic-Image (DynImg) method enhances video understanding by using temporal prompts to highlight fast-moving objects' spatial features and employing 4D position embedding to preserve spatiotemporal relationships, achieving a 2% performance improvement over existing methods.
Authors:Bingqing Zhang, Zhuo Cao, Heming Du, Yang Li, Xue Li, Jiajun Liu, Sen Wang
Abstract:
Despite recent advances, Text-to-video retrieval (TVR) is still hindered by multiple inherent uncertainties, such as ambiguous textual queries, indistinct text-video mappings, and low-quality video frames. Although interactive systems have emerged to address these challenges by refining user intent through clarifying questions, current methods typically rely on heuristic or ad-hoc strategies without explicitly quantifying these uncertainties, limiting their effectiveness. Motivated by this gap, we propose UMIVR, an Uncertainty-Minimizing Interactive Text-to-Video Retrieval framework that explicitly quantifies three critical uncertainties-text ambiguity, mapping uncertainty, and frame uncertainty-via principled, training-free metrics: semantic entropy-based Text Ambiguity Score (TAS), Jensen-Shannon divergence-based Mapping Uncertainty Score (MUS), and a Temporal Quality-based Frame Sampler (TQFS). By adaptively generating targeted clarifying questions guided by these uncertainty measures, UMIVR iteratively refines user queries, significantly reducing retrieval ambiguity. Extensive experiments on multiple benchmarks validate UMIVR's effectiveness, achieving notable gains in Recall@1 (69.2\% after 10 interactive rounds) on the MSR-VTT-1k dataset, thereby establishing an uncertainty-minimizing foundation for interactive TVR.
中文摘要:提出的UMIVR框架通过量化文本歧义、映射不确定性和帧质量的新型指标,自适应生成澄清问题来解决文本到视频检索中的不确定性,显著提升了检索性能。
English Summary: The proposed UMIVR framework addresses uncertainties in text-to-video retrieval by quantifying text ambiguity, mapping uncertainty, and frame quality through novel metrics, adaptively generating clarifying questions to significantly improve retrieval performance.
Authors:Xiaojie Li, Ronghui Li, Shukai Fang, Shuzhao Xie, Xiaoyang Guo, Jiaqing Zhou, Junkun Peng, Zhi Wang
Abstract:
Well-coordinated, music-aligned holistic dance enhances emotional expressiveness and audience engagement. However, generating such dances remains challenging due to the scarcity of holistic 3D dance datasets, the difficulty of achieving cross-modal alignment between music and dance, and the complexity of modeling interdependent motion across the body, hands, and face. To address these challenges, we introduce SoulDance, a high-precision music-dance paired dataset captured via professional motion capture systems, featuring meticulously annotated holistic dance movements. Building on this dataset, we propose SoulNet, a framework designed to generate music-aligned, kinematically coordinated holistic dance sequences. SoulNet consists of three principal components: (1) Hierarchical Residual Vector Quantization, which models complex, fine-grained motion dependencies across the body, hands, and face; (2) Music-Aligned Generative Model, which composes these hierarchical motion units into expressive and coordinated holistic dance; (3) Music-Motion Retrieval Module, a pre-trained cross-modal model that functions as a music-dance alignment prior, ensuring temporal synchronization and semantic coherence between generated dance and input music throughout the generation process. Extensive experiments demonstrate that SoulNet significantly surpasses existing approaches in generating high-quality, music-coordinated, and well-aligned holistic 3D dance sequences.
中文摘要:SoulDance数据集和SoulNet框架通过提供高精度动作捕捉数据及分层模型,解决了生成与音乐对齐的整体舞蹈的难题,能协调身体、手部和面部的运动与音乐同步。
English Summary: The SoulDance dataset and SoulNet framework address the challenges of generating holistic, music-aligned dance by providing high-precision motion capture data and a hierarchical model that coordinates body, hand, and facial movements with music.
Authors:Ole-Christoffer Granmo, Youmna Abdelwahab, Per-Arne Andersen, Paul F. A. Clarke, Kunal Dumbre, Ylva Grønninsæter, Vojtech Halenka, Runar Helin, Lei Jiao, Ahmed Khalid, Rebekka Omslandseter, Rupsa Saha, Mayur Shende, Xuan Zhang
Abstract:
Pattern recognition with concise and flat AND-rules makes the Tsetlin Machine (TM) both interpretable and efficient, while the power of Tsetlin automata enables accuracy comparable to deep learning on an increasing number of datasets. We introduce the Graph Tsetlin Machine (GraphTM) for learning interpretable deep clauses from graph-structured input. Moving beyond flat, fixed-length input, the GraphTM gets more versatile, supporting sequences, grids, relations, and multimodality. Through message passing, the GraphTM builds nested deep clauses to recognize sub-graph patterns with exponentially fewer clauses, increasing both interpretability and data utilization. For image classification, GraphTM preserves interpretability and achieves 3.86%-points higher accuracy on CIFAR-10 than a convolutional TM. For tracking action coreference, faced with increasingly challenging tasks, GraphTM outperforms other reinforcement learning methods by up to 20.6%-points. In recommendation systems, it tolerates increasing noise to a greater extent than a Graph Convolutional Neural Network (GCN), e.g., for noise ratio 0.1, GraphTM obtains accuracy 89.86% compared to GCN's 70.87%. Finally, for viral genome sequence data, GraphTM is competitive with BiLSTM-CNN and GCN accuracy-wise, training 2.5x faster than GCN. The GraphTM's application to these varied fields demonstrates how graph representation learning and deep clauses bring new possibilities for TM learning.
Chinese: 图Tsetlin机(GraphTM)将Tsetlin机扩展至图结构数据,通过消息传递构建可解释的深层子句,在图像分类和推荐系统等应用中显著提升了准确性、效率及抗噪能力。
English: The Graph Tsetlin Machine (GraphTM) extends the Tsetlin Machine to graph-structured data, using message passing to create interpretable deep clauses that enhance accuracy, efficiency, and noise tolerance across diverse applications like image classification and recommendation systems.
Authors:Dong Shu, Haoyang Yuan, Yuchen Wang, Yanguang Liu, Huopu Zhang, Haiyan Zhao, Mengnan Du
Abstract:
Large vision-language models (LVLMs) have made significant progress in chart understanding. However, financial charts, characterized by complex temporal structures and domain-specific terminology, remain notably underexplored. We introduce FinChart-Bench, the first benchmark specifically focused on real-world financial charts. FinChart-Bench comprises 1,200 financial chart images collected from 2015 to 2024, each annotated with True/False (TF), Multiple Choice (MC), and Question Answering (QA) questions, totaling 7,016 questions. We conduct a comprehensive evaluation of 25 state-of-the-art LVLMs on FinChart-Bench. Our evaluation reveals critical insights: (1) the performance gap between open-source and closed-source models is narrowing, (2) performance degradation occurs in upgraded models within families, (3) many models struggle with instruction following, (4) both advanced models show significant limitations in spatial reasoning abilities, and (5) current LVLMs are not reliable enough to serve as automated evaluators. These findings highlight important limitations in current LVLM capabilities for financial chart understanding. The FinChart-Bench dataset is available at https://huggingface.co/datasets/Tizzzzy/FinChart-Bench.
中文摘要:FinChart-Bench是首个针对真实世界金融图表评估大型视觉语言模型的基准测试,尽管开源与闭源模型性能差距正在缩小,但研究揭示了当前模型在金融图表理解能力方面仍存在显著局限性。
English Summary: FinChart-Bench is the first comprehensive benchmark for evaluating large vision-language models on real-world financial charts, revealing significant limitations in their chart understanding capabilities despite narrowing performance gaps between open-source and closed-source models.
Authors:Junliang Luo, Katrin Tinn, Samuel Ferreira Duran, Di Wu, Xue Liu
Abstract:
Tokenized U.S. Treasuries have emerged as a prominent subclass of real-world assets (RWAs), offering cryptographically enforced, yield-bearing instruments collateralized by sovereign debt and deployed across multiple blockchain networks. While the market has expanded rapidly, empirical analyses of transaction-level behaviour remain limited. This paper conducts a quantitative, function-level dissection of U.S. Treasury-backed RWA tokens including BUIDL, BENJI, and USDY, across multi-chain: mostly Ethereum and Layer-2s. We analyze decoded contract calls to isolate core functional primitives such as issuance, redemption, transfer, and bridge activity, revealing segmentation in behaviour between institutional actors and retail users. To model address-level economic roles, we introduce a curvature-aware representation learning framework using Poincaré embeddings and liquidity-based graph features. Our method outperforms baseline models on our RWA Treasury dataset in role inference and generalizes to downstream tasks such as anomaly detection and wallet classification in broader blockchain transaction networks. These findings provide a structured understanding of functional heterogeneity and participant roles in tokenized Treasury in a transaction-level perspective, contributing new empirical evidence to the study of on-chain financialization.
中文摘要:本文对代币化美国国债交易进行定量分析,通过功能层面研究揭示了机构与零售用户的行为差异,并引入新型表征学习框架,在角色推断和下游任务中优于基线模型。
English Summary: This paper provides a quantitative analysis of tokenized U.S. Treasury transactions, revealing behavioral segmentation between institutional and retail users through function-level examination and introducing a novel representation learning framework that outperforms baseline models in role inference and downstream tasks.
Authors:Van-Hoang Le, Duc-Vu Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen
Abstract:
Large Language Models (LLMs) face significant challenges in specialized domains like law, where precision and domain-specific knowledge are critical. This paper presents a streamlined two-stage framework consisting of Retrieval and Re-ranking to enhance legal document retrieval efficiency and accuracy. Our approach employs a fine-tuned Bi-Encoder for rapid candidate retrieval, followed by a Cross-Encoder for precise re-ranking, both optimized through strategic negative example mining. Key innovations include the introduction of the Exist@m metric to evaluate retrieval effectiveness and the use of semi-hard negatives to mitigate training bias, which significantly improved re-ranking performance. Evaluated on the SoICT Hackathon 2024 for Legal Document Retrieval, our team, 4Huiter, achieved a top-three position. While top-performing teams employed ensemble models and iterative self-training on large bge-m3 architectures, our lightweight, single-pass approach offered a competitive alternative with far fewer parameters. The framework demonstrates that optimized data processing, tailored loss functions, and balanced negative sampling are pivotal for building robust retrieval-augmented systems in legal contexts.
中文摘要:本文提出了一种简化的两阶段检索与重排框架,通过优化负采样和定制评估指标,以轻量级架构显著提升了法律文档检索的效能。
English Summary: This paper introduces a streamlined two-stage retrieval and re-ranking framework that enhances legal document retrieval through optimized negative sampling and tailored metrics, achieving competitive performance with a lightweight architecture.
Authors:Jiequan Cui, Beier Zhu, Qingshan Xu, Xiaogang Xu, Pengguang Chen, Xiaojuan Qi, Bei Yu, Hanwang Zhang, Richang Hong
Abstract:
In this paper, we formulate the knowledge distillation (KD) as a conditional generative problem and propose the \textit{Generative Distribution Distillation (GenDD)} framework. A naive \textit{GenDD} baseline encounters two major challenges: the curse of high-dimensional optimization and the lack of semantic supervision from labels. To address these issues, we introduce a \textit{Split Tokenization} strategy, achieving stable and effective unsupervised KD. Additionally, we develop the \textit{Distribution Contraction} technique to integrate label supervision into the reconstruction objective. Our theoretical proof demonstrates that \textit{GenDD} with \textit{Distribution Contraction} serves as a gradient-level surrogate for multi-task learning, realizing efficient supervised training without explicit classification loss on multi-step sampling image representations. To evaluate the effectiveness of our method, we conduct experiments on balanced, imbalanced, and unlabeled data. Experimental results show that \textit{GenDD} performs competitively in the unsupervised setting, significantly surpassing KL baseline by \textbf{16.29\%} on ImageNet validation set. With label supervision, our ResNet-50 achieves \textbf{82.28\%} top-1 accuracy on ImageNet in 600 epochs training, establishing a new state-of-the-art.
中文: 本文提出生成式分布蒸馏框架,通过拆分标记化和分布收缩技术解决知识蒸馏中的优化难题,在ImageNet数据集上无论有无标签监督均实现了最优性能。
English: This paper introduces the Generative Distribution Distillation (GenDD) framework, which addresses challenges in knowledge distillation through Split Tokenization and Distribution Contraction techniques, achieving state-of-the-art performance on ImageNet in both supervised and unsupervised settings.
Authors:Federico Mason, Federico Chiariotti, Pietro Talli, Andrea Zanella
Abstract:
Goal-oriented Communication (GoC) is a new paradigm that plans data transmission to occur only when it is instrumental for the receiver to achieve a certain goal. This leads to the advantage of reducing the frequency of transmissions significantly while maintaining adherence to the receiver's objectives. However, GoC scheduling also opens a timing-based side channel that an eavesdropper can exploit to obtain information about the state of the system. This type of attack sidesteps even information-theoretic security, as it exploits the timing of updates rather than their content. In this work, we study such an eavesdropping attack against pull-based goal-oriented scheduling for remote monitoring and control of Markov processes. We provide a theoretical framework for defining the effectiveness of the attack and propose possible countermeasures, including two practical heuristics that provide a balance between the performance gains offered by GoC and the amount of leaked information. Our results show that, while a naive goal-oriented scheduler allows the eavesdropper to correctly guess the system state about 60% of the time, our heuristic defenses can halve the leakage with a marginal reduction of the benefits of goal-oriented approaches.
Chinese: 目标导向通信通过按需传输减少通信频率,但会暴露基于时序的侧信道,易受窃听攻击;本文提出启发式防御措施,能在几乎不影响性能的前提下将信息泄露减半。
English: Goal-oriented Communication reduces transmissions by aligning them with receiver goals but introduces a timing side channel vulnerable to eavesdropping, prompting the development of heuristic defenses that halve information leakage with minimal performance loss.
Authors:Lucas Heublein, Christian Wielenberg, Thorsten Nowak, Tobias Feigl, Christopher Mutschler, Felix Ott
Abstract:
Jamming devices disrupt signals from the global navigation satellite system (GNSS) and pose a significant threat by compromising the reliability of accurate positioning. Consequently, the detection and localization of these interference signals are essential to achieve situational awareness, mitigating their impact, and implementing effective counter-measures. Classical Angle of Arrival (AoA) methods exhibit reduced accuracy in multipath environments due to signal reflections and scattering, leading to localization errors. Additionally, AoA-based techniques demand substantial computational resources for array signal processing. In this paper, we propose a novel approach for detecting and classifying interference while estimating the distance, azimuth, and elevation of jamming sources. Our benchmark study evaluates 128 vision encoder and time-series models to identify the highest-performing methods for each task. We introduce an attention-based fusion framework that integrates in-phase and quadrature (IQ) samples with Fast Fourier Transform (FFT)-computed spectrograms while incorporating 22 AoA features to enhance localization accuracy. Furthermore, we present a novel dataset of moving jamming devices recorded in an indoor environment with dynamic multipath conditions and demonstrate superior performance compared to state-of-the-art methods.
Chinese: 本文提出了一种新颖的干扰检测、分类与定位方法,通过融合IQ样本、FFT频谱图和AoA特征,在动态多路径环境中展现出优于现有技术的性能,并提供了新的数据集和全面基准研究。
English: This paper introduces a novel method for detecting, classifying, and localizing jamming devices by integrating IQ samples, FFT spectrograms, and AoA features, demonstrating superior performance in dynamic multipath environments through a comprehensive benchmark and a new dataset.
Authors:Philip Wiese, Victor Kartsch, Marco Guermandi, Luca Benini
Abstract:
The widespread adoption of Internet of Things (IoT) technologies has significantly advanced environmental monitoring (EM) by enabling cost-effective and scalable sensing solutions. Concurrently, machine learning (ML) and artificial intelligence (AI) are introducing powerful tools for the efficient and accurate analysis of complex environmental data. However, current IoT platforms for environmental sensing are typically limited to a narrow set of sensors, preventing a comprehensive assessment of environmental conditions and lacking sufficient computational capabilities to support the deployment of advanced ML and AI algorithms on the edge. To overcome these limitations, we introduce a compact (17x38 mm2), multi-modal, MCU-based environmental IoT node integrating 11 sensors, including CO2 concentration, volatile organic compounds (VOCs), light intensity, UV radiation, pressure, temperature, humidity, visual sensing via an RGB camera, and precise geolocation through a GNSS module. It features GAP9, a parallel ultra-low-power system-on-chip, enabling real-time, energy-efficient edge processing of advanced ML models directly on-device. We implemented a YOLOv5-based occupancy detection pipeline (0.3 M parameters, 42 MOP per inference), demonstrating 42% energy savings over raw data streaming. Additionally, we present a smart indoor air quality (IAQ) monitoring setup that combines occupancy detection with adaptive sample rates, achieving operational times of up to 143 h on a single compact 600 mAh, 3.7 V battery. Our platform lays the groundwork for innovative applications such as predictive indoor IAQ, enabling efficient AI-driven on-edge forecasting for energy-efficient and autonomous, proactive pollution-mitigation control strategies
中文: 本研究开发了一种紧凑型多传感器物联网平台,集成11种环境传感器与超低功耗处理器,能够实时运行先进机器学习模型的边缘计算,通过人员检测和空气质量监测等应用展示了显著节能效果和延长电池续航能力,有效解决了现有系统传感器单一及边缘算力不足的局限性。
English: This study introduces a compact, multi-sensor IoT platform that integrates 11 environmental sensors and an ultra-low-power processor to enable real-time edge computing of advanced machine learning models, overcoming the limitations of current systems by demonstrating significant energy savings and extended battery life for applications like occupancy detection and air quality monitoring.
Authors:Maximilian Rokuss, Benjamin Hamm, Yannick Kirchhoff, Klaus Maier-Hein
Abstract:
We introduce the first publicly available breast MRI dataset with explicit left and right breast segmentation labels, encompassing more than 13,000 annotated cases. Alongside this dataset, we provide a robust deep-learning model trained for left-right breast segmentation. This work addresses a critical gap in breast MRI analysis and offers a valuable resource for the development of advanced tools in women's health. The dataset and trained model are publicly available at: www.github.com/MIC-DKFZ/BreastDivider
中文: 本研究首次公开了包含超过13,000例带左右乳腺分割标注的乳腺MRI数据集,并提供了经过训练的鲁棒深度学习模型,填补了乳腺MRI分析的关键空白,为女性健康工具开发提供了宝贵资源。
English: This study presents the first publicly available breast MRI dataset with over 13,000 annotated cases featuring explicit left and right breast segmentation labels, accompanied by a robust deep-learning model for segmentation, addressing a critical gap in breast MRI analysis and providing a valuable resource for women's health tool development.
Authors:Haoyang Li, Yuming Xu, Yiming Li, Hanmo Liu, Darian Li, Chen Jason Zhang, Lei Chen, Qing Li
Abstract:
Temporal link prediction in dynamic graphs is a critical task with applications in diverse domains such as social networks, recommendation systems, and e-commerce platforms. While existing Temporal Graph Neural Networks (T-GNNs) have achieved notable success by leveraging complex architectures to model temporal and structural dependencies, they often suffer from scalability and efficiency challenges due to high computational overhead. In this paper, we propose EAGLE, a lightweight framework that integrates short-term temporal recency and long-term global structural patterns. EAGLE consists of a time-aware module that aggregates information from a node's most recent neighbors to reflect its immediate preferences, and a structure-aware module that leverages temporal personalized PageRank to capture the influence of globally important nodes. To balance these attributes, EAGLE employs an adaptive weighting mechanism to dynamically adjust their contributions based on data characteristics. Also, EAGLE eliminates the need for complex multi-hop message passing or memory-intensive mechanisms, enabling significant improvements in efficiency. Extensive experiments on seven real-world temporal graphs demonstrate that EAGLE consistently achieves superior performance against state-of-the-art T-GNNs in both effectiveness and efficiency, delivering more than a 50x speedup over effective transformer-based T-GNNs.
中文: EAGLE框架通过自适应权重机制整合短期时间邻近性与长期全局结构模式,在提升动态图预测效果的同时实现超过50倍的加速,显著优于现有时序图神经网络。
English: The proposed EAGLE framework enhances temporal link prediction by efficiently combining short-term recency and long-term structural patterns through adaptive weighting, achieving superior performance and over 50x speedup compared to existing T-GNNs.
Authors:Ye Tian, Xiaoyuan Ren, Zihao Wang, Onat Gungor, Xiaofan Yu, Tajana Rosing
Abstract:
Rich and context-aware activity logs facilitate user behavior analysis and health monitoring, making them a key research focus in ubiquitous computing. The remarkable semantic understanding and generation capabilities of Large Language Models (LLMs) have recently created new opportunities for activity log generation. However, existing methods continue to exhibit notable limitations in terms of accuracy, efficiency, and semantic richness. To address these challenges, we propose DailyLLM. To the best of our knowledge, this is the first log generation and summarization system that comprehensively integrates contextual activity information across four dimensions: location, motion, environment, and physiology, using only sensors commonly available on smartphones and smartwatches. To achieve this, DailyLLM introduces a lightweight LLM-based framework that integrates structured prompting with efficient feature extraction to enable high-level activity understanding. Extensive experiments demonstrate that DailyLLM outperforms state-of-the-art (SOTA) log generation methods and can be efficiently deployed on personal computers and Raspberry Pi. Utilizing only a 1.5B-parameter LLM model, DailyLLM achieves a 17% improvement in log generation BERTScore precision compared to the 70B-parameter SOTA baseline, while delivering nearly 10x faster inference speed.
中文摘要:DailyLLM提出了一种轻量级框架,利用智能手机和智能手表的传感器,通过整合位置、运动、环境和生理四个维度的上下文数据来生成活动日志,在BERTScore精度上相比现有方法提升17%,推理速度提高近10倍。
English Summary: DailyLLM introduces a lightweight framework using smartphone and smartwatch sensors to generate activity logs by integrating contextual data across four dimensions, achieving superior accuracy and efficiency over existing methods with a 17% BERTScore improvement and 10x faster inference speed.
Authors:Tyler Loakman, William Thorne, Chenghua Lin
Abstract:
Humour, as a complex language form, is derived from myriad aspects of life. Whilst existing work on computational humour has focussed almost exclusively on short pun-based jokes, we investigate whether the ability of Large Language Models (LLMs) to explain humour depends on the particular form. We compare models' joke explanation abilities from simple puns to complex topical humour that requires esoteric knowledge of real-world entities and events. To this end, we curate a dataset of 600 jokes across 4 joke types and manually write high-quality explanations. These jokes include heterographic and homographic puns, contemporary internet humour, and topical jokes. Using this dataset, we compare the zero-shot abilities of a range of LLMs to accurately and comprehensively explain jokes of different types, identifying key research gaps in the task of humour explanation. We find that none of the tested models (including reasoning models) are capable of reliably generating adequate explanations of all joke types, further highlighting the narrow focus of most existing works on overly simple joke forms.
中文摘要:本研究评估了大语言模型解释多种幽默类型的能力,发现尽管技术有所进步,它们仍难以充分解释除简单双关语之外的复杂笑话类型。
English Summary: This study evaluates the ability of Large Language Models to explain diverse humor types, revealing their limitations in handling complex jokes beyond simple puns despite advancements.
Authors:Jiazheng Li, Hong Lu, Kaiyue Wen, Zaiwen Yang, Jiaxuan Gao, Hongzhou Lin, Yi Wu, Jingzhao Zhang
Abstract:
Reinforcement learning (RL) has become a key component in training large language reasoning models (LLMs). However, recent studies questions its effectiveness in improving multi-step reasoning-particularly on hard problems. To address this challenge, we propose a simple yet effective strategy via Question Augmentation: introduce partial solutions during training to reduce problem difficulty and provide more informative learning signals. Our method, QuestA, when applied during RL training on math reasoning tasks, not only improves pass@1 but also pass@k-particularly on problems where standard RL struggles to make progress. This enables continual improvement over strong open-source models such as DeepScaleR and OpenMath Nemotron, further enhancing their reasoning capabilities. We achieve new state-of-the-art results on math benchmarks using 1.5B-parameter models: 67.1% (+5.3%) on AIME24, 59.5% (+10.0%) on AIME25, and 35.5% (+4.0%) on HMMT25. Further, we provide theoretical explanations that QuestA improves sample efficiency, offering a practical and generalizable pathway for expanding reasoning capability through RL.
中文: 针对强化学习在提升多步推理能力方面的不足,研究者提出QuestA方法,通过问题增强策略在训练中引入部分解决方案以降低难度并强化学习信号,该方法在数学推理基准测试中取得最优结果,并从理论上解释了其提升样本效率的机制。
English: To address reinforcement learning's limitations in improving multi-step reasoning for challenging problems, the authors propose QuestA, a question augmentation strategy that incorporates partial solutions during training to reduce difficulty and enhance learning signals, achieving state-of-the-art results on math benchmarks and providing theoretical insights into improved sample efficiency.
Authors:Wenting Zhu, Chaozhuo Li, Qingpo Yang, Xi Zhang, Philip S. Yu
Abstract:
Information diffusion prediction (IDP) is a pivotal task for understanding how information propagates among users. Most existing methods commonly adhere to a conventional training-test paradigm, where models are pretrained on training data and then directly applied to test samples. However, the success of this paradigm hinges on the assumption that the data are independently and identically distributed, which often fails in practical social networks due to the inherent uncertainty and variability of user behavior. In the paper, we address the novel challenge of distribution shifts within IDP tasks and propose a robust test-time training (TTT)-based framework for multi-scale diffusion prediction, named T3MAL. The core idea is to flexibly adapt a trained model to accommodate the distribution of each test instance before making predictions via a self-supervised auxiliary task. Specifically, T3MAL introduces a BYOL-inspired self-supervised auxiliary network that shares a common feature extraction backbone with the primary diffusion prediction network to guide instance-specific adaptation during testing. Furthermore, T3MAL enables fast and accurate test-time adaptation by incorporating a novel meta-auxiliary learning scheme and a lightweight adaptor, which together provide better weight initialization for TTT and mitigate catastrophic forgetting. Extensive experiments on three public datasets demonstrate that T3MAL outperforms various state-of-the-art methods.
中文摘要:本文提出T3MAL框架,通过自监督辅助任务和元辅助学习方案,在测试时针对每个实例进行模型自适应,有效解决了信息传播预测中的分布偏移问题。
English Summary: This paper introduces T3MAL, a test-time training framework that addresses distribution shifts in information diffusion prediction by adapting models to individual test instances through self-supervised learning and meta-auxiliary techniques.
Authors:Potsawee Manakul, Woody Haosheng Gan, Michael J. Ryan, Ali Sartaz Khan, Warit Sirichotedumrong, Kunat Pipatanakul, William Held, Diyi Yang
Abstract:
Current speech evaluation suffers from two critical limitations: the need and difficulty of designing specialized systems targeting individual audio characteristics, and poor correlation between automatic evaluation methods and human preferences. This work presents a systematic study of Large Audio Model (LAM) as a Judge, AudioJudge, investigating whether it can provide a unified evaluation framework that addresses both challenges. We systematically explore AudioJudge across audio characteristic detection tasks, including pronunciation, speaking rate, speaker identification and speech quality, and system-level human preference simulation for automated benchmarking. We investigate different prompt engineering strategies, finding that audio concatenation combined with in-context learning significantly improves performance across both audio characteristic detection and human preference simulation tasks. We further introduce a multi-aspect ensemble AudioJudge to enable general-purpose multi-aspect audio evaluation. This method decomposes speech assessment into specialized judges for lexical content, speech quality, and paralinguistic features, achieving up to 0.91 Spearman correlation with human preferences on our system ranking benchmark. Robustness analysis reveals that while LAMs maintain strong performance under acoustic noise, they exhibit significant verbosity and positional biases that require careful mitigation.
中文:本研究提出AudioJudge框架,通过大型音频模型实现统一语音评估,采用多维度集成方法和改进的提示策略显著提升与人类偏好的相关性,但需解决模型冗长和位置偏差问题。
English: This study introduces AudioJudge, a unified evaluation framework using Large Audio Models to overcome limitations in speech assessment by achieving high correlation with human preferences through multi-aspect ensemble methods and improved prompt engineering, though it requires mitigation of verbosity and positional biases.
Authors:Ashkan Shakarami, Azade Farshad, Yousef Yeganeh, Lorenzo Nicole, Peter Schuffler, Stefano Ghidoni, Nassir Navab
Abstract:
We propose UTS, a unit-based tissue segmentation framework for histopathology that classifies each fixed-size 32 * 32 tile, rather than each pixel, as the segmentation unit. This approach reduces annotation effort and improves computational efficiency without compromising accuracy. To implement this approach, we introduce a Multi-Level Vision Transformer (L-ViT), which benefits the multi-level feature representation to capture both fine-grained morphology and global tissue context. Trained to segment breast tissue into three categories (infiltrating tumor, non-neoplastic stroma, and fat), UTS supports clinically relevant tasks such as tumor-stroma quantification and surgical margin assessment. Evaluated on 386,371 tiles from 459 H&E-stained regions, it outperforms U-Net variants and transformer-based baselines. Code and Dataset will be available at GitHub.
中文:UTS框架采用基于图块的分割方法和多层视觉变换器,有效将乳腺组织分为三类,在降低计算需求的同时,其性能优于现有模型。
English: The UTS framework introduces a tile-based segmentation method using a Multi-Level Vision Transformer to efficiently classify breast tissue into three categories, achieving superior performance over existing models with reduced computational demands.
Authors:Zhenmin Huang, Yusen Xie, Benshan Ma, Shaojie Shen, Jun Ma
Abstract:
Trajectory planning involving multi-agent interactions has been a long-standing challenge in the field of robotics, primarily burdened by the inherent yet intricate interactions among agents. While game-theoretic methods are widely acknowledged for their effectiveness in managing multi-agent interactions, significant impediments persist when it comes to accommodating the intentional uncertainties of agents. In the context of intentional uncertainties, the heavy computational burdens associated with existing game-theoretic methods are induced, leading to inefficiencies and poor scalability. In this paper, we propose a novel game-theoretic interactive trajectory planning method to effectively address the intentional uncertainties of agents, and it demonstrates both high efficiency and enhanced scalability. As the underpinning basis, we model the interactions between agents under intentional uncertainties as a general Bayesian game, and we show that its agent-form equivalence can be represented as a potential game under certain minor assumptions. The existence and attainability of the optimal interactive trajectories are illustrated, as the corresponding Bayesian Nash equilibrium can be attained by optimizing a unified optimization problem. Additionally, we present a distributed algorithm based on the dual consensus alternating direction method of multipliers (ADMM) tailored to the parallel solving of the problem, thereby significantly improving the scalability. The attendant outcomes from simulations and experiments demonstrate that the proposed method is effective across a range of scenarios characterized by general forms of intentional uncertainties. Its scalability surpasses that of existing centralized and decentralized baselines, allowing for real-time interactive trajectory planning in uncertain game settings.
中文摘要:本文提出了一种新颖的博弈论交互轨迹规划方法,通过将智能体间的意图不确定性建模为贝叶斯博弈,并采用基于ADMM的分布式算法,实现了高效且可扩展的实时轨迹规划。
English Summary: This paper introduces a novel game-theoretic trajectory planning method that effectively handles multi-agent intentional uncertainties by modeling them as a Bayesian game, achieving high efficiency and scalability through a distributed ADMM-based algorithm.
Authors:Miray Ãzcan, Philipp Wiesner, Philipp WeiÃ, Odej Kao
Abstract:
The environmental impact of Large Language Models (LLMs) is rising significantly, with inference now accounting for more than half of their total lifecycle carbon emissions. However, existing simulation frameworks, which are increasingly used to determine efficient LLM deployments, lack any concept of power and, therefore, cannot accurately estimate inference-related emissions. We present a simulation framework to assess the energy and carbon implications of LLM inference under varying deployment setups. First, we extend a high-fidelity LLM inference simulator with a GPU power model that estimates power consumption based on utilization metrics, enabling analysis across configurations like batch size, sequence length, and model parallelism. Second, we integrate simulation outputs into an energy system co-simulation environment to quantify carbon emissions under specific grid conditions and explore the potential of carbon-aware scheduling. Through scenario-based analysis, our framework reveals how inference parameters affect energy demand and carbon footprint, demonstrates a renewable offset potential of up to 69.2% in an illustrative deployment case, and provides a foundation for future carbon-aware inference infrastructure design.
中文: 大型语言模型的环境影响日益显著,其中推理过程已占其全生命周期碳排放的一半以上,我们提出的仿真框架通过集成GPU功耗模型与能源系统协同仿真,能够精确评估能耗与碳排放,并利用碳感知调度实现高达69.2%的可再生能源抵消潜力。
English: The environmental impact of Large Language Models is growing, with inference now responsible for over half of their lifecycle carbon emissions, and we introduce a simulation framework that incorporates GPU power modeling and energy system integration to accurately assess and reduce energy use and carbon emissions through carbon-aware scheduling.
Authors:Philipp Wiesner, Odej Kao
Abstract:
Marginal Carbon Intensity (MCI) has been promoted as an effective metric for carbon-aware computing. Although it is already considered as impractical for carbon accounting purposes, many still view it as valuable when optimizing for grid flexibility by incentivizing electricity usage during curtailment periods. In this statement paper, we argue that MCI is neither reliable nor actionable for either purpose. We outline its fundamental limitations, including non-observability, reliance on opaque predictive models, and the lack of verifiability. Moreover, MCI fails to reflect curtailment caused by high-carbon sources and offers no insight into the quantity of available excess power. We advocate moving beyond MCI and instead call for research on more actionable metrics, such as direct reporting of excess power, explicit modeling of energy storage and grid stability, and integration with emerging granular renewable energy certificate markets.
Chinese: 边际碳强度(MCI)指标因其不可观测性、依赖不透明模型以及无法准确反映限电情况或剩余电力可用性,在碳核算和电网灵活性优化中均不可靠且不实用,亟需开发更具可操作性的替代方案。
English: The Marginal Carbon Intensity (MCI) metric is unreliable and impractical for both carbon accounting and grid flexibility optimization due to its non-observability, reliance on opaque models, and inability to accurately reflect curtailment or excess power availability, necessitating the development of more actionable alternatives.
Authors:Yunhao Yang, Neel P. Bhatt, Christian Ellis, Alvaro Velasquez, Zhangyang Wang, Ufuk Topcu
Abstract:
Logistics operators, from battlefield coordinators rerouting airlifts ahead of a storm to warehouse managers juggling late trucks, often face life-critical decisions that demand both domain expertise and rapid and continuous replanning. While popular methods like integer programming yield logistics plans that satisfy user-defined logical constraints, they are slow and assume an idealized mathematical model of the environment that does not account for uncertainty. On the other hand, large language models (LLMs) can handle uncertainty and promise to accelerate replanning while lowering the barrier to entry by translating free-form utterances into executable plans, yet they remain prone to misinterpretations and hallucinations that jeopardize safety and cost. We introduce a neurosymbolic framework that pairs the accessibility of natural-language dialogue with verifiable guarantees on goal interpretation. It converts user requests into structured planning specifications, quantifies its own uncertainty at the field and token level, and invokes an interactive clarification loop whenever confidence falls below an adaptive threshold. A lightweight model, fine-tuned on just 100 uncertainty-filtered examples, surpasses the zero-shot performance of GPT-4.1 while cutting inference latency by nearly 50%. These preliminary results highlight a practical path toward certifiable, real-time, and user-aligned decision-making for complex logistics.
中文摘要:该神经符号框架将自然语言交互与可验证保障相结合,通过将用户请求转化为结构化规划并采用交互式澄清机制处理不确定性,在仅需100个样本微调后即超越GPT-4.1的零样本性能,同时降低近50%推理延迟。
English Summary: The neurosymbolic framework introduced combines natural language accessibility with verifiable guarantees, converting user requests into structured plans while using interactive clarification loops to address uncertainty, achieving improved performance and reduced latency compared to GPT-4.1.
Authors:Jingzehua Xu, Guanwen Xie, Weiyi Liu, Jiwei Tang, Ziteng Yang, Tianxiang Xing, Yiyuan Yang, Shuai Zhang, Xiaofan Li
Abstract:
Autonomous Underwater Vehicles (AUVs) are essential for marine exploration, yet their control remains highly challenging due to nonlinear dynamics and uncertain environmental disturbances. This paper presents a diffusion-augmented Reinforcement Learning (RL) framework for robust AUV control, aiming to improve AUV's adaptability in dynamic underwater environments. The proposed framework integrates two core innovations: (1) A diffusion-based action generation framework that produces physically feasible and high-quality actions, enhanced by a high-dimensional state encoding mechanism combining current observations with historical states and actions through a novel diffusion U-Net architecture, significantly improving long-horizon planning capacity for robust control. (2) A sample-efficient hybrid learning architecture that synergizes diffusion-guided exploration with RL policy optimization, where the diffusion model generates diverse candidate actions and the RL critic selects the optimal action, achieving higher exploration efficiency and policy stability in dynamic underwater environments. Extensive simulation experiments validate the framework's superior robustness and flexibility, outperforming conventional control methods in challenging marine conditions, offering enhanced adaptability and reliability for AUV operations in underwater tasks. Finally, we will release the code publicly soon to support future research in this area.
中文: 本文提出一种扩散增强强化学习框架,通过扩散动作生成和混合学习提升自主水下航行器的控制鲁棒性,仿真验证其在动态水下环境中优于传统方法。
English: This paper introduces a diffusion-augmented reinforcement learning framework that enhances AUV control robustness through diffusion-based action generation and hybrid learning, validated by simulations to outperform traditional methods in dynamic underwater environments.
Authors:Xinkui Zhao, Jinsong Shu, Yangyang Wu, Guanjie Cheng, Zihe Liu, Naibo Wang, Shuiguang Deng, Zhongle Xie, Jianwei Yin
Abstract:
Multimodal Emotion Recognition (MER) often encounters incomplete multimodality in practical applications due to sensor failures or privacy protection requirements. While existing methods attempt to address various incomplete multimodal scenarios by balancing the training of each modality combination through additional gradients, these approaches face a critical limitation: training gradients from different modality combinations conflict with each other, ultimately degrading the performance of the final prediction model. In this paper, we propose a unimodal decoupled dynamic low-rank adaptation method based on modality combinations, named MCULoRA, which is a novel framework for the parameter-efficient training of incomplete multimodal learning models. MCULoRA consists of two key modules, modality combination aware low-rank adaptation (MCLA) and dynamic parameter fine-tuning (DPFT). The MCLA module effectively decouples the shared information from the distinct characteristics of individual modality combinations. The DPFT module adjusts the training ratio of modality combinations based on the separability of each modality's representation space, optimizing the learning efficiency across different modality combinations. Our extensive experimental evaluation in multiple benchmark datasets demonstrates that MCULoRA substantially outperforms previous incomplete multimodal learning approaches in downstream task accuracy.
中文: 多模态情感识别常面临数据不完整的问题,本文提出的MCULoRA方法通过解耦模态组合的共享与特有信息,并动态优化训练效率,在多个基准数据集上显著超越了现有方法。
English: Multimodal Emotion Recognition faces challenges with incomplete data, which the proposed MCULoRA method addresses by decoupling shared and distinct information across modality combinations and dynamically optimizing training efficiency, achieving superior performance over existing approaches.
Authors:Bo Zhao, Maya Okawa, Eric J. Bigelow, Rose Yu, Tomer Ullman, Ekdeep Singh Lubana, Hidenori Tanaka
Abstract:
As large language models (LLMs) increasingly power conversational agents, understanding how they model users' emotional states is critical for ethical deployment. Inspired by emotion wheels -- a psychological framework that argues emotions organize hierarchically -- we analyze probabilistic dependencies between emotional states in model outputs. We find that LLMs naturally form hierarchical emotion trees that align with human psychological models, and larger models develop more complex hierarchies. We also uncover systematic biases in emotion recognition across socioeconomic personas, with compounding misclassifications for intersectional, underrepresented groups. Human studies reveal striking parallels, suggesting that LLMs internalize aspects of social perception. Beyond highlighting emergent emotional reasoning in LLMs, our results hint at the potential of using cognitively-grounded theories for developing better model evaluations.
中文摘要:大型语言模型能自然形成与人类心理模型一致的情感层次结构,但在不同社会经济群体间存在系统性情感识别偏差,揭示了模型对社会认知的内化,同时为基于认知理论的模型评估提供了新思路。
English Summary: Large language models naturally form hierarchical emotion trees that align with human psychology, yet exhibit systematic biases in emotion recognition across socioeconomic groups, revealing their internalization of social perceptions while suggesting cognitively-grounded evaluation methods.
Authors:Hatef Otroshi Shahreza, Sébastien Marcel
Abstract:
Multimodal large language models (MLLMs) have shown remarkable performance in vision-language tasks. However, existing MLLMs are primarily trained on generic datasets, limiting their ability to reason on domain-specific visual cues such as those in facial images. In particular, tasks that require detailed understanding of facial structure, expression, emotion, and demographic features remain underexplored by MLLMs due to the lack of large-scale annotated face image-text datasets. In this work, we introduce FaceLLM, a multimodal large language model trained specifically for facial image understanding. To construct the training data, we propose a novel weakly supervised pipeline that uses ChatGPT with attribute-aware prompts to generate high-quality question-answer pairs based on images from the FairFace dataset. The resulting corpus, called FairFaceGPT, covers a diverse set of attributes including expression, pose, skin texture, and forensic information. Our experiments demonstrate that FaceLLM improves the performance of MLLMs on various face-centric tasks and achieves state-of-the-art performance. This work highlights the potential of synthetic supervision via language models for building domain-specialized MLLMs, and sets a precedent for trustworthy, human-centric multimodal AI systems. FairFaceGPT dataset and pretrained FaceLLM models are publicly available in the project page.
中文: FaceLLM是一种专为面部图像理解设计的模态大语言模型,通过使用ChatGPT的弱监督流程生成高质量数据训练而成,显著提升了面部相关任务的性能,并为领域专用AI系统树立了先例。
English: FaceLLM is a multimodal large language model specifically designed for facial image understanding, trained using a weakly supervised pipeline that generates high-quality data via ChatGPT, which enhances performance on face-centric tasks and sets a precedent for domain-specialized AI systems.
Authors:Tung Sum Thomas Kwok, Zeyong Zhang, Chi-Hua Wang, Guang Cheng
Abstract:
Tabular data synthesis for supervised learning ('SL') model training is gaining popularity in industries such as healthcare, finance, and retail. Despite the progress made in tabular data generators, models trained with synthetic data often underperform compared to those trained with original data. This low SL utility of synthetic data stems from class imbalance exaggeration and SL data relationship overlooked by tabular generator. To address these challenges, we draw inspirations from techniques in emerging data-centric artificial intelligence and elucidate Pruning and ReOrdering ('PRRO'), a novel pipeline that integrates data-centric techniques into tabular data synthesis. PRRO incorporates data pruning to guide the table generator towards observations with high signal-to-noise ratio, ensuring that the class distribution of synthetic data closely matches that of the original data. Besides, PRRO employs a column reordering algorithm to align the data modeling structure of generators with that of SL models. These two modules enable PRRO to optimize SL utility of synthetic data. Empirical experiments on 22 public datasets show that synthetic data generated using PRRO enhances predictive performance compared to data generated without PRRO. Specifically, synthetic replacement of original data yields an average improvement of 26.74% and up to 871.46% improvement using PRRO, while synthetic appendant to original data results with PRRO-generated data results in an average improvement of 6.13% and up to 200.32%. Furthermore, experiments on six highly imbalanced datasets show that PRRO enables the generator to produce synthetic data with a class distribution that resembles the original data more closely, achieving a similarity improvement of 43%. Through PRRO, we foster a seamless integration of data synthesis to subsequent SL prediction, promoting quality and accessible data analysis.
中文摘要:PRRO流程通过整合数据剪枝和列重排序技术,有效解决合成表格数据在类别失衡和结构错配上的问题,显著提升了监督学习模型的预测性能,经实证测试性能改进显著。
English Summary: The PRRO pipeline enhances synthetic tabular data utility for supervised learning by incorporating data pruning and column reordering to address class imbalance and structural misalignment, achieving significant performance improvements in empirical tests.
Authors:Lance Ying, Ryan Truong, Joshua B. Tenenbaum, Samuel J. Gershman
Abstract:
Social learning is a powerful mechanism through which agents learn about the world from others. However, humans don't always choose to observe others, since social learning can carry time and cognitive resource costs. How do people balance social and non-social learning? In this paper, we propose a rational mentalizing model of the decision to engage in social learning. This model estimates the utility of social learning by reasoning about the other agent's goal and the informativity of their future actions. It then weighs the utility of social learning against the utility of self-exploration (non-social learning). Using a multi-player treasure hunt game, we show that our model can quantitatively capture human trade-offs between social and non-social learning. Furthermore, our results indicate that these two components allow agents to flexibly apply social learning to achieve their goals more efficiently.
中文: 本文提出了一种理性心理化模型,通过评估他人目标和行动信息性来权衡社会与非社会学习的效用,该模型在寻宝游戏中有效模拟了人类决策,并提高了达成目标的效率。
English: The paper introduces a rational mentalizing model that balances social and non-social learning by evaluating their utilities based on others' goals and action informativeness, which effectively captures human decision-making in a treasure hunt game and enhances goal achievement efficiency.
Authors:Yunsoo Kim, Jinge Wu, Honghan Wu
Abstract:
Large Vision-Language Models (LVLMs) have demonstrated promising performance in chest X-ray (CXR) analysis. To enhance human-computer interaction, several studies have incorporated radiologists' eye gaze, typically through heatmaps or textual prompts. However, these methods often overlook the sequential order of eye movements, which could provide valuable insights by highlighting both the areas of interest and the order in which they are examined. In this work, we propose a novel approach called RadEyeVideo that integrates radiologists' eye-fixation data as a video sequence, capturing both the temporal and spatial dynamics of their gaze. We evaluate this method in CXR report generation and disease diagnosis using three general-domain, open-source LVLMs with video input capabilities. When prompted with eye-gaze videos, model performance improves by up to 24.6% in the report generation task and on average 15.2% for both tasks using scaled evaluation metrics. Notably, RadEyeVideo enhanced an open-domain LVLM model, LLaVA-OneVision, to surpass task-specific medical LVLMs such as MAIRA-2 and CheXagent, trained on large Chest X-ray data. This work highlights that domain expert's knowledge (eye-gaze information in this case), when effectively integrated with LVLMs, can significantly enhance general-domain models' capabilities in clinical tasks. RadEyeVideo is a step toward a scalable human-centered approach of utilizing LVLMs in medical image analytics.
Chinese Summary: 提出的RadEyeVideo方法通过将放射科医生的眼动序列作为视频输入整合到大型视觉语言模型中,显著提升了胸部X光分析性能,在报告生成任务中性能提升高达24.6%,并使通用领域模型超越了专业医学模型的表现。
English Summary: The proposed RadEyeVideo method enhances chest X-ray analysis by integrating radiologists' eye-gaze sequences as video input into large vision-language models, achieving performance improvements of up to 24.6% in report generation and enabling general-domain models to surpass specialized medical models.
Authors:Daewon Choi, Jimin Lee, Jihoon Tack, Woomin Song, Saket Dingliwal, Sai Muralidhar Jayanthi, Bhavana Ganesh, Jinwoo Shin, Aram Galstyan, Sravan Babu Bodapati
Abstract:
Recent large language models have shown promising capabilities in long-form reasoning, following structured chains of thought before arriving at a final answer. However, we observe that these reasoning paths tend to include substantial redundancy; analyzing attention patterns reveals that attention scores are widely scattered, particularly incorrect answers exhibit greater attention sparsity. In this paper, we demonstrate that deliberately removing this redundancy in the reasoning process significantly improves performance through clear thinking, i.e., removing distraction. Specifically, we systematically identify reasoning redundancy by measuring token-level attention scores to a special end-of-thinking token, which is appended to an explicit instruction inserted to conclude each intermediate reasoning step. Furthermore, we propose structure-aware pruning that prioritizes removing tokens in low-contributing reasoning chunks over individual tokens. After evicting redundant tokens, we remove the injected end-of-thinking instruction, then resume the reasoning generation. We demonstrate that our method significantly improves overall accuracy across reasoning-intensive benchmarks without any training involved. In particular, our method shows strong performance on challenging mathematical competition benchmarks such as AIME and AMC, where reasoning redundancy is more prevalent.
Chinese: 最新大型语言模型在推理路径中存在冗余,但通过基于注意力的剪枝系统性地消除这种冗余,无需训练即可显著提升推理基准测试的性能。
English: Recent large language models exhibit redundancy in reasoning paths, but systematically removing this redundancy through attention-based pruning significantly improves performance on reasoning benchmarks without requiring training.
Authors:Gwanhyeong Koo, Sunjae Yoon, Younghwan Lee, Ji Woo Hong, Chang D. Yoo
Abstract:
Drag-based editing allows precise object manipulation through point-based control, offering user convenience. However, current methods often suffer from a geometric inconsistency problem by focusing exclusively on matching user-defined points, neglecting the broader geometry and leading to artifacts or unstable edits. We propose FlowDrag, which leverages geometric information for more accurate and coherent transformations. Our approach constructs a 3D mesh from the image, using an energy function to guide mesh deformation based on user-defined drag points. The resulting mesh displacements are projected into 2D and incorporated into a UNet denoising process, enabling precise handle-to-target point alignment while preserving structural integrity. Additionally, existing drag-editing benchmarks provide no ground truth, making it difficult to assess how accurately the edits match the intended transformations. To address this, we present VFD (VidFrameDrag) benchmark dataset, which provides ground-truth frames using consecutive shots in a video dataset. FlowDrag outperforms existing drag-based editing methods on both VFD Bench and DragBench.
中文摘要:FlowDrag提出了一种基于拖拽的图像编辑方法,通过三维网格变形和用户引导点实现精确且几何一致的变换,同时推出了带有真实标注的VFD基准数据集以改进评估效果。
English Summary: FlowDrag introduces a method for drag-based image editing that uses 3D mesh deformation guided by user points to achieve precise and geometrically consistent transformations, while also presenting the VFD benchmark with ground-truth data for better evaluation.
Authors:Oleksandra Sobchyshak, Santiago Berrezueta-Guzman, Stefan Wagner
Abstract:
Unreal Engine is a platform that has influenced immersive storytelling and virtual reality (VR) through its advanced features and diverse applications. This paper provides an in-depth technical review of Unreal Engine. It analyzes its key innovations in creating hyper-realistic environments and emotionally engaging narratives, with significant applications in gaming, virtual production, education, cultural preservation, and healthcare. The findings of this article highlight Unreal Engine's transformative impact across industries, demonstrating its ability to merge storytelling with cutting-edge technologies. Case studies illustrate how Unreal Engine facilitates seamless visuals, audio, and interactivity integration to create compelling experiences. Additionally, this study identifies Unreal Engine's versatility in applications ranging from procedural content generation and AI-driven workflows to smart city simulations and VR-based rehabilitation programs.
While Unreal Engine sets new benchmarks for visual fidelity and interactivity, this paper underscores critical challenges, including its high hardware demands, limited accessibility, and ethical concerns related to over-immersion and data privacy. Addressing these challenges through cloud-based rendering, inclusive design, and ethical practices is essential for broader adoption and sustainability. This review concludes that Unreal Engine is suitable for innovation and interdisciplinary collaboration. Its ability to empower creators, redefine workflows, and push the boundaries of immersive storytelling positions Unreal Engine as pivotal in shaping the future of virtual reality and interactive media.
虚幻引擎通过其超逼真环境和跨行业应用革新了沉浸式叙事与虚拟现实,但也面临硬件要求高和伦理问题等挑战,需通过云端渲染与包容性设计推动更广泛采用。
Unreal Engine revolutionizes immersive storytelling and VR with its hyper-realistic environments and versatile applications across industries, though it faces challenges like high hardware demands and ethical concerns that require solutions for wider adoption.
Authors:Wei Yao, Shuzhao Xie, Letian Li, Weixiang Zhang, Zhixin Lai, Shiqi Dai, Ke Zhang, Zhi Wang
Abstract:
Current 4D Gaussian frameworks for dynamic scene reconstruction deliver impressive visual fidelity and rendering speed, however, the inherent trade-off between storage costs and the ability to characterize complex physical motions significantly limits the practical application of these methods. To tackle these problems, we propose SD-GS, a compact and efficient dynamic Gaussian splatting framework for complex dynamic scene reconstruction, featuring two key contributions. First, we introduce a deformable anchor grid, a hierarchical and memory-efficient scene representation where each anchor point derives multiple 3D Gaussians in its local spatiotemporal region and serves as the geometric backbone of the 3D scene. Second, to enhance modeling capability for complex motions, we present a deformation-aware densification strategy that adaptively grows anchors in under-reconstructed high-dynamic regions while reducing redundancy in static areas, achieving superior visual quality with fewer anchors. Experimental results demonstrate that, compared to state-of-the-art methods, SD-GS achieves an average of 60\% reduction in model size and an average of 100\% improvement in FPS, significantly enhancing computational efficiency while maintaining or even surpassing visual quality.
中文摘要:SD-GS通过可变形锚点网格和形变感知致密化策略,在保持更优画质的同时,将动态场景重建的模型体积缩小60%、渲染速度提升100%,实现了高效紧凑的复杂动态建模。
English Summary: SD-GS introduces a deformable anchor grid and deformation-aware densification to achieve compact dynamic scene reconstruction with 60% smaller models and 100% faster rendering while maintaining superior visual quality.
Authors:Xinyao Yu, Hao Sun, Zeyu Ling, Ziwei Niu, Zhenjia Bai, Rui Qin, Yen-Wei Chen, Lanfen Lin
Abstract:
In recent years, large-scale pre-trained multimodal models (LMMs) generally emerge to integrate the vision and language modalities, achieving considerable success in multimodal tasks, such as text-image classification. The growing size of LMMs, however, results in a significant computational cost for fine-tuning these models for downstream tasks. Hence, prompt-based interaction strategy is studied to align modalities more efficiently. In this context, we propose a novel efficient prompt-based multimodal interaction strategy, namely Efficient Prompt Interaction for text-image Classification (EPIC). Specifically, we utilize temporal prompts on intermediate layers, and integrate different modalities with similarity-based prompt interaction, to leverage sufficient information exchange between modalities. Utilizing this approach, our method achieves reduced computational resource consumption and fewer trainable parameters (about 1\% of the foundation model) compared to other fine-tuning strategies. Furthermore, it demonstrates superior performance on the UPMC-Food101 and SNLI-VE datasets, while achieving comparable performance on the MM-IMDB dataset.
近年来,大规模预训练多模态模型在文本-图像分类等任务中取得显著成功,但计算成本高昂;为此提出的EPIC提示交互策略通过时序提示和相似性整合,在减少资源消耗和参数的同时,在多个数据集上保持或提升了性能。
Large-scale pre-trained multimodal models achieve success in tasks like text-image classification but face high computational costs, leading to the development of EPIC, a prompt-based strategy that reduces resource use and parameters while maintaining or improving performance across multiple datasets.
Authors:Wanhua Li, Yujie Zhao, Minghan Qin, Yang Liu, Yuanhao Cai, Chuang Gan, Hanspeter Pfister
Abstract:
In this paper, we introduce LangSplatV2, which achieves high-dimensional feature splatting at 476.2 FPS and 3D open-vocabulary text querying at 384.6 FPS for high-resolution images, providing a 42 $\times$ speedup and a 47 $\times$ boost over LangSplat respectively, along with improved query accuracy. LangSplat employs Gaussian Splatting to embed 2D CLIP language features into 3D, significantly enhancing speed and learning a precise 3D language field with SAM semantics. Such advancements in 3D language fields are crucial for applications that require language interaction within complex scenes. However, LangSplat does not yet achieve real-time inference performance (8.2 FPS), even with advanced A100 GPUs, severely limiting its broader application. In this paper, we first conduct a detailed time analysis of LangSplat, identifying the heavyweight decoder as the primary speed bottleneck. Our solution, LangSplatV2 assumes that each Gaussian acts as a sparse code within a global dictionary, leading to the learning of a 3D sparse coefficient field that entirely eliminates the need for a heavyweight decoder. By leveraging this sparsity, we further propose an efficient sparse coefficient splatting method with CUDA optimization, rendering high-dimensional feature maps at high quality while incurring only the time cost of splatting an ultra-low-dimensional feature. Our experimental results demonstrate that LangSplatV2 not only achieves better or competitive query accuracy but is also significantly faster. Codes and demos are available at our project page: https://langsplat-v2.github.io.
中文:LangSplatV2通过稀疏编码技术实现了476.2 FPS的高维特征渲染和384.6 FPS的文本查询速度,在保持精度的同时比前代提速42倍,成功解决了实时3D语言交互的性能瓶颈。
English: LangSplatV2 achieves real-time 3D language interaction with speeds up to 476.2 FPS for feature splatting and 384.6 FPS for text querying, representing a 42x speedup over its predecessor while eliminating the decoder bottleneck through sparse coding optimization.
Authors:Wanhua Li, Yujie Zhao, Minghan Qin, Yang Liu, Yuanhao Cai, Chuang Gan, Hanspeter Pfister
Abstract:
In this paper, we introduce LangSplatV2, which achieves high-dimensional feature splatting at 476.2 FPS and 3D open-vocabulary text querying at 384.6 FPS for high-resolution images, providing a 42 $\times$ speedup and a 47 $\times$ boost over LangSplat respectively, along with improved query accuracy. LangSplat employs Gaussian Splatting to embed 2D CLIP language features into 3D, significantly enhancing speed and learning a precise 3D language field with SAM semantics. Such advancements in 3D language fields are crucial for applications that require language interaction within complex scenes. However, LangSplat does not yet achieve real-time inference performance (8.2 FPS), even with advanced A100 GPUs, severely limiting its broader application. In this paper, we first conduct a detailed time analysis of LangSplat, identifying the heavyweight decoder as the primary speed bottleneck. Our solution, LangSplatV2 assumes that each Gaussian acts as a sparse code within a global dictionary, leading to the learning of a 3D sparse coefficient field that entirely eliminates the need for a heavyweight decoder. By leveraging this sparsity, we further propose an efficient sparse coefficient splatting method with CUDA optimization, rendering high-dimensional feature maps at high quality while incurring only the time cost of splatting an ultra-low-dimensional feature. Our experimental results demonstrate that LangSplatV2 not only achieves better or competitive query accuracy but is also significantly faster. Codes and demos are available at our project page: https://langsplat-v2.github.io.
中文:LangSplatV2通过稀疏编码技术实现了476.2 FPS的高维特征渲染和384.6 FPS的文本查询速度,在保持精度的同时比前代提速42倍,成功解决了实时3D语言交互的性能瓶颈。
English: LangSplatV2 achieves real-time 3D language interaction with speeds up to 476.2 FPS for feature splatting and 384.6 FPS for text querying, representing a 42x speedup over its predecessor while eliminating the decoder bottleneck through sparse coding optimization.
Authors:Fan-Yun Sun, Shengguang Wu, Christian Jacobsen, Thomas Yim, Haoming Zou, Alex Zook, Shangru Li, Yu-Hsin Chou, Ethem Can, Xunlei Wu, Clemens Eppner, Valts Blukis, Jonathan Tremblay, Jiajun Wu, Stan Birchfield, Nick Haber
Abstract:
Despite large-scale pretraining endowing models with language and vision reasoning capabilities, improving their spatial reasoning capability remains challenging due to the lack of data grounded in the 3D world. While it is possible for humans to manually create immersive and interactive worlds through 3D graphics, as seen in applications such as VR, gaming, and robotics, this process remains highly labor-intensive. In this paper, we propose a scalable method for generating high-quality 3D environments that can serve as training data for foundation models. We recast 3D environment building as a sequential decision-making problem, employing Vision-Language-Models (VLMs) as policies that output actions to jointly craft a 3D environment's layout, materials, lighting, and assets. Our proposed framework, 3D-Generalist, trains VLMs to generate more prompt-aligned 3D environments via self-improvement fine-tuning. We demonstrate the effectiveness of 3D-Generalist and the proposed training strategy in generating simulation-ready 3D environments. Furthermore, we demonstrate its quality and scalability in synthetic data generation by pretraining a vision foundation model on the generated data. After fine-tuning the pre-trained model on downstream tasks, we show that it surpasses models pre-trained on meticulously human-crafted synthetic data and approaches results achieved with real data orders of magnitude larger.
中文: 本文提出的3D-Generalist框架通过视觉语言模型生成高质量3D环境作为训练数据,有效解决了人工创建成本高的问题,显著提升了基础模型的空间推理能力。
English: This paper introduces 3D-Generalist, a scalable framework that uses Vision-Language Models to generate high-quality 3D environments as training data, overcoming the limitations of manual creation and enhancing spatial reasoning in foundation models.
Authors:Shun Wang, Tyler Loakman, Youbo Lei, Yi Liu, Bohao Yang, Yuting Zhao, Dong Yang, Chenghua Lin
Abstract:
Large Language Models (LLMs) are traditionally viewed as black-box algorithms, therefore reducing trustworthiness and obscuring potential approaches to increasing performance on downstream tasks. In this work, we apply an effective LLM decomposition method using a dictionary-learning approach with sparse autoencoders. This helps extract monosemantic features from polysemantic LLM neurons. Remarkably, our work identifies model-internal misunderstanding, allowing the automatic reformulation of the prompts with additional annotations to improve the interpretation by LLMs. Moreover, this approach demonstrates a significant performance improvement in downstream tasks, such as mathematical reasoning and metaphor detection.
中文摘要:本研究采用基于字典学习和稀疏自编码器的方法分解大语言模型,从神经元中提取单义特征并实现提示的自动重构,显著提升了数学推理和隐喻检测等下游任务的性能。
English Summary: This study introduces a dictionary-learning method with sparse autoencoders to decompose LLMs, extracting monosemantic features from neurons and enabling automatic prompt reformulation, which significantly enhances performance in tasks like mathematical reasoning and metaphor detection.
Authors:Yunhan Yang, Yufan Zhou, Yuan-Chen Guo, Zi-Xin Zou, Yukun Huang, Ying-Tian Liu, Hao Xu, Ding Liang, Yan-Pei Cao, Xihui Liu
Abstract:
The creation of 3D assets with explicit, editable part structures is crucial for advancing interactive applications, yet most generative methods produce only monolithic shapes, limiting their utility. We introduce OmniPart, a novel framework for part-aware 3D object generation designed to achieve high semantic decoupling among components while maintaining robust structural cohesion. OmniPart uniquely decouples this complex task into two synergistic stages: (1) an autoregressive structure planning module generates a controllable, variable-length sequence of 3D part bounding boxes, critically guided by flexible 2D part masks that allow for intuitive control over part decomposition without requiring direct correspondences or semantic labels; and (2) a spatially-conditioned rectified flow model, efficiently adapted from a pre-trained holistic 3D generator, synthesizes all 3D parts simultaneously and consistently within the planned layout. Our approach supports user-defined part granularity, precise localization, and enables diverse downstream applications. Extensive experiments demonstrate that OmniPart achieves state-of-the-art performance, paving the way for more interpretable, editable, and versatile 3D content.
中文: OmniPart提出了一种双阶段框架,通过自回归结构规划和空间条件部件合成,实现了部件感知的3D对象生成,兼具高语义解耦与结构一致性。
English: OmniPart introduces a two-stage framework for generating part-aware 3D objects, enabling high semantic decoupling and structural cohesion through autoregressive structure planning and spatially-conditioned part synthesis.
Authors:Jaewan Park, Farid Ahmed, Kazuma Kobayashi, Seid Koric, Syed Bahauddin Alam, Iwona Jasiuk, Diab Abueidda
Abstract:
Video-diffusion models have recently set the standard in video generation, inpainting, and domain translation thanks to their training stability and high perceptual fidelity. Building on these strengths, we repurpose conditional video diffusion as a physics surrogate for spatio-temporal fields governed by partial differential equations (PDEs). Our two-stage surrogate first applies a Sequential Deep Operator Network (S-DeepONet) to produce a coarse, physics-consistent prior from the prescribed boundary or loading conditions. The prior is then passed to a conditional video diffusion model that learns only the residual: the point-wise difference between the ground truth and the S-DeepONet prediction. By shifting the learning burden from the full solution to its much smaller residual space, diffusion can focus on sharpening high-frequency structures without sacrificing global coherence. The framework is assessed on two disparate benchmarks: (i) vortex-dominated lid-driven cavity flow and (ii) tensile plastic deformation of dogbone specimens. Across these data sets the hybrid surrogate consistently outperforms its single-stage counterpart, cutting the mean relative L2 error from 4.57% to 0.83% for the flow problem and from 4.42% to 2.94% for plasticity, a relative improvements of 81.8% and 33.5% respectively. The hybrid approach not only lowers quantitative errors but also improves visual quality, visibly recovering fine spatial details. These results show that (i) conditioning diffusion on a physics-aware prior enables faithful reconstruction of localized features, (ii) residual learning reduces the problem, accelerating convergence and enhancing accuracy, and (iii) the same architecture transfers seamlessly from incompressible flow to nonlinear elasto-plasticity without problem-specific architectural modifications, highlighting its broad applicability to nonlinear, time-dependent continua.
Chinese: 研究人员开发了一种结合视频扩散模型和序列深度算子网络的混合物理代理方法,通过残差学习显著降低了时空偏微分方程解的误差并提升了视觉质量。
English: Researchers developed a hybrid physics surrogate using video diffusion models and sequential deep operator networks, which significantly reduces errors and improves visual quality by focusing on residual learning for spatio-temporal PDE solutions.
Authors:Quanzhu Niu, Yikang Zhou, Shihao Chen, Tao Zhang, Shunping Ji
Abstract:
Video Instance Segmentation (VIS) fundamentally struggles with pervasive challenges including object occlusions, motion blur, and appearance variations during temporal association. To overcome these limitations, this work introduces geometric awareness to enhance VIS robustness by strategically leveraging monocular depth estimation. We systematically investigate three distinct integration paradigms. Expanding Depth Channel (EDC) method concatenates the depth map as input channel to segmentation networks; Sharing ViT (SV) designs a uniform ViT backbone, shared between depth estimation and segmentation branches; Depth Supervision (DS) makes use of depth prediction as an auxiliary training guide for feature learning. Though DS exhibits limited effectiveness, benchmark evaluations demonstrate that EDC and SV significantly enhance the robustness of VIS. When with Swin-L backbone, our EDC method gets 56.2 AP, which sets a new state-of-the-art result on OVIS benchmark. This work conclusively establishes depth cues as critical enablers for robust video understanding.
Chinese: 本研究通过引入单目深度估计的几何感知来增强视频实例分割的鲁棒性,其中扩展深度通道和共享ViT方法在基准测试中取得了最优性能。
English: This study enhances Video Instance Segmentation robustness by integrating geometric awareness through monocular depth estimation, with the Expanding Depth Channel and Sharing ViT methods achieving state-of-the-art performance on benchmarks.
Authors:Ourui Fu, Hangzhou He, Xinliang Zhang, Lei Zhu, Shuang Zeng, ZhaoHeng Xie, Yanye Lu
Abstract:
The annotation bottleneck in semantic segmentation has driven significant interest in few-shot segmentation, which aims to develop segmentation models capable of generalizing rapidly to novel classes using minimal exemplars. Conventional training paradigms typically generate query prior maps by extracting masked-area features from support images, followed by making predictions guided by these prior maps. However, current approaches remain constrained by two critical limitations stemming from inter- and intra-image discrepancies, both of which significantly degrade segmentation performance: 1) The semantic gap between support and query images results in mismatched features and inaccurate prior maps; 2) Visually similar yet semantically distinct regions within support or query images lead to false negative or false positive predictions. We propose a novel FSS method called \textbf{I$^2$R}: 1) Using category-specific high level representations which aggregate global semantic cues from support and query images, enabling more precise inter-image region localization and address the first limitation. 2) Directional masking strategy that suppresses inconsistent support-query pixel pairs, which exhibit high feature similarity but conflicting mask, to mitigate the second issue. Experiments demonstrate that our method outperforms state-of-the-art approaches, achieving improvements of 1.9\% and 2.1\% in mIoU under the 1-shot setting on PASCAL-5$^i$ and COCO-20$^i$ benchmarks, respectively.
中文:提出的I²R方法通过利用全局语义线索和定向掩蔽策略,克服了少样本分割中的图像间语义差异和图像内视觉模糊问题,在基准数据集上实现了最先进的性能提升。
English: The proposed I²R method overcomes inter-image semantic gaps and intra-image visual ambiguities in few-shot segmentation by leveraging global semantic cues and a directional masking strategy, achieving state-of-the-art performance improvements on benchmark datasets.
Authors:Bole Liu, Li Qiao, Ye Wang, Zhen Gao, Yu Ma, Keke Ying, Tong Qin
Abstract:
With the emergence of 6G networks and proliferation of visual applications, efficient image transmission under adverse channel conditions is critical. We present a text-guided token communication system leveraging pre-trained foundation models for wireless image transmission with low bandwidth. Our approach converts images to discrete tokens, applies 5G NR polar coding, and employs text-guided token prediction for reconstruction. Evaluations on ImageNet show our method outperforms Deep Source Channel Coding with Attention Modules (ADJSCC) in perceptual quality and semantic preservation at Signal-to-Noise Ratios (SNRs) above 0 dB while mitigating the cliff effect at lower SNRs. Our system requires no scenario-specific retraining and exhibits superior cross-dataset generalization, establishing a new paradigm for efficient image transmission aligned with human perceptual priorities.
Chinese: 本文提出了一种基于文本引导的令牌通信系统,利用基础模型实现低带宽下的高效无线图像传输,在不同信号条件下无需重新训练即可在感知质量和语义保持方面超越现有方法。
English: This paper introduces a text-guided token communication system that uses foundation models to enable efficient wireless image transmission with low bandwidth, outperforming existing methods in perceptual quality and semantic preservation across various signal conditions without requiring retraining.
Authors:Litian Zhang, Xiaoming Zhang, Bingyu Yan, Ziyi Zhou, Bo Zhang, Zhenyu Guan, Xi Zhang, Chaozhuo Li
Abstract:
The exponential growth of social media and generative AI has transformed information dissemination, fostering connectivity but also accelerating the spread of misinformation. Understanding information propagation dynamics and developing effective control strategies is essential to mitigate harmful content. Traditional models, such as SIR, provide basic insights but inadequately capture the complexities of online interactions. Advanced methods, including attention mechanisms and graph neural networks, enhance accuracy but typically overlook user psychology and behavioral dynamics. Large language models (LLMs), with their human-like reasoning, offer new potential for simulating psychological aspects of information spread. We introduce an LLM-based simulation environment capturing agents' evolving attitudes, emotions, and responses. Initial experiments, however, revealed significant gaps between LLM-generated behaviors and authentic human dynamics, especially in stance detection and psychological realism. A detailed evaluation through Social Information Processing Theory identified major discrepancies in goal-setting and feedback evaluation, stemming from the lack of emotional processing in standard LLM training. To address these issues, we propose the Social Information Processing-based Chain of Thought (SIP-CoT) mechanism enhanced by emotion-guided memory. This method improves the interpretation of social cues, personalization of goals, and evaluation of feedback. Experimental results confirm that SIP-CoT-enhanced LLM agents more effectively process social information, demonstrating behaviors, attitudes, and emotions closer to real human interactions. In summary, this research highlights critical limitations in current LLM-based propagation simulations and demonstrates how integrating SIP-CoT and emotional memory significantly enhances the social intelligence and realism of LLM agents.
社交媒体的指数级增长和生成式人工智能改变了信息传播方式,既促进了连接也加速了错误信息的扩散,传统模型对此应对不足,而大型语言模型虽具潜力却在模拟人类行为和心理真实性上存在显著差距,因此提出了基于社会信息处理思维链机制并辅以情感引导记忆的方法,以提升LLM代理的社会智能和真实性。
The exponential growth of social media and generative AI has transformed information dissemination, fostering connectivity but also accelerating the spread of misinformation, necessitating advanced control strategies that traditional models inadequately address, while large language models (LLMs) offer new potential but reveal significant gaps in simulating human behaviors and psychological realism, which are addressed through the proposed Social Information Processing-based Chain of Thought (SIP-CoT) mechanism enhanced by emotion-guided memory to improve social intelligence and realism in LLM agents.
Authors:Mingzhe Li, Jing Xiang, Qishen Zhang, Kaiyang Wan, Xiuying Chen
Abstract:
Knowledge distillation typically involves transferring knowledge from a Large Language Model (LLM) to a Smaller Language Model (SLM). However, in tasks such as text matching, fine-tuned smaller models often yield more effective domain-specific representations, as they focus on optimizing the similarity of input pairs. To leverage both the specialized strengths of small models and the rich semantic understanding of LLMs, we introduce a flipped knowledge distillation paradigm, where LLM learns from SLM. Specifically, we address the architectural gap between decoder-only LLMs and smaller encoder-based models by reinterpreting LLMs in an encoder-decoder manner using LoRA. The encoder generates compressed representations, while the decoder maps them to the output space. During training, the encoder produces representations and their similarities, which are then aligned with the similarity scores produced by the teacher, using our proposed Margin-aware Contrastive Learning (MCL) approach. The MCL ensures accurate similarity for both positive and negative pairs, and adaptively handles the internal differences within positive and negative samples. Our paradigm requires only a reasonably good-performing SLM, allowing the LLM to achieve improved performance. Experiments on financial and healthcare benchmarks, as well as real-world applications, confirm its effectiveness, and the model has been fully deployed in an online environment.
中文摘要:本文提出了一种翻转知识蒸馏方法,让大语言模型向专业小模型学习,通过创新的边界感知对比学习技术弥合架构差异,在金融和医疗等专业领域的文本匹配任务中显著提升性能。
English Summary: This paper introduces a flipped knowledge distillation approach where Large Language Models learn from specialized Smaller Language Models, using a novel Margin-aware Contrastive Learning method to bridge architectural gaps and enhance performance in domain-specific tasks like text matching.
Authors:Xin Dong, Shichao Dong, Jin Wang, Jing Huang, Li Zhou, Zenghui Sun, Lihua Jing, Jingsong Lan, Xiaoyong Zhu, Bo Zheng
Abstract:
Hallucinations in large vision-language models (LVLMs) pose significant challenges for real-world applications, as LVLMs may generate responses that appear plausible yet remain inconsistent with the associated visual content. This issue rarely occurs in human cognition. We argue that this discrepancy arises from humans' ability to effectively leverage multimodal interaction information in data samples. Specifically, humans typically first gather multimodal information, analyze the interactions across modalities for understanding, and then express their understanding through language. Motivated by this observation, we conduct extensive experiments on popular LVLMs and obtained insights that surprisingly reveal human-like, though less pronounced, cognitive behavior of LVLMs on multimodal samples. Building on these findings, we further propose \textbf{INTER}: \textbf{Inter}action Guidance Sampling, a novel training-free algorithm that mitigate hallucinations without requiring additional data. Specifically, INTER explicitly guides LVLMs to effectively reapply their understanding of multimodal interaction information when generating responses, thereby reducing potential hallucinations. On six benchmarks including VQA and image captioning tasks, INTER achieves an average improvement of up to 3.4\% on five LVLMs compared to the state-of-the-art decoding strategy. The code will be released when the paper is accepted.
Chinese: 该研究提出了一种无需训练的INTER算法,通过引导大型视觉语言模型更有效地利用多模态交互信息来减少幻觉,在基准测试中实现了高达3.4%的性能提升且无需额外数据。
English: The study introduces INTER, a training-free algorithm that mitigates hallucinations in large vision-language models by guiding them to better utilize multimodal interaction information, achieving up to 3.4% improvement on benchmarks without additional data.
Authors:Matteo Attimonelli, Alessandro De Bellis, Claudio Pomo, Dietmar Jannach, Eugenio Di Sciascio, Tommaso Di Noia
Abstract:
Pre-trained language models (PLMs) are widely used to derive semantic representations from item metadata in recommendation and search. In sequential recommendation, PLMs enhance ID-based embeddings through textual metadata, while in product search, they align item characteristics with user intent. Recent studies suggest task and domain-specific fine-tuning are needed to improve representational power. This paper challenges this assumption, showing that Generalist Text Embedding Models (GTEs), pre-trained on large-scale corpora, can guarantee strong zero-shot performance without specialized adaptation. Our experiments demonstrate that GTEs outperform traditional and fine-tuned models in both sequential recommendation and product search. We attribute this to a superior representational power, as they distribute features more evenly across the embedding space. Finally, we show that compressing embedding dimensions by focusing on the most informative directions (e.g., via PCA) effectively reduces noise and improves the performance of specialized models. To ensure reproducibility, we provide our repository at https://split.to/gte4ps.
中文: 通用文本嵌入模型(GTEs)通过在嵌入空间中更均匀地分布特征,在序列推荐和产品搜索中实现了卓越的零样本性能,无需专门调优即可超越专业模型。
English: Generalist Text Embedding Models (GTEs) achieve superior zero-shot performance in sequential recommendation and product search by distributing features more evenly across the embedding space, outperforming specialized models without requiring fine-tuning.
Authors:Giulio Schiavi, Andrei Cramariuc, Lionel Ott, Roland Siegwart
Abstract:
Human guidance has emerged as a powerful tool for enhancing reinforcement learning (RL). However, conventional forms of guidance such as demonstrations or binary scalar feedback can be challenging to collect or have low information content, motivating the exploration of other forms of human input. Among these, relative feedback (i.e., feedback on how to improve an action, such as "more to the left") offers a good balance between usability and information richness. Previous research has shown that relative feedback can be used to enhance policy search methods. However, these efforts have been limited to specific policy classes and use feedback inefficiently. In this work, we introduce a novel method to learn from relative feedback and combine it with off-policy reinforcement learning. Through evaluations on two sparse-reward tasks, we demonstrate our method can be used to improve the sample efficiency of reinforcement learning by guiding its exploration process. Additionally, we show it can adapt a policy to changes in the environment or the user's preferences. Finally, we demonstrate real-world applicability by employing our approach to learn a navigation policy in a sparse reward setting.
中文: 本文提出一种将相对人类反馈与离策略强化学习相结合的新方法,在稀疏奖励任务中提高了样本效率与策略适应性,并通过导航策略验证了实际应用价值。
English: This paper introduces a novel method that integrates relative human feedback with off-policy reinforcement learning to improve sample efficiency and adaptability in sparse-reward tasks, demonstrating real-world applicability in navigation policies.
Authors:Tao Zhang, Shiqing Wei, Shihao Chen, Wenling Yu, Muying Luo, Shunping Ji
Abstract:
Automatically extracting vectorized building contours from remote sensing imagery is crucial for urban planning, population estimation, and disaster assessment. Current state-of-the-art methods rely on complex multi-stage pipelines involving pixel segmentation, vectorization, and polygon refinement, which limits their scalability and real-world applicability. Inspired by the remarkable reasoning capabilities of Large Language Models (LLMs), we introduce VectorLLM, the first Multi-modal Large Language Model (MLLM) designed for regular building contour extraction from remote sensing images. Unlike existing approaches, VectorLLM performs corner-point by corner-point regression of building contours directly, mimicking human annotators' labeling process. Our architecture consists of a vision foundation backbone, an MLP connector, and an LLM, enhanced with learnable position embeddings to improve spatial understanding capability. Through comprehensive exploration of training strategies including pretraining, supervised fine-tuning, and preference optimization across WHU, WHU-Mix, and CrowdAI datasets, VectorLLM significantly outperformed the previous SOTA methods by 5.6 AP, 7.1 AP, 13.6 AP, respectively in the three datasets. Remarkably, VectorLLM exhibits strong zero-shot performance on unseen objects including aircraft, water bodies, and oil tanks, highlighting its potential for unified modeling of diverse remote sensing object contour extraction tasks. Overall, this work establishes a new paradigm for vector extraction in remote sensing, leveraging the topological reasoning capabilities of LLMs to achieve both high accuracy and exceptional generalization. All the codes and weights will be published for promoting community development.
Chinese: VectorLLM是一种创新的多模态大语言模型,通过角点逐点回归直接从遥感图像中提取建筑轮廓,在多个数据集上实现最优性能,并对未见物体展现出强大的泛化能力。
English: VectorLLM is a novel multi-modal large language model that directly extracts building contours from remote sensing images through corner-point regression, achieving state-of-the-art performance across multiple datasets and demonstrating strong generalization to unseen objects.
Authors:Sadegh Khorasani, Saber Salehkaleybar, Negar Kiyavash, Matthias Grossglauser
Abstract:
Hierarchical reinforcement learning (HRL) improves the efficiency of long-horizon reinforcement-learning tasks with sparse rewards by decomposing the task into a hierarchy of subgoals. The main challenge of HRL is efficient discovery of the hierarchical structure among subgoals and utilizing this structure to achieve the final goal. We address this challenge by modeling the subgoal structure as a causal graph and propose a causal discovery algorithm to learn it. Additionally, rather than intervening on the subgoals at random during exploration, we harness the discovered causal model to prioritize subgoal interventions based on their importance in attaining the final goal. These targeted interventions result in a significantly more efficient policy in terms of the training cost. Unlike previous work on causal HRL, which lacked theoretical analysis, we provide a formal analysis of the problem. Specifically, for tree structures and, for a variant of ErdÅs-Rényi random graphs, our approach results in remarkable improvements. Our experimental results on HRL tasks also illustrate that our proposed framework outperforms existing work in terms of training cost.
中文: 分层强化学习通过将子目标关系建模为因果图,并利用因果发现进行针对性干预,显著提高了长期任务的训练效率和性能。
English: Hierarchical reinforcement learning (HRL) enhances efficiency in long-horizon tasks by modeling subgoal relationships as a causal graph and using targeted interventions based on causal discovery, leading to significant improvements in training cost and performance.
Authors:Ze Yuan, Xin Yu, Yangtian Sun, Yuan-Chen Guo, Yan-Pei Cao, Ding Liang, Xiaojuan Qi
Abstract:
Training native 3D texture generative models remains a fundamental yet challenging problem, largely due to the limited availability of large-scale, high-quality 3D texture datasets. This scarcity hinders generalization to real-world scenarios. To address this, most existing methods finetune foundation image generative models to exploit their learned visual priors. However, these approaches typically generate only multi-view images and rely on post-processing to produce UV texture maps -- an essential representation in modern graphics pipelines. Such two-stage pipelines often suffer from error accumulation and spatial inconsistencies across the 3D surface. In this paper, we introduce SeqTex, a novel end-to-end framework that leverages the visual knowledge encoded in pretrained video foundation models to directly generate complete UV texture maps. Unlike previous methods that model the distribution of UV textures in isolation, SeqTex reformulates the task as a sequence generation problem, enabling the model to learn the joint distribution of multi-view renderings and UV textures. This design effectively transfers the consistent image-space priors from video foundation models into the UV domain. To further enhance performance, we propose several architectural innovations: a decoupled multi-view and UV branch design, geometry-informed attention to guide cross-domain feature alignment, and adaptive token resolution to preserve fine texture details while maintaining computational efficiency. Together, these components allow SeqTex to fully utilize pretrained video priors and synthesize high-fidelity UV texture maps without the need for post-processing. Extensive experiments show that SeqTex achieves state-of-the-art performance on both image-conditioned and text-conditioned 3D texture generation tasks, with superior 3D consistency, texture-geometry alignment, and real-world generalization.
中文摘要:SeqTex是一种端到端框架,通过将任务重新定义为序列生成问题,利用预训练视频基础模型直接生成完整的UV纹理贴图,无需后处理即可在3D纹理生成中实现最优性能。
English Summary: SeqTex is an end-to-end framework that leverages pretrained video foundation models to directly generate complete UV texture maps by reformulating the task as a sequence generation problem, eliminating post-processing needs while achieving state-of-the-art performance in 3D texture generation.
Authors:Dapeng Jiang, Xiangzhe Kong, Jiaqi Han, Mingyu Li, Rui Jiao, Wenbing Huang, Stefano Ermon, Jianzhu Ma, Yang Liu
Abstract:
Cyclic peptides, characterized by geometric constraints absent in linear peptides, offer enhanced biochemical properties, presenting new opportunities to address unmet medical needs. However, designing target-specific cyclic peptides remains underexplored due to limited training data. To bridge the gap, we propose CP-Composer, a novel generative framework that enables zero-shot cyclic peptide generation via composable geometric constraints. Our approach decomposes complex cyclization patterns into unit constraints, which are incorporated into a diffusion model through geometric conditioning on nodes and edges. During training, the model learns from unit constraints and their random combinations in linear peptides, while at inference, novel constraint combinations required for cyclization are imposed as input. Experiments show that our model, despite trained with linear peptides, is capable of generating diverse target-binding cyclic peptides, reaching success rates from 38% to 84% on different cyclization strategies.
Chinese: CP-Composer 是一种生成框架,通过可组合的几何约束在扩散模型中实现零样本生成多样化的靶向结合环肽,尽管仅在线性肽上训练,成功率仍达到 38% 至 84%。
English: CP-Composer is a generative framework that uses composable geometric constraints in a diffusion model to enable zero-shot generation of diverse target-binding cyclic peptides, achieving success rates of 38% to 84% despite being trained only on linear peptides.
Authors:Runar Helin, Ole-Christoffer Granmo, Mayur Kishor Shende, Lei Jiao, Vladimir I. Zadorozhny, Kunal Ganesh Dumbre, Rishad Shafik, Alex Yakovlev
Abstract:
Data modeling using Tsetlin machines (TMs) is all about building logical rules from the data features. The decisions of the model are based on a combination of these logical rules. Hence, the model is fully transparent and it is possible to get explanations of its predictions. In this paper, we present a probability score for TM predictions and develop new techniques for uncertainty quantification to increase the explainability further. The probability score is an inherent property of any TM variant and is derived through an analysis of the TM learning dynamics. Simulated data is used to show a clear connection between the learned TM probability scores and the underlying probabilities of the data. A visualization of the probability scores also reveals that the TM is less confident in its predictions outside the training data domain, which contrasts the typical extrapolation phenomenon found in Artificial Neural Networks. The paper concludes with an application of the uncertainty quantification techniques on an image classification task using the CIFAR-10 dataset, where they provide new insights and suggest possible improvements to current TM image classification models.
Chinese Summary: 本文为Tsetlin机器预测引入概率评分并开发不确定性量化技术,通过模拟数据和CIFAR-10图像分类任务验证了这些方法能有效提升模型可解释性,同时揭示了模型在训练数据域外预测置信度降低的特性。
English Summary: This paper introduces a probability score for Tsetlin machine predictions and develops uncertainty quantification techniques to enhance model explainability, demonstrating their effectiveness through simulated data and image classification on the CIFAR-10 dataset.
Authors:Aleksandr Gushchin, Maksim Smirnov, Dmitriy Vatolin, Anastasia Antsiferova
Abstract:
We propose the LEHA-CVQAD (Large-scale Enriched Human-Annotated Compressed Video Quality Assessment) dataset, which comprises 6,240 clips for compression-oriented video quality assessment. 59 source videos are encoded with 186 codec-preset variants, 1.8M pairwise, and 1.5k MOS ratings are fused into a single quality scale; part of the videos remains hidden for blind evaluation. We also propose Rate-Distortion Alignment Error (RDAE), a novel evaluation metric that quantifies how well VQA models preserve bitrate-quality ordering, directly supporting codec parameter tuning. Testing IQA/VQA methods reveals that popular VQA metrics exhibit high RDAE and lower correlations, underscoring the dataset challenges and utility. The open part and the results of LEHA-CVQAD are available at https://aleksandrgushchin.github.io/lcvqad/
中文:LEHA-CVQAD数据集包含6,240个压缩视频片段及新型RDAE评估指标,通过测试发现现有视频质量评估方法存在明显不足,为编解码器参数优化提供了重要数据支撑。
English: The LEHA-CVQAD dataset introduces 6,240 compressed video clips with extensive annotations and a novel RDAE metric to evaluate video quality models, revealing limitations in current methods while providing resources for codec optimization.
Authors:Shehroz S. Khan, Petar Przulj, Ahmed Ashraf, Ali Abedi
Abstract:
The global demand for radiologists is increasing rapidly due to a growing reliance on medical imaging services, while the supply of radiologists is not keeping pace. Advances in computer vision and image processing technologies present significant potential to address this gap by enhancing radiologists' capabilities and improving diagnostic accuracy. Large language models (LLMs), particularly generative pre-trained transformers (GPTs), have become the primary approach for understanding and generating textual data. In parallel, vision transformers (ViTs) have proven effective at converting visual data into a format that LLMs can process efficiently. In this paper, we present ChestGPT, a deep-learning framework that integrates the EVA ViT with the Llama 2 LLM to classify diseases and localize regions of interest in chest X-ray images. The ViT converts X-ray images into tokens, which are then fed, together with engineered prompts, into the LLM, enabling joint classification and localization of diseases. This approach incorporates transfer learning techniques to enhance both explainability and performance. The proposed method achieved strong global disease classification performance on the VinDr-CXR dataset, with an F1 score of 0.76, and successfully localized pathologies by generating bounding boxes around the regions of interest. We also outline several task-specific prompts, in addition to general-purpose prompts, for scenarios radiologists might encounter. Overall, this framework offers an assistive tool that can lighten radiologists' workload by providing preliminary findings and regions of interest to facilitate their diagnostic process.
Chinese: ChestGPT是一种深度学习框架,通过整合视觉与语言变换器,在胸部X光图像中实现疾病分类和病灶区域定位,其优异表现为放射科医生提供了减轻工作负担的辅助工具。
English: ChestGPT is a deep-learning framework that combines vision and language transformers to classify diseases and localize regions of interest in chest X-rays, achieving strong performance and offering an assistive tool to reduce radiologists' workload.
Authors:Sang Quang Nguyen, Kiet Van Nguyen, Vinh-Tiep Nguyen, Thanh Duc Ngo, Ngan Luu-Thuy Nguyen, Duy-Dinh Le
Abstract:
In this paper, we explore the ability of large language models (LLMs) to plan and make decisions through the lens of the traditional Vietnamese board game, à Än Quan. This game, which involves a series of strategic token movements and captures, offers a unique environment for evaluating the decision-making and strategic capabilities of LLMs. Specifically, we develop various agent personas, ranging from aggressive to defensive, and employ the à Än Quan game as a testbed for assessing LLM performance across different strategies. Through experimentation with models like Llama-3.2-3B-Instruct, Llama-3.1-8B-Instruct, and Llama-3.3-70B-Instruct, we aim to understand how these models execute strategic decision-making, plan moves, and manage dynamic game states. The results will offer insights into the strengths and weaknesses of LLMs in terms of reasoning and strategy, contributing to a deeper understanding of their general capabilities.
本研究通过越南传统棋盘游戏Ô Ăn Quan评估大语言模型的战略规划能力,测试不同智能体角色以揭示其在决策推理中的优势与局限。
This study evaluates the strategic planning abilities of large language models using the traditional Vietnamese game Ô Ăn Quan, testing various agent personas to uncover their reasoning strengths and limitations in decision-making.
Authors:Kazuma Kobayashi, Jaewan Park, Qibang Liu, Seid Koric, Diab Abueidda, Syed Bahauddin Alam
Abstract:
Scientific applications increasingly demand real-time surrogate models that can capture the behavior of strongly coupled multiphysics systems driven by multiple input functions, such as in thermo-mechanical and electro-thermal processes. While neural operator frameworks, such as Deep Operator Networks (DeepONets), have shown considerable success in single-physics settings, their extension to multiphysics problems remains poorly understood. In particular, the challenge of learning nonlinear interactions between tightly coupled physical fields has received little systematic attention. This study addresses a foundational question: should the architectural design of a neural operator reflect the strength of physical coupling it aims to model? To answer this, we present the first comprehensive, architecture-aware evaluation of DeepONet variants across three regimes: single-physics, weakly coupled, and strongly coupled multiphysics systems. We consider a reaction-diffusion equation with dual spatial inputs, a nonlinear thermo-electrical problem with bidirectional coupling through temperature-dependent conductivity, and a viscoplastic thermo-mechanical model of steel solidification governed by transient phase-driven interactions. Two operator-learning frameworks, the classical DeepONet and its sequential GRU-based extension, S-DeepONet, are benchmarked using both single-branch and multi-branch (MIONet-style) architectures. Our results demonstrate that architectural alignment with physical coupling is crucial: single-branch networks significantly outperform multi-branch counterparts in strongly coupled settings, whereas multi-branch encodings offer advantages for decoupled or single-physics problems. Once trained, these surrogates achieve full-field predictions up to 1.8e4 times faster than high-fidelity finite-element solvers, without compromising solution accuracy.
中文: 本研究表明,神经网络算子架构与物理耦合强度的匹配至关重要:单分支网络在强耦合多物理场系统中表现卓越,而多分支设计适用于解耦问题,可在不损失精度的情况下实现比传统求解器快高达1.8万倍的预测速度。
English: This study demonstrates that aligning neural operator architectures with the strength of physical coupling is essential, as single-branch networks excel in strongly coupled multiphysics systems while multi-branch designs benefit decoupled problems, achieving predictions up to 18,000 times faster than traditional solvers without losing accuracy.
Authors:Xingyu Zhu, Shuo Wang, Beier Zhu, Miaoge Li, Yunfan Li, Junfeng Fang, Zhicai Wang, Dongsheng Wang, Hanwang Zhang
Abstract:
With the increasing attention to pre-trained vision-language models (VLMs), \eg, CLIP, substantial efforts have been devoted to many downstream tasks, especially in test-time adaptation (TTA). However, previous works focus on learning prototypes only in the textual modality while overlooking the ambiguous semantics in class names. These ambiguities lead to textual prototypes that are insufficient to capture visual concepts, resulting in limited performance. To address this issue, we introduce \textbf{ProtoMM}, a training-free framework that constructs multimodal prototypes to adapt VLMs during the test time. By viewing the prototype as a discrete distribution over the textual descriptions and visual particles, ProtoMM has the ability to combine the multimodal features for comprehensive prototype learning. More importantly, the visual particles are dynamically updated as the testing stream flows. This allows our multimodal prototypes to continually learn from the data, enhancing their generalizability in unseen scenarios. In addition, we quantify the importance of the prototypes and test images by formulating their semantic distance as an optimal transport problem. Extensive experiments on 15 zero-shot benchmarks demonstrate the effectiveness of our method, achieving a 1.03\% average accuracy improvement over state-of-the-art methods on ImageNet and its variant datasets.
中文:ProtoMM提出了一种无需训练的框架,通过整合文本和视觉特征构建多模态原型,并在测试过程中动态更新,从而提升泛化能力,在零样本基准测试中实现了更优的准确率。
English: ProtoMM introduces a training-free framework that creates multimodal prototypes by integrating textual and visual features, dynamically updating them during testing to improve generalization and achieve superior accuracy on zero-shot benchmarks.
Authors:Xiangxiang Li, Haiyan Wang, Yao Ge, Xiaohong Shen, Yong Liang Guan, Miaowen Wen, Chau Yuen
Abstract:
The recently proposed affine frequency division multiplexing (AFDM) modulation has been considered as a promising technology for narrowband doubly-dispersive channels. However, the time-scaling effects, i.e., pulse widening and pulse shortening phenomena, in extreme wideband doubly-dispersive channels have not been considered in the literatures. In this paper, we investigate such wideband transmission and develop an efficient transmission structure with chirp-periodic prefix (CPP) and chirp-periodic suffix (CPS) for AFDM system. We derive the input-output relationship of AFDM system under time-scaled wideband doubly-dispersive channels and demonstrate the sparsity in discrete affine Fourier (DAF) domain equivalent channels. We further optimize the AFDM chirp parameters to accommodate the time-scaling characteristics in wideband doubly-dispersive channels and verify the superiority of the derived chirp parameters by pairwise error probability (PEP) analysis. We also develop an efficient cross domain distributed orthogonal approximate message passing (CD-D-OAMP) algorithm for AFDM symbol detection and analyze its corresponding state evolution. By analyzing the detection complexity of CD-D-OAMP detector and evaluating the error performance of AFDM systems based on simulations, we demonstrate that the AFDM system with our optimized chirp parameters outperforms the existing competitive modulation schemes in time-scaled wideband doubly-dispersive channels. Moreover, our proposed CD-D-OAMP detector can achieve the desirable trade-off between the complexity and performance, while supporting parallel computing to significantly reduce the computational latency.
Chinese: 本研究提出了一种改进的仿射频分复用系统,通过优化线性调频参数和跨域检测算法,在宽带双弥散信道中展现出优于现有调制方案的性能。
English: The study introduces an enhanced affine frequency division multiplexing (AFDM) system with optimized chirp parameters and a cross-domain detection algorithm, demonstrating superior performance in wideband doubly-dispersive channels compared to existing schemes.
Authors:Dulaji Hidellaarachchi, John Grundy, Rashina Hoda
Abstract:
Humour has long been recognized as a key factor in enhancing creativity, group effectiveness, and employee well-being across various domains. However, its occurrence and impact within software engineering (SE) teams remains under-explored. This paper introduces a comprehensive, literature review-based taxonomy exploring the characterisation and use of humour in SE teams, with the goal of boosting productivity, improving communication, and fostering a positive work environment while emphasising the responsible use of humour to mitigate its potential negative impacts. Drawing from a wide array of studies in psychology, sociology, and organizational behaviour, our proposed framework categorizes humour into distinct theories, styles, models, and scales, offering SE professionals and researchers a structured approach to understanding humour in their work. This study also addresses the unique challenges of applying humour in SE, highlighting its potential benefits while acknowledging the need for further empirical validation in this context. Ultimately, our study aims to pave the way for more cohesive, creative, and psychologically supportive SE environments through the strategic use of humour.
中文: 本文提出了一种用于理解幽默在软件工程团队中作用的分类法,旨在提高生产力和营造积极工作环境,同时强调负责任地运用幽默并需进一步实证验证。
English: This paper presents a taxonomy for understanding humor's role in software engineering teams, aiming to enhance productivity and foster positive work environments while addressing its responsible application and need for further empirical validation.
Authors:Song Mao, Yang Chen, Pinglong Cai, Ding Wang, Guohang Yan, Zhi Yu, Botian Shi
Abstract:
Multimodal Large Language Models (MLLMs) increasingly adopt multiple vision encoders to capture diverse visual information, ranging from coarse semantics to fine grained details. While this approach is intended to enhance visual understanding capability, we observe that the performance gains from adding encoders often diminish and can even lead to performance degradation, a phenomenon we term encoder redundancy. This paper presents a systematic investigation into this issue. Through comprehensive ablation studies on state of the art multi encoder MLLMs, we empirically demonstrate that significant redundancy exists. To quantify each encoder's unique contribution, we propose a principled metric: the Conditional Utilization Rate (CUR). Building on CUR, we introduce the Information Gap (IG) to capture the overall disparity in encoder utility within a model.Our experiments reveal that certain vision encoders contribute little, or even negatively, to overall performance, confirming substantial redundancy. Our experiments reveal that certain vision encoders contribute minimally, or even negatively, to the model's performance, confirming the prevalence of redundancy. These findings highlight critical inefficiencies in current multi encoder designs and establish that our proposed metrics can serve as valuable diagnostic tools for developing more efficient and effective multimodal architectures.
中文: 多模态大语言模型采用多个视觉编码器时普遍存在编码器冗余现象,新增编码器不仅收益递减甚至可能损害性能,本文提出的条件利用率与信息差距指标可有效量化这种冗余。
English: Multimodal Large Language Models using multiple vision encoders often suffer from encoder redundancy, where additional encoders provide diminishing returns or even degrade performance, as measured by the proposed Conditional Utilization Rate and Information Gap metrics.
Authors:Hetvi Shastri, Walid A. Hanafy, Li Wu, David Irwin, Mani Srivastava, Prashant Shenoy
Abstract:
Today's Internet of Things (IoT) has evolved from simple sensing and actuation devices to those with embedded processing and intelligent services, enabling rich collaborations between users and their devices. However, enabling such collaboration becomes challenging when transient devices need to interact with host devices in temporarily visited environments. In such cases, fine-grained access control policies are necessary to ensure secure interactions; however, manually implementing them is often impractical for non-expert users. Moreover, at run-time, the system must automatically configure the devices and enforce such fine-grained access control rules. Additionally, the system must address the heterogeneity of devices.
In this paper, we present CollabIoT, a system that enables secure and seamless device collaboration in transient IoT environments. CollabIoT employs a Large language Model (LLM)-driven approach to convert users' high-level intents to fine-grained access control policies. To support secure and seamless device collaboration, CollabIoT adopts capability-based access control for authorization and uses lightweight proxies for policy enforcement, providing hardware-independent abstractions.
We implement a prototype of CollabIoT's policy generation and auto configuration pipelines and evaluate its efficacy on an IoT testbed and in large-scale emulated environments. We show that our LLM-based policy generation pipeline is able to generate functional and correct policies with 100% accuracy. At runtime, our evaluation shows that our system configures new devices in ~150 ms, and our proxy-based data plane incurs network overheads of up to 2 ms and access control overheads up to 0.3 ms.
Chinese: CollabIoT系统采用大语言模型将用户高级意图自动转化为细粒度访问控制策略,在瞬态物联网环境中实现安全无缝的设备协作,同时保持较低的配置和访问控制开销。
English: CollabIoT is a system that uses a Large Language Model to automatically convert user intents into fine-grained access control policies, enabling secure and seamless device collaboration in transient IoT environments with minimal configuration and access control overhead.
Authors:Shirley Wu, Parth Sarthi, Shiyu Zhao, Aaron Lee, Herumb Shandilya, Adrian Mladenic Grobelnik, Nurendra Choudhary, Eddie Huang, Karthik Subbian, Linjun Zhang, Diyi Yang, James Zou, Jure Leskovec
Abstract:
Compound AI systems integrating multiple components, such as Large Language Models, specialized tools, and traditional machine learning models, are increasingly deployed to solve complex real-world tasks. However, optimizing compound systems remains challenging due to their non-differentiable structures and diverse configuration types across components, including prompts, hyperparameters, and model parameters. To address this challenge, we propose Optimas, a unified framework for effective optimization of compound systems. The core idea of Optimas is to maintain one Local Reward Function (LRF) per component, each satisfying a local-global alignment property, i.e., each component's local reward correlates with the global system performance. In each iteration, Optimas efficiently adapts the LRFs to maintain this property while simultaneously maximizing each component's local reward. This approach enables independent updates of heterogeneous configurations using the designated optimization method, while ensuring that local improvements consistently lead to performance gains. We present extensive evaluations across five real-world compound systems to demonstrate that Optimas outperforms strong baselines by an average improvement of 11.92%, offering a general and effective approach for improving compound systems. Our website is at https://optimas.stanford.edu.
中文摘要:Optimas框架通过为复合AI系统中的每个组件引入局部奖励函数,在保持局部与全局性能一致性的前提下实现各组件独立优化,从而有效提升整体系统表现。
English Summary: The Optimas framework introduces Local Reward Functions for each component in compound AI systems, enabling independent optimization while ensuring local improvements consistently enhance overall performance.
Authors:Shirley Wu, Parth Sarthi, Shiyu Zhao, Aaron Lee, Herumb Shandilya, Adrian Mladenic Grobelnik, Nurendra Choudhary, Eddie Huang, Karthik Subbian, Linjun Zhang, Diyi Yang, James Zou, Jure Leskovec
Abstract:
Compound AI systems integrating multiple components, such as Large Language Models, specialized tools, and traditional machine learning models, are increasingly deployed to solve complex real-world tasks. However, optimizing compound systems remains challenging due to their non-differentiable structures and diverse configuration types across components, including prompts, hyperparameters, and model parameters. To address this challenge, we propose Optimas, a unified framework for effective optimization of compound systems. The core idea of Optimas is to maintain one Local Reward Function (LRF) per component, each satisfying a local-global alignment property, i.e., each component's local reward correlates with the global system performance. In each iteration, Optimas efficiently adapts the LRFs to maintain this property while simultaneously maximizing each component's local reward. This approach enables independent updates of heterogeneous configurations using the designated optimization method, while ensuring that local improvements consistently lead to performance gains. We present extensive evaluations across five real-world compound systems to demonstrate that Optimas outperforms strong baselines by an average improvement of 11.92%, offering a general and effective approach for improving compound systems. Our website is at https://optimas.stanford.edu.
中文摘要:Optimas框架通过为复合AI系统中的每个组件引入局部奖励函数,在保持局部与全局性能一致性的前提下实现各组件独立优化,从而有效提升整体系统表现。
English Summary: The Optimas framework introduces Local Reward Functions for each component in compound AI systems, enabling independent optimization while ensuring local improvements consistently enhance overall performance.
Authors:Chaozhuo Li, Pengbo Wang, Chenxu Wang, Litian Zhang, Zheng Liu, Qiwei Ye, Yuanbo Xu, Feiran Huang, Xi Zhang, Philip S. Yu
Abstract:
Edgar Allan Poe noted, "Truth often lurks in the shadow of error," highlighting the deep complexity intrinsic to the interplay between truth and falsehood, notably under conditions of cognitive and informational asymmetry. This dynamic is strikingly evident in large language models (LLMs). Despite their impressive linguistic generation capabilities, LLMs sometimes produce information that appears factually accurate but is, in reality, fabricated, an issue often referred to as 'hallucinations'. The prevalence of these hallucinations can mislead users, affecting their judgments and decisions. In sectors such as finance, law, and healthcare, such misinformation risks causing substantial economic losses, legal disputes, and health risks, with wide-ranging consequences.In our research, we have methodically categorized, analyzed the causes, detection methods, and solutions related to LLM hallucinations. Our efforts have particularly focused on understanding the roots of hallucinations and evaluating the efficacy of current strategies in revealing the underlying logic, thereby paving the way for the development of innovative and potent approaches. By examining why certain measures are effective against hallucinations, our study aims to foster a comprehensive approach to tackling this issue within the domain of LLMs.
Chinese: 大型语言模型有时会产生被称为“幻觉”的虚构信息,可能误导用户并在关键领域造成重大风险,因此研究其成因、检测方法和解决方案,以开发有效的应对策略。
English: Large language models sometimes generate fabricated information known as "hallucinations," which can mislead users and cause significant risks in critical sectors, prompting research into their causes, detection, and solutions to develop effective countermeasures.
Authors:Xin Guan, PeiHsin Lin, Zekun Wu, Ze Wang, Ruibo Zhang, Emre Kazim, Adriano Koshiyama
Abstract:
Multiperspective Fusion (MPF) is a novel posttraining alignment framework for large language models (LLMs) developed in response to the growing need for easy bias mitigation. Built on top of the SAGED pipeline, an automated system for constructing bias benchmarks and extracting interpretable baseline distributions, MPF leverages multiperspective generations to expose and align biases in LLM outputs with nuanced, humanlike baselines. By decomposing baseline, such as sentiment distributions from HR professionals, into interpretable perspective components, MPF guides generation through sampling and balancing of responses, weighted by the probabilities obtained in the decomposition. Empirically, we demonstrate its ability to align LLM sentiment distributions with both counterfactual baselines (absolute equality) and the HR baseline (biased for Top Univeristy), resulting in small KL divergence, reduction of calibration error and generalization to unseen questions. This shows that MPF offers a scalable and interpretable method for alignment and bias mitigation, compatible with deployed LLMs and requiring no extensive prompt engineering or finetuning.
中文: MPF是一种可扩展的后训练对齐框架,通过多视角融合将大语言模型的输出与可解释的人类基线对齐,无需提示工程或微调即可有效减轻偏见。
English: MPF is a scalable post-training framework that mitigates biases in LLMs by aligning their outputs with interpretable human baselines through multiperspective fusion, requiring no prompt engineering or fine-tuning.
Authors:Yuzhu Wang, Archontis Politis, Konstantinos Drossos, Tuomas Virtanen
Abstract:
Current deep neural network (DNN) based speech separation faces a fundamental challenge -- while the models need to be trained on short segments due to computational constraints, real-world applications typically require processing significantly longer recordings with multiple utterances per speaker than seen during training. In this paper, we investigate how existing approaches perform in this challenging scenario and propose a frequency-temporal recurrent neural network (FTRNN) that effectively bridges this gap. Our FTRNN employs a full-band module to model frequency dependencies within each time frame and a sub-band module that models temporal patterns in each frequency band. Despite being trained on short fixed-length segments of 10 s, our model demonstrates robust separation when processing signals significantly longer than training segments (21-121 s) and preserves speaker association across utterance gaps exceeding those seen during training. Unlike the conventional segment-separation-stitch paradigm, our lightweight approach (0.9 M parameters) performs inference on long audio without segmentation, eliminating segment boundary distortions while simplifying deployment. Experimental results demonstrate the generalization ability of FTRNN for multi-utterance speech separation and speaker association.
中文摘要:现有深度神经网络语音分离模型因训练片段短而难以处理长录音,我们提出的频率-时序循环神经网络通过分别建模频率依赖性和时序模式,无需分段即可实现长音频的稳健分离并保持说话人关联。
English Summary: Current DNN speech separation models struggle with processing long recordings due to training on short segments, but our proposed Frequency-Temporal RNN effectively bridges this gap by modeling frequency and temporal dependencies separately, enabling robust separation of longer audio without segmentation.
Authors:Subhasis Dasgupta, Jaydip Sen, Tuhina Halder
Abstract:
Structural crack detection is a critical task for public safety as it helps in preventing potential structural failures that could endanger lives. Manual detection by inexperienced personnel can be slow, inconsistent, and prone to human error, which may compromise the reliability of assessments. The current study addresses these challenges by introducing a novel deep-learning architecture designed to enhance the accuracy and efficiency of structural crack detection. In this research, various configurations of residual U-Net models were utilized. These models, due to their robustness in capturing fine details, were further integrated into an ensemble with a meta-model comprising convolutional blocks. This unique combination aimed to boost prediction efficiency beyond what individual models could achieve. The ensemble's performance was evaluated against well-established architectures such as SegNet and the traditional U-Net. Results demonstrated that the residual U-Net models outperformed their predecessors, particularly with low-resolution imagery, and the ensemble model exceeded the performance of individual models, proving it as the most effective. The assessment was based on the Intersection over Union (IoU) metric and DICE coefficient. The ensemble model achieved the highest scores, signifying superior accuracy. This advancement suggests way for more reliable automated systems in structural defects monitoring tasks.
中文: 本研究提出了一种结合残差U-Net架构与元模型的新型深度学习集成模型,在低分辨率图像处理中表现出色,其裂缝检测精度和效率均超越现有方法。
English: This study introduces a novel deep-learning ensemble model combining residual U-Net architectures with a meta-model, which demonstrated superior crack detection accuracy and efficiency compared to existing methods, particularly with low-resolution images.
Authors:Anurag Arnab, Ahmet Iscen, Mathilde Caron, Alireza Fathi, Cordelia Schmid
Abstract:
Despite recent advances in Vision-Language Models (VLMs), long-video understanding remains a challenging problem. Although state-of-the-art long-context VLMs can process around 1000 input frames, they still struggle to effectively leverage this sequence length, and succumb to irrelevant distractors within the context window. We present Temporal Chain of Thought, an inference strategy for video question-answering that curates the model's input context. We use the VLM itself to iteratively identify and extract the most relevant frames from the video, which are then used for answering. We demonstrate how leveraging more computation at inference-time to select the most relevant context leads to improvements in accuracy, in agreement with recent work on inference-time scaling of LLMs. Moreover, we achieve state-of-the-art results on 4 diverse video question-answering datasets, showing consistent improvements with 3 different VLMs. In particular, our method shines on longer videos which would not otherwise fit within the model's context window: On longer videos of more than 1 hour on LVBench, our approach using a context window of 32K outperforms the same VLM using standard inference with a 700K context window by 2.8 points.
中文: 本文提出时序思维链推理策略,通过视觉语言模型迭代筛选关键视频帧来提升长视频问答性能,在多个数据集上实现了最先进的结果。
English: The paper introduces Temporal Chain of Thought, an inference strategy that uses the VLM to iteratively select relevant video frames for improved long-video question-answering, achieving state-of-the-art results across multiple datasets.
Authors:Yi Ru Wang, Carter Ung, Grant Tannert, Jiafei Duan, Josephine Li, Amy Le, Rishabh Oswal, Markus Grotz, Wilbert Pumacay, Yuquan Deng, Ranjay Krishna, Dieter Fox, Siddhartha Srinivasa
Abstract:
We present RoboEval, a simulation benchmark and structured evaluation framework designed to reveal the limitations of current bimanual manipulation policies. While prior benchmarks report only binary task success, we show that such metrics often conceal critical weaknesses in policy behavior -- such as poor coordination, slipping during grasping, or asymmetric arm usage. RoboEval introduces a suite of tiered, semantically grounded tasks decomposed into skill-specific stages, with variations that systematically challenge spatial, physical, and coordination capabilities. Tasks are paired with fine-grained diagnostic metrics and 3000+ human demonstrations to support imitation learning. Our experiments reveal that policies with similar success rates diverge in how tasks are executed -- some struggle with alignment, others with temporally consistent bimanual control. We find that behavioral metrics correlate with success in over half of task-metric pairs, and remain informative even when binary success saturates. By pinpointing when and how policies fail, RoboEval enables a deeper, more actionable understanding of robotic manipulation -- and highlights the need for evaluation tools that go beyond success alone.
Chinese: RoboEval是一个模拟基准和结构化评估框架,通过分层任务和精细指标揭示双手操作策略的关键缺陷,超越二元成功率,为策略失败提供可操作的深入理解。
English: RoboEval is a simulation benchmark and structured evaluation framework that uncovers critical weaknesses in bimanual manipulation policies through tiered tasks and fine-grained metrics, moving beyond binary success rates to provide actionable insights into policy failures.
Authors:Maggie Huan, Yuetai Li, Tuney Zheng, Xiaoyu Xu, Seungone Kim, Minxin Du, Radha Poovendran, Graham Neubig, Xiang Yue
Abstract:
Math reasoning has become the poster child of progress in large language models (LLMs), with new models rapidly surpassing human-level performance on benchmarks like MATH and AIME. But as math leaderboards improve week by week, it is worth asking: do these gains reflect broader problem-solving ability or just narrow overfitting? To answer this question, we evaluate over 20 open-weight reasoning-tuned models across a broad suite of tasks, including math, scientific QA, agent planning, coding, and standard instruction-following. We surprisingly find that most models that succeed in math fail to transfer their gains to other domains. To rigorously study this phenomenon, we conduct controlled experiments on Qwen3-14B models using math-only data but different tuning methods. We find that reinforcement learning (RL)-tuned models generalize well across domains, while supervised fine-tuning (SFT)-tuned models often forget general capabilities. Latent-space representation and token-space distribution shift analyses reveal that SFT induces substantial representation and output drift, while RL preserves general-domain structure. Our results suggest a need to rethink standard post-training recipes, particularly the reliance on SFT-distilled data for advancing reasoning models.
Chinese: 研究表明,尽管许多语言模型在数学基准测试中表现出色,但其能力无法迁移到科学或编程等其他领域,而强化学习相比监督微调更能保持广泛的解决问题的能力。
English: The study reveals that while many language models excel in math benchmarks, their performance does not generalize to other domains like science or coding, with reinforcement learning proving more effective than supervised fine-tuning for maintaining broader problem-solving capabilities.
Authors:Dhruv Agarwal, Bodhisattwa Prasad Majumder, Reece Adamson, Megha Chakravorty, Satvika Reddy Gavireddy, Aditya Parashar, Harshit Surana, Bhavana Dalvi Mishra, Andrew McCallum, Ashish Sabharwal, Peter Clark
Abstract:
The promise of autonomous scientific discovery (ASD) hinges not only on answering questions, but also on knowing which questions to ask. Most recent works in ASD explore the use of large language models (LLMs) in goal-driven settings, relying on human-specified research questions to guide hypothesis generation. However, scientific discovery may be accelerated further by allowing the AI system to drive exploration by its own criteria. The few existing approaches in open-ended ASD select hypotheses based on diversity heuristics or subjective proxies for human interestingness, but the former struggles to meaningfully navigate the typically vast hypothesis space, and the latter suffers from imprecise definitions. This paper presents AutoDS -- a method for open-ended ASD that instead drives scientific exploration using Bayesian surprise. Here, we quantify the epistemic shift from the LLM's prior beliefs about a hypothesis to its posterior beliefs after gathering experimental results. To efficiently explore the space of nested hypotheses, our method employs a Monte Carlo tree search (MCTS) strategy with progressive widening using surprisal as the reward function. We evaluate AutoDS in the setting of data-driven discovery across 21 real-world datasets spanning domains such as biology, economics, finance, and behavioral science. Our results demonstrate that under a fixed budget, AutoDS substantially outperforms competitors by producing 5--29\% more discoveries deemed surprising by the LLM. Our human evaluation further finds that two-thirds of AutoDS discoveries are surprising to the domain experts, suggesting this is an important step forward towards building open-ended ASD systems.
中文摘要:AutoDS通过利用贝叶斯惊喜引导AI自主假设探索,在开放自主科学发现领域取得突破,其产生的发现不仅令AI模型惊讶,更有三分之二令领域专家感到意外,显著优于现有方法。
English Summary: AutoDS advances autonomous scientific discovery by using Bayesian surprise to guide AI-driven hypothesis exploration, significantly outperforming existing methods in generating discoveries that are surprising to both AI models and human experts.
Authors:Bowen Zhang, Sicheng Xu, Chuxin Wang, Jiaolong Yang, Feng Zhao, Dong Chen, Baining Guo
Abstract:
In this paper, we present a novel framework for video-to-4D generation that creates high-quality dynamic 3D content from single video inputs. Direct 4D diffusion modeling is extremely challenging due to costly data construction and the high-dimensional nature of jointly representing 3D shape, appearance, and motion. We address these challenges by introducing a Direct 4DMesh-to-GS Variation Field VAE that directly encodes canonical Gaussian Splats (GS) and their temporal variations from 3D animation data without per-instance fitting, and compresses high-dimensional animations into a compact latent space. Building upon this efficient representation, we train a Gaussian Variation Field diffusion model with temporal-aware Diffusion Transformer conditioned on input videos and canonical GS. Trained on carefully-curated animatable 3D objects from the Objaverse dataset, our model demonstrates superior generation quality compared to existing methods. It also exhibits remarkable generalization to in-the-wild video inputs despite being trained exclusively on synthetic data, paving the way for generating high-quality animated 3D content. Project page: https://gvfdiffusion.github.io/.
本文提出了一种新颖的视频转4D框架,通过采用在合成数据上训练的高斯变化场扩散模型,从单视频生成动态3D内容,实现了卓越的生成质量和对真实场景输入的泛化能力。
This paper introduces a novel video-to-4D framework that generates dynamic 3D content from single videos by employing a Gaussian Variation Field diffusion model trained on synthetic data, achieving superior quality and generalization to real-world inputs.
Authors:Li Siyao, Yao Feng, Omid Taheri, Chen Change Loy, Michael J. Black
Abstract:
While current general-purpose 3D human models (e.g., SMPL-X) efficiently represent accurate human shape and pose, they lacks the ability to physically interact with the environment due to the kinematic nature. As a result, kinematic-based interaction models often suffer from issues such as interpenetration and unrealistic object dynamics. To address this limitation, we introduce a novel approach that embeds SMPL-X into a tangible entity capable of dynamic physical interactions with its surroundings. Specifically, we propose a "half-physics" mechanism that transforms 3D kinematic motion into a physics simulation. Our approach maintains kinematic control over inherent SMPL-X poses while ensuring physically plausible interactions with scenes and objects, effectively eliminating penetration and unrealistic object dynamics. Unlike reinforcement learning-based methods, which demand extensive and complex training, our half-physics method is learning-free and generalizes to any body shape and motion; meanwhile, it operates in real time. Moreover, it preserves the fidelity of the original kinematic motion while seamlessly integrating physical interactions
中文: 本研究提出一种“半物理”机制,将运动学SMPL-X人体模型转化为能够实时与环境进行物理交互的实体,无需复杂训练即可消除穿模等问题,同时保持原始运动的高保真度。
English: This study introduces a "half-physics" mechanism that transforms the kinematic SMPL-X human model into a tangible entity capable of real-time, physically plausible interactions with environments, eliminating issues like interpenetration without requiring complex training.
Authors:Long Phan, Mantas Mazeika, Andy Zou, Dan Hendrycks
Abstract:
Evaluating AI agents within complex, interactive environments that mirror real-world challenges is critical for understanding their practical capabilities. While existing agent benchmarks effectively assess skills like tool use or performance on structured tasks, they often do not fully capture an agent's ability to operate autonomously in exploratory environments that demand sustained, self-directed reasoning over a long and growing context. To enable a more accurate assessment of AI agents in challenging exploratory environments, we introduce TextQuests, a benchmark based on the Infocom suite of interactive fiction games. These text-based adventures, which can take human players over 30 hours and require hundreds of precise actions to solve, serve as an effective proxy for evaluating AI agents on focused, stateful tasks. The benchmark is specifically designed to assess an LLM agent's capacity for self-contained problem-solving by precluding the use of external tools, thereby focusing on intrinsic long-context reasoning capabilities in an exploratory environment characterized by the need for trial-and-error learning and sustained problem-solving within a single interactive session. We release TextQuests at https://textquests.ai.
中文摘要:TextQuests作为基于互动小说的新型基准,旨在评估AI智能体在无需外部工具的探索性环境中自主解决问题和进行长上下文推理的能力。
English Summary: TextQuests is introduced as a new benchmark using interactive fiction games to evaluate AI agents' autonomous problem-solving and long-context reasoning in exploratory environments without external tools.
Authors:Saleh Vatan Khah, Savelii Chezhegov, Shahrokh Farahmand, Samuel Horváth, Eduard Gorbunov
Abstract:
Gradient clipping is a fundamental tool in Deep Learning, improving the high-probability convergence of stochastic first-order methods like SGD, AdaGrad, and Adam under heavy-tailed noise, which is common in training large language models. It is also a crucial component of Differential Privacy (DP) mechanisms. However, existing high-probability convergence analyses typically require the clipping threshold to increase with the number of optimization steps, which is incompatible with standard DP mechanisms like the Gaussian mechanism. In this work, we close this gap by providing the first high-probability convergence analysis for DP-Clipped-SGD with a fixed clipping level, applicable to both convex and non-convex smooth optimization under heavy-tailed noise, characterized by a bounded central $α$-th moment assumption, $α\in (1,2]$. Our results show that, with a fixed clipping level, the method converges to a neighborhood of the optimal solution with a faster rate than the existing ones. The neighborhood can be balanced against the noise introduced by DP, providing a refined trade-off between convergence speed and privacy guarantees.
中文: 梯度裁剪在重尾噪声下提升随机优化方法的高概率收敛性,且对差分隐私至关重要;本研究首次针对固定裁剪水平的DP-Clipped-SGD提供了高概率收敛分析,实现了更快的收敛速度并优化了隐私与性能的权衡。
English: Gradient clipping enhances the convergence of stochastic optimization methods under heavy-tailed noise and is vital for differential privacy, with this study presenting the first high-probability convergence analysis for DP-Clipped-SGD using a fixed clipping level, achieving faster rates and a refined privacy-performance trade-off.
Authors:Mutian Xu, Chongjie Ye, Haolin Liu, Yushuang Wu, Jiahao Chang, Xiaoguang Han
Abstract:
3D data simulation aims to bridge the gap between simulated and real-captured 3D data, which is a fundamental problem for real-world 3D visual tasks. Most 3D data simulation methods inject predefined physical priors but struggle to capture the full complexity of real data. An optimal approach involves learning an implicit mapping from synthetic to realistic data in a data-driven manner, but progress in this solution has met stagnation in recent studies. This work explores a new solution path of data-driven 3D simulation, called Stable-Sim2Real, based on a novel two-stage depth diffusion model. The initial stage finetunes Stable-Diffusion to generate the residual between the real and synthetic paired depth, producing a stable but coarse depth, where some local regions may deviate from realistic patterns. To enhance this, both the synthetic and initial output depth are fed into a second-stage diffusion, where diffusion loss is adjusted to prioritize these distinct areas identified by a 3D discriminator. We provide a new benchmark scheme to evaluate 3D data simulation methods. Extensive experiments show that training the network with the 3D simulated data derived from our method significantly enhances performance in real-world 3D visual tasks. Moreover, the evaluation demonstrates the high similarity between our 3D simulated data and real-captured patterns. Project page: https://mutianxu.github.io/stable-sim2real/.
Chinese Summary: 本文提出Stable-Sim2Real双阶段深度扩散模型,通过首阶段生成深度残差修正、次阶段优化局部区域,显著提升了3D数据模拟质量,并在真实3D视觉任务中验证了其优越性能。
English Summary: This paper introduces Stable-Sim2Real, a two-stage depth diffusion model that enhances 3D data simulation by first generating residual depth corrections and then refining local regions, demonstrating improved performance in real-world 3D visual tasks.
Authors:David Komnick, Kathrin Lammers, Barbara Hammer, Valerie Vaquet, Fabian Hinder
Abstract:
In a world that constantly changes, it is crucial to understand how those changes impact different systems, such as industrial manufacturing or critical infrastructure. Explaining critical changes, referred to as concept drift in the field of machine learning, is the first step towards enabling targeted interventions to avoid or correct model failures, as well as malfunctions and errors in the physical world. Therefore, in this work, we extend model-based drift explanations towards causal explanations, which increases the actionability of the provided explanations. We evaluate our explanation strategy on a number of use cases, demonstrating the practical usefulness of our framework, which isolates the causally relevant features impacted by concept drift and, thus, allows for targeted intervention.
中文: 本研究将机器学习中的概念漂移解释从基于模型的方法推进到因果解释,提高了识别和处理现实系统中漂移相关问题的实际操作性。
English: This study advances concept drift explanations in machine learning by shifting from model-based to causal explanations, enhancing their practicality for identifying and addressing drift-related issues in real-world systems.
Authors:Sophie Kearney, Shu Yang, Zixuan Wen, Bojian Hou, Duy Duong-Tran, Tianlong Chen, Jason Moore, Marylyn Ritchie, Li Shen
Abstract:
Early and accurate diagnosis of Alzheimer's disease (AD), a complex neurodegenerative disorder, requires analysis of heterogeneous biomarkers (e.g., neuroimaging, genetic risk factors, cognitive tests, and cerebrospinal fluid proteins) typically represented in a tabular format. With flexible few-shot reasoning, multimodal integration, and natural-language-based interpretability, large language models (LLMs) offer unprecedented opportunities for prediction with structured biomedical data. We propose a novel framework called TAP-GPT, Tabular Alzheimer's Prediction GPT, that adapts TableGPT2, a multimodal tabular-specialized LLM originally developed for business intelligence tasks, for AD diagnosis using structured biomarker data with small sample sizes. Our approach constructs few-shot tabular prompts using in-context learning examples from structured biomedical data and finetunes TableGPT2 using the parameter-efficient qLoRA adaption for a clinical binary classification task of AD or cognitively normal (CN). The TAP-GPT framework harnesses the powerful tabular understanding ability of TableGPT2 and the encoded prior knowledge of LLMs to outperform more advanced general-purpose LLMs and a tabular foundation model (TFM) developed for prediction tasks. To our knowledge, this is the first application of LLMs to the prediction task using tabular biomarker data, paving the way for future LLM-driven multi-agent frameworks in biomedical informatics.
中文: TAP-GPT框架通过适配TableGPT2模型,利用结构化生物标志物数据进行阿尔茨海默病诊断,借助小样本学习和高效微调技术,其性能超越了通用大语言模型和专用表格预测模型。
English: The TAP-GPT framework adapts TableGPT2 for Alzheimer's disease diagnosis using structured biomarker data, outperforming advanced general-purpose LLMs and a specialized tabular model through few-shot learning and efficient fine-tuning.
Authors:Zhenyu Pan, Yutong Zhang, Jianshu Zhang, Haoran Lu, Haozheng Luo, Yuwei Han, Philip S. Yu, Manling Li, Han Liu
Abstract:
Multimodal Large Language Models (MLLMs) already achieve state-of-the-art results across a wide range of tasks and modalities. To push their reasoning ability further, recent studies explore advanced prompting schemes and post-training fine-tuning. Although these techniques improve logical accuracy, they frequently leave the models' outputs burdened with pronounced social biases. Clarifying how reasoning gains interact with bias mitigation-and whether the two objectives inherently trade off-therefore remains an open and pressing research problem. Our study begins by benchmarking three bias-mitigation strategies-supervised fine-uning (SFT), knowledge distillation (KD), and rule-based reinforcement learning (RL)-under identical conditions, establishing their baseline strengths and weaknesses. Building on these results, we vary the proportion of debias-focused and reasoning-centric samples within each paradigm to chart the reasoning-versus-bias trade-off. Our sweeps reveal a consistent sweet spot: a roughly 1:4 mix trained with reinforcement learning cuts stereotype scores by 10% while retaining 88% of the model's original reasoning accuracy, offering concrete guidance for balancing fairness and capability in MLLMs.
中文: 多模态大语言模型在推理能力上表现卓越但存在社会偏见,研究发现通过强化学习以1:4的比例混合去偏见与推理训练,能在保持88%推理准确性的同时降低10%的刻板印象,为平衡公平性与能力提供了可行方案。
English: Multimodal Large Language Models (MLLMs) show advanced reasoning but suffer from social biases, with a study finding that a 1:4 debias-to-reasoning training ratio using reinforcement learning reduces stereotypes by 10% while maintaining 88% reasoning accuracy, offering a balanced approach to fairness and capability.
Authors:Mohamed Sana, Nicola Piovesan, Antonio De Domenico, Yibin Kang, Haozhe Zhang, Merouane Debbah, Fadhel Ayed
Abstract:
Root Cause Analysis (RCA) in mobile networks remains a challenging task due to the need for interpretability, domain expertise, and causal reasoning. In this work, we propose a lightweight framework that leverages Large Language Models (LLMs) for RCA. To do so, we introduce TeleLogs, a curated dataset of annotated troubleshooting problems designed to benchmark RCA capabilities. Our evaluation reveals that existing open-source reasoning LLMs struggle with these problems, underscoring the need for domain-specific adaptation. To address this issue, we propose a two-stage training methodology that combines supervised fine-tuning with reinforcement learning to improve the accuracy and reasoning quality of LLMs. The proposed approach fine-tunes a series of RCA models to integrate domain knowledge and generate structured, multi-step diagnostic explanations, improving both interpretability and effectiveness. Extensive experiments across multiple LLM sizes show significant performance gains over state-of-the-art reasoning and non-reasoning models, including strong generalization to randomized test variants. These results demonstrate the promise of domain-adapted, reasoning-enhanced LLMs for practical and explainable RCA in network operation and management.
中文摘要:本文提出了一种利用大型语言模型进行移动网络根因分析的轻量级框架,通过两阶段训练方法增强领域适应性,显著提升了诊断准确性和可解释性。
English Summary: This paper introduces a lightweight framework using Large Language Models (LLMs) for Root Cause Analysis in mobile networks, enhanced by a two-stage training method that significantly improves diagnostic accuracy and interpretability through domain-specific adaptation.
Authors:Yufei Li, Zexin Li, Yinglun Zhu, Cong Liu
Abstract:
Modern deployment of large language models (LLMs) frequently involves both inference serving and continuous retraining to stay aligned with evolving data and user feedback. Common practices separate these workloads onto distinct servers in isolated phases, causing substantial inefficiencies (e.g., GPU idleness) and delayed adaptation to new data in distributed settings. Our empirical analysis reveals that these inefficiencies stem from dynamic request arrivals during serving and workload heterogeneity in pipeline-parallel training. To address these challenges, we propose LeMix, a system for co-locating and managing concurrent LLM serving and training workloads. LeMix integrates offline profiling, execution prediction mechanisms, and runtime scheduling to dynamically adapt resource allocation based on workload characteristics and system conditions. By understanding task-specific behaviors and co-execution interference across shared nodes, LeMix improves utilization and serving quality without compromising serving responsiveness. Our evaluation shows that LeMix improves throughput by up to 3.53x, reduces inference loss by up to 0.61x, and delivers up to 2.12x higher response time SLO attainment over traditional separate setups. To our knowledge, this is the first work to uncover and exploit the opportunities of joint LLM inference and training, paving the way for more resource-efficient deployment of LLMs in production environments.
中文摘要:LeMix系统通过动态资源管理协同调度大语言模型的推理服务与训练任务,实现了高达3.53倍的吞吐量提升,在降低推理损失的同时保障了服务响应质量。
English Summary: LeMix is a novel system that efficiently co-locates LLM inference serving and training workloads through dynamic resource management, achieving up to 3.53x higher throughput and improved responsiveness while reducing inference loss.
Authors:Joan MartÃnez Canals, Francesco Devoti, Vincenzo Sciancalepore, Marco Di Renzo, Xavier Costa-Pérez
Abstract:
Beam shaping techniques enable tailored beam trajectories, offering unprecedented connectivity opportunities in wireless communications. Current approaches rely on flat apertures, which limit trajectory flexibility due to inherent geometric constraints. To overcome such restrictions, we propose adopting curved apertures as a more versatile alternative for beam shaping. We introduce a novel formulation for wave trajectory engineering compatible with arbitrarily shaped apertures. Theoretical and numerical analyses demonstrate that curved apertures offer improved control over wave propagation, are more resilient to phase control constraints, and achieve higher power density across a wider portion of the desired beam trajectory than flat apertures.
中文: 相比平面孔径,曲面孔径在波束成形中能提供更优的轨迹控制、更强的相位约束适应性和更广范围内更高的功率密度。
English: Curved apertures provide superior beam shaping with enhanced trajectory control, resilience to phase constraints, and higher power density compared to flat apertures.
Authors:Zhipeng Tang, Sha Zhang, Jiajun Deng, Chenjie Wang, Guoliang You, Yuting Huang, Xinrui Lin, Yanyong Zhang
Abstract:
Integrating large language models (LLMs) into autonomous driving motion planning has recently emerged as a promising direction, offering enhanced interpretability, better controllability, and improved generalization in rare and long-tail scenarios. However, existing methods often rely on abstracted perception or map-based inputs, missing crucial visual context, such as fine-grained road cues, accident aftermath, or unexpected obstacles, which are essential for robust decision-making in complex driving environments. To bridge this gap, we propose VLMPlanner, a hybrid framework that combines a learning-based real-time planner with a vision-language model (VLM) capable of reasoning over raw images. The VLM processes multi-view images to capture rich, detailed visual information and leverages its common-sense reasoning capabilities to guide the real-time planner in generating robust and safe trajectories. Furthermore, we develop the Context-Adaptive Inference Gate (CAI-Gate) mechanism that enables the VLM to mimic human driving behavior by dynamically adjusting its inference frequency based on scene complexity, thereby achieving an optimal balance between planning performance and computational efficiency. We evaluate our approach on the large-scale, challenging nuPlan benchmark, with comprehensive experimental results demonstrating superior planning performance in scenarios with intricate road conditions and dynamic elements. Code will be available.
中文:VLMPlanner是一种混合自动驾驶框架,通过结合视觉语言模型与实时规划器处理多视角图像,在复杂场景中提升决策能力,并采用动态推理机制实现性能与效率的最优平衡。
English: VLMPlanner is a hybrid autonomous driving framework that integrates a vision-language model with a real-time planner to process multi-view images for enhanced decision-making in complex scenarios, featuring a dynamic inference mechanism for optimal performance and efficiency.
Authors:Xiang Fei, Siqi Wang, Shu Wei, Yuxiang Nie, Wei Shi, Hao Feng, Chao Feng, Can Huang
Abstract:
Current language model training paradigms typically terminate learning upon reaching the end-of-sequence () token, overlooking the potential learning opportunities in the post-completion space. We propose Post-Completion Learning (PCL), a novel training framework that systematically utilizes the sequence space after model output completion, to enhance both the reasoning and self-evaluation abilities. PCL enables models to continue generating self-assessments and reward predictions during training, while maintaining efficient inference by stopping at the completion point.
To fully utilize this post-completion space, we design a white-box reinforcement learning method: let the model evaluate the output content according to the reward rules, then calculate and align the score with the reward functions for supervision. We implement dual-track SFT to optimize both reasoning and evaluation capabilities, and mixed it with RL training to achieve multi-objective hybrid optimization.
Experimental results on different datasets and models demonstrate consistent improvements over traditional SFT and RL methods. Our method provides a new technical path for language model training that enhances output quality while preserving deployment efficiency.
Post-Completion Learning (PCL) extends language model training beyond output completion to enhance reasoning and self-evaluation through continued generation and alignment with reward functions, improving performance while maintaining efficient inference.
English Summary:
Authors:Xin Yin, Chao Ni, Liushan Chen, Xiaohu Yang
Abstract:
Large Language Models (LLMs) have demonstrated remarkable potential in automating software development tasks. While recent advances leverage Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to align models with human preferences, the optimal training strategy remains unclear across diverse code preference scenarios. This paper systematically investigates the roles of SFT and DPO in aligning LLMs with different code preferences. Through both theoretical analysis and empirical observation, we hypothesize that SFT excels in scenarios with objectively verifiable optimal solutions, while applying SFT followed by DPO (S&D) enables models to explore superior solutions in scenarios without objectively verifiable optimal solutions. Based on the analysis and experimental evidence, we propose Adaptive Preference Optimization (APO), a dynamic integration approach that adaptively amplifies preferred responses, suppresses dispreferred ones, and encourages exploration of potentially superior solutions during training. Extensive experiments across six representative code preference tasks validate our theoretical hypotheses and demonstrate that APO consistently matches or surpasses the performance of existing SFT and S&D strategies. Our work provides both theoretical foundations and practical guidance for selecting appropriate training strategies in different code preference alignment scenarios.
中文: 本文提出自适应偏好优化(APO),这是一种动态整合监督微调与直接偏好优化的训练策略,能根据不同代码偏好场景自适应调整,在六项代表性任务中持续达到或超越现有方法的性能表现。
English: This paper introduces Adaptive Preference Optimization (APO), a dynamic training strategy that adaptively combines Supervised Fine-Tuning and Direct Preference Optimization to align Large Language Models with diverse code preferences, consistently matching or surpassing existing methods across six tasks.
Authors:Sara M. Ichinaga, Steven L. Brunton, Aleksandr Y. Aravkin, J. Nathan Kutz
Abstract:
The dynamic mode decomposition (DMD) is a data-driven approach that extracts the dominant features from spatiotemporal data. In this work, we introduce sparse-mode DMD, a new variant of the optimized DMD framework that specifically leverages sparsity-promoting regularization in order to approximate DMD modes which have localized spatial structure. The algorithm maintains the noise-robust properties of optimized DMD while disambiguating between modes which are spatially local versus global in nature. In many applications, such modes are associated with discrete and continuous spectra respectively, thus allowing the algorithm to explicitly construct, in an unsupervised manner, the distinct portions of the spectrum. We demonstrate this by analyzing synthetic and real-world systems, including examples from optical waveguides, quantum mechanics, and sea surface temperature data.
中文: 本文提出稀疏模态动态模式分解,作为优化DMD框架的新变体,通过稀疏正则化识别空间局部化模态,在保持噪声鲁棒性的同时实现离散谱与连续谱的无监督分离,并在波导、量子力学和海温数据中得到验证。
English: This paper introduces sparse-mode DMD, a novel variant of optimized dynamic mode decomposition that uses sparsity regularization to identify spatially localized modes while maintaining noise robustness, enabling unsupervised separation of discrete and continuous spectral components across various applications.
Authors:Hsin-Yi Lin, Samuel Yen-Chi Chen, Huan-Hsin Tseng, Shinjae Yoo
Abstract:
Hybrid quantum-classical frameworks leverage quantum computing for machine learning; however, variational quantum circuits (VQCs) are limited by the need for local measurements. We introduce an adaptive non-local observable (ANO) paradigm within VQCs for quantum reinforcement learning (QRL), jointly optimizing circuit parameters and multi-qubit measurements. The ANO-VQC architecture serves as the function approximator in Deep Q-Network (DQN) and Asynchronous Advantage Actor-Critic (A3C) algorithms. On multiple benchmark tasks, ANO-VQC agents outperform baseline VQCs. Ablation studies reveal that adaptive measurements enhance the function space without increasing circuit depth. Our results demonstrate that adaptive multi-qubit observables can enable practical quantum advantages in reinforcement learning.
中文: ANO-VQC框架通过联合优化电路参数和自适应多量子比特测量,在量子强化学习中实现了优于传统变分量子电路的性能,且无需增加电路复杂度。
English: The ANO-VQC framework enhances quantum reinforcement learning by jointly optimizing circuit parameters and adaptive multi-qubit measurements, outperforming traditional VQCs without increasing circuit complexity.
Authors:Ali Shakeri, Wei Emma Zhang, Amin Beheshti, Weitong Chen, Jian Yang, Lishan Yang
Abstract:
Pre-trained Language Models (PLMs) have demonstrated impressive performance in various NLP tasks. However, traditional fine-tuning methods for leveraging PLMs for downstream tasks entail significant computational overhead. Prompt-tuning has emerged as an efficient alternative that involves prepending a limited number of parameters to the input sequence and only updating them while the PLM's parameters are frozen. However, this technique's prompts remain fixed for all inputs, reducing the model's flexibility. The Federated Learning (FL) technique has gained attention in recent years to address the growing concerns around data privacy. However, challenges such as communication and computation limitations of clients still need to be addressed. To mitigate these challenges, this paper introduces the Federated Dynamic Prompt Generator (FedDPG), which incorporates a dynamic prompt generator network to generate context-aware prompts based on the given input, ensuring flexibility and adaptability while prioritising data privacy in federated learning settings. Our experiments on three NLP benchmark datasets showcase that FedDPG outperforms the state-of-the-art parameter-efficient fine-tuning methods in terms of global model performance, and has significantly reduced the calculation time and the number of parameters to be sent through the FL network.
中文: FedDPG通过动态提示生成器产生上下文感知提示,在联邦学习中提升灵活性和数据隐私保护,同时以更低的计算成本超越现有方法。
English: FedDPG introduces a dynamic prompt generator to create context-aware prompts, enhancing flexibility and data privacy in federated learning while outperforming existing methods with reduced computational costs.
Authors:Yaowei Bai, Ruiheng Zhang, Yu Lei, Jingfeng Yao, Shuguang Ju, Chaoyang Wang, Wei Yao, Yiwan Guo, Guilin Zhang, Chao Wan, Qian Yuan, Xuhua Duan, Xinggang Wang, Tao Sun, Yongchao Xu, Chuansheng Zheng, Huangxuan Zhao, Bo Du
Abstract:
A global shortage of radiologists has been exacerbated by the significant volume of chest X-ray workloads, particularly in primary care. Although multimodal large language models show promise, existing evaluations predominantly rely on automated metrics or retrospective analyses, lacking rigorous prospective clinical validation. Janus-Pro-CXR (1B), a chest X-ray interpretation system based on DeepSeek Janus-Pro model, was developed and rigorously validated through a multicenter prospective trial (NCT06874647). Our system outperforms state-of-the-art X-ray report generation models in automated report generation, surpassing even larger-scale models including ChatGPT 4o (200B parameters), while demonstrating robust detection of eight clinically critical radiographic findings (area under the curve, AUC > 0.8). Retrospective evaluation confirms significantly higher report accuracy than Janus-Pro and ChatGPT 4o. In prospective clinical deployment, AI assistance significantly improved report quality scores (4.37 vs. 4.11, P < 0.001), reduced interpretation time by 18.5% (P < 0.001), and was preferred by a majority of experts (3 out of 5) in 52.7% of cases. Through lightweight architecture and domain-specific optimization, Janus-Pro-CXR improves diagnostic reliability and workflow efficiency, particularly in resource-constrained settings. The model architecture and implementation framework will be open-sourced to facilitate the clinical translation of AI-assisted radiology solutions.
中文: Janus-Pro-CXR 胸部X光智能诊断系统通过前瞻性临床验证,在提升诊断准确性和工作效率方面表现卓越,其轻量化架构特别适合资源有限的环境使用。
English: Janus-Pro-CXR, a lightweight AI system for chest X-ray interpretation, demonstrates superior accuracy in detecting critical findings and generating reports while significantly improving clinical workflow efficiency and diagnostic reliability in prospective trials.
Authors:Muhammad Ahmed Mohsin, Sagnik Bhattacharya, Abhiram Gorle, Muhammad Ali Jamshed, John M. Cioffi
Abstract:
Wireless support of virtual reality (VR) has challenges when a network has multiple users, particularly for 3D VR gaming, digital AI avatars, and remote team collaboration. This work addresses these challenges through investigation of the low-rank channels that inevitably occur when there are more active users than there are degrees of spatial freedom, effectively often the number of antennas. The presented approach uses optimal nonlinear transceivers, equivalently generalized decision-feedback or successive cancellation for uplink and superposition or dirty-paper precoders for downlink. Additionally, a powerful optimization approach for the users' energy allocation and decoding order appears to provide large improvements over existing methods, effectively nearing theoretical optima. As the latter optimization methods pose real-time challenges, approximations using deep reinforcement learning (DRL) are used to approximate best performance with much lower (5x at least) complexity. Experimental results show significantly larger sum rates and very large power savings to attain the data rates found necessary to support VR. Experimental results show the proposed algorithm outperforms current industry standards like orthogonal multiple access (OMA), non-orthogonal multiple access (NOMA), as well as the highly researched methods in multi-carrier NOMA (MC-NOMA), enhancing sum data rate by 39%, 28%, and 16%, respectively, at a given power level. For the same data rate, it achieves power savings of 75%, 45%, and 40%, making it ideal for VR applications. Additionally, a near-optimal deep reinforcement learning (DRL)-based resource allocation framework for real-time use by being 5x faster and reaching 83% of the global optimum is introduced.
中文: 本研究通过采用最优非线性收发器和基于深度强化学习的资源分配框架,解决了多用户网络中无线VR的挑战,相比现有方法显著提升了数据速率并大幅降低了功耗。
English: This study tackles wireless VR challenges in multi-user networks by employing optimal nonlinear transceivers and a deep reinforcement learning-based resource allocation framework, achieving significant data rate improvements and power savings compared to existing methods.
Authors:Hao Li, Yizheng Sun, Viktor Schlegel, Kailai Yang, Riza Batista-Navarro, Goran Nenadic
Abstract:
Argument summarization aims to generate concise, structured representations of complex, multi-perspective debates. While recent work has advanced the identification and clustering of argumentative components, the generation stage remains underexplored. Existing approaches typically rely on single-pass generation, offering limited support for factual correction or structural refinement. To address this gap, we introduce Arg-LLaDA, a novel large language diffusion framework that iteratively improves summaries via sufficiency-guided remasking and regeneration. Our method combines a flexible masking controller with a sufficiency-checking module to identify and revise unsupported, redundant, or incomplete spans, yielding more faithful, concise, and coherent outputs. Empirical results on two benchmark datasets demonstrate that Arg-LLaDA surpasses state-of-the-art baselines in 7 out of 10 automatic evaluation metrics. In addition, human evaluations reveal substantial improvements across core dimensions, coverage, faithfulness, and conciseness, validating the effectiveness of our iterative, sufficiency-aware generation strategy.
中文: Arg-LLaDA是一种基于充分性引导重构与迭代生成的大语言扩散框架,通过改进摘要的忠实性、简洁性和连贯性,在自动评估和人工评估中均显著优于现有方法。
English: Arg-LLaDA is an iterative large language diffusion framework that enhances argument summaries through sufficiency-guided remasking and regeneration, outperforming existing methods in both automatic and human evaluations.
Authors:Hongyuan Lu, Zixuan Li, Zefan Zhang, Wai Lam
Abstract:
There are more than 7,000 languages around the world, and current Large Language Models (LLMs) only support hundreds of languages. Dictionary-based prompting methods can enhance translation on them, but most methods use all the available dictionaries, which could be expensive. Instead, it will be flexible to have a trade-off between token consumption and translation performance. This paper proposes a novel task called \textbf{A}utomatic \textbf{D}ictionary \textbf{S}election (\textbf{ADS}). The goal of the task is to automatically select which dictionary to use to enhance translation. We propose a novel and effective method which we call \textbf{S}elect \textbf{Lo}w-frequency \textbf{W}ords! (\textbf{SLoW}) which selects those dictionaries that have a lower frequency. Our methods have unique advantages. First, there is no need for access to the training data for frequency estimation (which is usually unavailable). Second, it inherits the advantage of dictionary-based methods, where no additional tuning is required on LLMs. Experimental results on 100 languages from FLORES indicate that SLoW surpasses strong baselines, and it can obviously save token usage, with many languages even surpassing the translation performance of the full dictionary baseline.\footnote{A shocking fact is that there is no need to use the actual training data (often unobtainable) for frequency estimation, and an estimation frequency obtained using public resources is still apparently effective in improving translation with ChatGPT and Llama, and DeepSeek.}\footnote{Code and data available upon publication.}
中文: 本文提出自动词典选择(ADS)这一新任务,旨在通过选择性使用词典来优化翻译性能与计算资源消耗的平衡,并开发了SLoW方法——无需访问训练数据或调整大语言模型即可优先选择低频词典,显著提升翻译效果。
English: This paper introduces Automatic Dictionary Selection (ADS), a novel task that optimizes the trade-off between token usage and translation performance by selectively using dictionaries, and proposes the SLoW method which prioritizes low-frequency words without requiring training data access or additional LLM tuning.
Authors:Xin Wei, Qin Yang, Yijie Fang, Mingrui Zhu, Nannan Wang
Abstract:
While test-time adaptation (TTA) methods effectively address domain shifts by dynamically adapting pre-trained models to target domain data during online inference, their application to 3D point clouds is hindered by their irregular and unordered structure. Current 3D TTA methods often rely on computationally expensive spatial-domain optimizations and may require additional training data. In contrast, we propose Graph Spectral Domain Test-Time Adaptation (GSDTTA), a novel approach for 3D point cloud classification that shifts adaptation to the graph spectral domain, enabling more efficient adaptation by capturing global structural properties with fewer parameters. Point clouds in target domain are represented as outlier-aware graphs and transformed into graph spectral domain by Graph Fourier Transform (GFT). For efficiency, adaptation is performed by optimizing only the lowest 10% of frequency components, which capture the majority of the point cloud's energy. An inverse GFT (IGFT) is then applied to reconstruct the adapted point cloud with the graph spectral-driven point shift. This process is enhanced by an eigenmap-guided self-training strategy that iteratively refines both the spectral adjustments and the model parameters. Experimental results and ablation studies on benchmark datasets demonstrate the effectiveness of GSDTTA, outperforming existing TTA methods for 3D point cloud classification.
中文: 提出的图谱域测试时适应方法通过将点云转换至图谱域并仅优化关键频率成分,以更少参数高效捕获全局结构特性,在三维点云分类中优于现有方法。
English: The proposed Graph Spectral Domain Test-Time Adaptation (GSDTTA) efficiently adapts 3D point cloud models by transforming data into the graph spectral domain and optimizing only key frequency components, outperforming existing methods with fewer parameters and enhanced structural capture.
Authors:Hongjie Chen, Zehan Li, Yaodong Song, Wenming Deng, Yitong Yao, Yuxin Zhang, Hang Lv, Xuechao Zhu, Jian Kang, Jie Lian, Jie Li, Chao Wang, Shuangyong Song, Yongxiang Li, Zhongjiang He, Xuelong Li
Abstract:
Recent advances in end-to-end spoken language models (SLMs) have significantly improved the ability of AI systems to engage in natural spoken interactions. However, most existing models treat speech merely as a vehicle for linguistic content, often overlooking the rich paralinguistic and speaker characteristic cues embedded in human speech, such as dialect, age, emotion, and non-speech vocalizations. In this work, we introduce GOAT-SLM, a novel spoken language model with paralinguistic and speaker characteristic awareness, designed to extend spoken language modeling beyond text semantics. GOAT-SLM adopts a dual-modality head architecture that decouples linguistic modeling from acoustic realization, enabling robust language understanding while supporting expressive and adaptive speech generation. To enhance model efficiency and versatility, we propose a modular, staged training strategy that progressively aligns linguistic, paralinguistic, and speaker characteristic information using large-scale speech-text corpora. Experimental results on TELEVAL, a multi-dimensional evaluation benchmark, demonstrate that GOAT-SLM achieves well-balanced performance across both semantic and non-semantic tasks, and outperforms existing open-source models in handling emotion, dialectal variation, and age-sensitive interactions. This work highlights the importance of modeling beyond linguistic content and advances the development of more natural, adaptive, and socially aware spoken language systems.
Chinese: GOAT-SLM是一种新型语音语言模型,通过双模态架构和分阶段训练策略,实现了对副语言和说话人特征的感知,在语义和非语义任务中均取得均衡优异的表现。
English: GOAT-SLM is a novel spoken language model that incorporates paralinguistic and speaker characteristic awareness, achieving balanced performance in both semantic and non-semantic tasks through its dual-modality architecture and staged training strategy.
Authors:Zehan Li, Hongjie Chen, Yuxin Zhang, Jing Zhou, Xuening Wang, Hang Lv, Mengjie Du, Yaodong Song, Jie Lian, Jian Kang, Jie Li, Yongxiang Li, Zhongjiang He, Xuelong Li
Abstract:
Spoken language models (SLMs) have seen rapid progress in recent years, along with the development of numerous benchmarks for evaluating their performance. However, most existing benchmarks primarily focus on evaluating whether SLMs can perform complex tasks comparable to those tackled by large language models (LLMs), often failing to align with how users naturally interact in real-world conversational scenarios. In this paper, we propose TELEVAL, a dynamic benchmark specifically designed to evaluate SLMs' effectiveness as conversational agents in realistic Chinese interactive settings. TELEVAL defines three evaluation dimensions: Explicit Semantics, Paralinguistic and Implicit Semantics, and System Abilities. It adopts a dialogue format consistent with real-world usage and evaluates text and audio outputs separately. TELEVAL particularly focuses on the model's ability to extract implicit cues from user speech and respond appropriately without additional instructions. Our experiments demonstrate that despite recent progress, existing SLMs still have considerable room for improvement in natural conversational tasks. We hope that TELEVAL can serve as a user-centered evaluation framework that directly reflects the user experience and contributes to the development of more capable dialogue-oriented SLMs.
中文: 本文提出TELEVAL这一动态评测基准,专门针对中文真实交互场景评估口语模型的对话能力,重点关注隐式线索理解,实验表明现有模型在自然对话任务上仍有较大提升空间。
English: This paper introduces TELEVAL, a dynamic benchmark designed to assess spoken language models as conversational agents in realistic Chinese settings, focusing on implicit cue understanding and showing current models still need significant improvement in natural dialogue tasks.
Authors:Zihan Wang, Xinzhang Liu, Yitong Yao, Chao Wang, Yu Zhao, Zhihao Yang, Wenmin Deng, Kaipeng Jia, Jiaxin Peng, Yuyao Huang, Sishi Xiong, Zhuo Jiang, Kaidong Yu, Xiaohui Hu, Fubei Yao, Ruiyu Fang, Zhuoru Jiang, Ruiting Song, Qiyi Xie, Rui Xue, Xuewei He, Yanlei Xue, Zhu Yuan, Zhaoxi Zhang, Zilu Huang, Shiquan Wang, Xin Wang, Hanming Wu, Mingyuan Wang, Xufeng Zhan, Yuhan Sun, Zhaohu Xing, Yuhao Jiang, Bingkai Yang, Shuangyong Song, Yongxiang Li, Zhongjiang He, Xuelong Li
Abstract:
We introduce the latest series of TeleChat models: \textbf{TeleChat2}, \textbf{TeleChat2.5}, and \textbf{T1}, offering a significant upgrade over their predecessor, TeleChat. Despite minimal changes to the model architecture, the new series achieves substantial performance gains through enhanced training strategies in both pre-training and post-training stages. The series begins with \textbf{TeleChat2}, which undergoes pretraining on 10 trillion high-quality and diverse tokens. This is followed by Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to further enhance its capabilities. \textbf{TeleChat2.5} and \textbf{T1} expand the pipeline by incorporating a continual pretraining phase with domain-specific datasets, combined with reinforcement learning (RL) to improve performance in code generation and mathematical reasoning tasks. The \textbf{T1} variant is designed for complex reasoning, supporting long Chain-of-Thought (CoT) reasoning and demonstrating substantial improvements in mathematics and coding. In contrast, \textbf{TeleChat2.5} prioritizes speed, delivering rapid inference. Both flagship models of \textbf{T1} and \textbf{TeleChat2.5} are dense Transformer-based architectures with 115B parameters, showcasing significant advancements in reasoning and general task performance compared to the original TeleChat. Notably, \textbf{T1-115B} outperform proprietary models such as OpenAI's o1-mini and GPT-4o. We publicly release \textbf{TeleChat2}, \textbf{TeleChat2.5} and \textbf{T1}, including post-trained versions with 35B and 115B parameters, to empower developers and researchers with state-of-the-art language models tailored for diverse applications.
中文: TeleChat系列推出TeleChat2、TeleChat2.5和T1模型,通过改进的训练策略实现显著性能提升,其中T1-115B在复杂推理任务上超越了GPT-4o等专有模型。
English: The TeleChat series introduces TeleChat2, TeleChat2.5, and T1 models that achieve substantial performance improvements through enhanced training strategies, with T1-115B outperforming proprietary models like GPT-4o in complex reasoning tasks.
Authors:Xiaoxue Chen, Bhargav Chandaka, Chih-Hao Lin, Ya-Qin Zhang, David Forsyth, Hao Zhao, Shenlong Wang
Abstract:
We present InvRGB+L, a novel inverse rendering model that reconstructs large, relightable, and dynamic scenes from a single RGB+LiDAR sequence. Conventional inverse graphics methods rely primarily on RGB observations and use LiDAR mainly for geometric information, often resulting in suboptimal material estimates due to visible light interference. We find that LiDAR's intensity values-captured with active illumination in a different spectral range-offer complementary cues for robust material estimation under variable lighting. Inspired by this, InvRGB+L leverages LiDAR intensity cues to overcome challenges inherent in RGB-centric inverse graphics through two key innovations: (1) a novel physics-based LiDAR shading model and (2) RGB-LiDAR material consistency losses. The model produces novel-view RGB and LiDAR renderings of urban and indoor scenes and supports relighting, night simulations, and dynamic object insertions, achieving results that surpass current state-of-the-art methods in both scene-level urban inverse rendering and LiDAR simulation.
Chinese Summary: InvRGB+L是一种新型逆向渲染模型,通过结合激光雷达强度线索与物理着色模型和材质一致性约束,从RGB+激光雷达数据重建大规模动态场景,在场景重建和模拟能力上超越现有方法。
English Summary: InvRGB+L is a novel inverse rendering model that reconstructs large, dynamic scenes from RGB+LiDAR data by leveraging LiDAR intensity cues through a physics-based shading model and material consistency losses, outperforming existing methods in scene reconstruction and simulation capabilities.
Authors:Gert Vankan, Valentina Breschi, Simone Formentin
Abstract:
Data-driven predictive control approaches, in general, and Data-enabled Predictive Control (DeePC), in particular, exploit matrices of raw input/output trajectories for control design. These data are typically gathered only from the system to be controlled. Nonetheless, the increasing connectivity and inherent similarity of (mass-produced) systems have the potential to generate a considerable amount of information that can be exploited to undertake a control task. In light of this, we propose a preliminary federated extension of DeePC that leverages a combination of input/output trajectories from multiple similar systems for predictive control. Supported by a suite of numerical examples, our analysis unveils the potential benefits of exploiting information from similar systems and its possible downsides.
中文: 本研究提出了一种数据驱动预测控制(DeePC)的联邦扩展方法,通过整合多个相似系统的输入/输出轨迹来提升预测控制性能,同时揭示了利用此类信息的优势与潜在风险。
English: The study introduces a federated extension of Data-enabled Predictive Control (DeePC) that utilizes input/output trajectories from multiple similar systems to enhance predictive control, highlighting both the benefits and potential drawbacks of this approach.
Authors:Qing Wang, Zehan Li, Hang Lv, Hongjie Chen, Yaodong Song, Jian Kang, Jie Lian, Jie Li, Yongxiang Li, Zhongjiang He, Xuelong Li
Abstract:
Human communication involves more than explicit semantics, with implicit signals and contextual cues playing a critical role in shaping meaning. However, modern speech technologies, such as Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) often fail to capture these beyond-semantic dimensions. To better characterize and benchmark the progression of speech intelligence, we introduce Spoken Interaction System Capability Levels (L1-L5), a hierarchical framework illustrated the evolution of spoken dialogue systems from basic command recognition to human-like social interaction. To support these advanced capabilities, we propose Beyond-Semantic Speech (BoSS), which refers to the set of information in speech communication that encompasses but transcends explicit semantics. It conveys emotions, contexts, and modifies or extends meanings through multidimensional features such as affective cues, contextual dynamics, and implicit semantics, thereby enhancing the understanding of communicative intentions and scenarios. We present a formalized framework for BoSS, leveraging cognitive relevance theories and machine learning models to analyze temporal and contextual speech dynamics. We evaluate BoSS-related attributes across five different dimensions, reveals that current spoken language models (SLMs) are hard to fully interpret beyond-semantic signals. These findings highlight the need for advancing BoSS research to enable richer, more context-aware human-machine communication.
中文: 人类交流依赖于超越显性语义的隐含信号,但现有语音技术常忽略这些层面,为此提出了层级框架和超语义语音(BoSS)以推动情境感知系统发展,然而评估显示当前模型仍难以完全解析这类信息。
English: Human communication relies on implicit cues beyond explicit semantics, yet current speech technologies often miss these dimensions, prompting the introduction of a hierarchical framework and Beyond-Semantic Speech (BoSS) to advance context-aware systems, though evaluations show existing models still struggle with full interpretation.
Authors:Miguel Escudero-Jiménez, Noé Pérez-Higueras, Andrés MartÃnez-Silva, Fernando Caballero, Luis Merino
Abstract:
This work presents a new iteration of the Human Navigation Simulator (HuNavSim), a novel open-source tool for the simulation of different human-agent navigation behaviors in scenarios with mobile robots. The tool, programmed under the ROS 2 framework, can be used together with different well-known robotics simulators such as Gazebo or NVidia Isaac Sim. The main goal is to facilitate the development and evaluation of human-aware robot navigation systems in simulation. In this new version, several features have been improved and new ones added, such as the extended set of actions and conditions that can be combined in Behavior Trees to compound complex and realistic human behaviors.
中文: 新版HuNavSim作为开源工具,增强了在移动机器人场景中模拟多样化人机导航行为的能力,通过扩展行为树动作等新功能构建复杂逼真的人类行为,便于开发和评估人类感知的机器人导航系统。
English: This updated version of HuNavSim enhances its capabilities as an open-source tool for simulating diverse human-agent navigation behaviors alongside mobile robots, featuring improved and new functionalities like expanded Behavior Tree actions to create complex, realistic human behaviors for developing human-aware robot navigation systems.
Authors:Jianpeng Qi, Chao Liu, Rui Wang, Junyu Dong, Yanwei Yu
Abstract:
Timely and efficient dissemination of server status is critical in compute-first networking systems, where user tasks arrive dynamically and computing resources are limited and stochastic. In such systems, the access point plays a key role in forwarding tasks to a server based on its latest received server status. However, modeling the task-success probability suffering the factors of stochastic arrivals, limited server capacity, and bidirectional link delays. Therefore, we introduce a unified analytical framework that abstracts the AP forwarding rule as a single probability and models all network and waiting delays via their Laplace transforms. This approach yields a closed form expression for the end to end task success probability, together with upper and lower bounds that capture Erlang loss blocking, information staleness, and random uplink/downlink delays. We validate our results through simulations across a wide range of parameters, showing that theoretical predictions and bounds consistently enclose observed success rates. Our framework requires only two interchangeable inputs (the forwarding probability and the delay transforms), making it readily adaptable to alternative forwarding policies and delay distributions. Experiments demonstrate that our bounds are able to achieve accuracy within 0.01 (upper bound) and 0.016 (lower bound) of the empirical task success probability.
Chinese: 本研究提出一个统一分析框架,通过抽象化接入点转发规则和网络延迟,对计算优先网络系统中的任务成功概率进行建模,提供了封闭表达式和经仿真验证的紧密度边界。
English: This study presents a unified analytical framework that models task-success probability in compute-first networking systems by abstracting access point forwarding rules and network delays, providing closed-form expressions and tight bounds validated through simulations.
Authors:Cheng Liu, Yifei Lu, Fanghua Ye, Jian Li, Xingyu Chen, Feiliang Ren, Zhaopeng Tu, Xiaolong Li
Abstract:
Role-Playing Language Agents (RPLAs) have emerged as a significant application direction for Large Language Models (LLMs). Existing approaches typically rely on prompt engineering or supervised fine-tuning to enable models to imitate character behaviors in specific scenarios, but often neglect the underlying \emph{cognitive} mechanisms driving these behaviors. Inspired by cognitive psychology, we introduce \textbf{CogDual}, a novel RPLA adopting a \textit{cognize-then-respond } reasoning paradigm. By jointly modeling external situational awareness and internal self-awareness, CogDual generates responses with improved character consistency and contextual alignment. To further optimize the performance, we employ reinforcement learning with two general-purpose reward schemes designed for open-domain text generation. Extensive experiments on the CoSER benchmark, as well as Cross-MR and LifeChoice, demonstrate that CogDual consistently outperforms existing baselines and generalizes effectively across diverse role-playing tasks.
中文: CogDual是一种角色扮演语言代理,采用认知推理方法结合外部情境和内部自我意识,通过强化学习优化,在多项任务中表现优于现有基准并具备良好泛化能力。
English: CogDual is a role-playing language agent that uses a cognitive reasoning approach combining external and internal awareness to enhance response consistency and alignment, outperforming existing methods through reinforcement learning and generalizing across tasks.
Authors:Zelong Liu, Yuliang Gu, Zhichao Sun, Huachao Zhu, Xin Xiao, Bo Du, Laurent Najman, Yongchao Xu
Abstract:
Crack detection is an important task in computer vision. Despite impressive in-dataset performance, deep learning-based methods still struggle in generalizing to unseen domains. The thin structure property of cracks is usually overlooked by previous methods. In this work, we introduce CrackCue, a novel method for robust crack detection based on coarse-to-fine crack cue generation. The core concept lies on leveraging the thin structure property to generate a robust crack cue, guiding the crack detection. Specifically, we first employ a simple max-pooling and upsampling operation on the crack image. This results in a coarse crack-free background, based on which a fine crack-free background can be obtained via a reconstruction network. The difference between the original image and fine crack-free background provides a fine crack cue. This fine cue embeds robust crack prior information which is unaffected by complex backgrounds, shadow, and varied lighting. As a plug-and-play method, we incorporate the proposed CrackCue into three advanced crack detection networks. Extensive experimental results demonstrate that the proposed CrackCue significantly improves the generalization ability and robustness of the baseline methods. The source code will be publicly available.
中文: CrackCue是一种新颖的即插即用方法,通过从粗到细的背景重建生成鲁棒的裂纹线索,有效提升裂纹检测在复杂场景下的泛化能力和鲁棒性。
English: CrackCue is a novel plug-and-play method that enhances crack detection by generating robust crack cues through coarse-to-fine background reconstruction, significantly improving generalization and robustness across complex conditions.
Authors:Ismail Hossain, Sai Puppala, Md Jahangir Alam, Sajedul Talukder
Abstract:
Social media platforms serve as a significant medium for sharing personal emotions, daily activities, and various life events, ensuring individuals stay informed about the latest developments. From the initiation of an account, users progressively expand their circle of friends or followers, engaging actively by posting, commenting, and sharing content. Over time, user behavior on these platforms evolves, influenced by demographic attributes and the networks they form. In this study, we present a novel approach that leverages open-source models Llama-3-Instruct, Mistral-7B-Instruct, Gemma-7B-IT through prompt engineering, combined with GPT-2, BERT, and RoBERTa using a joint embedding technique, to analyze and predict the evolution of user behavior on social media over their lifetime. Our experiments demonstrate the potential of these models to forecast future stages of a user's social evolution, including network changes, future connections, and shifts in user activities. Experimental results highlight the effectiveness of our approach, with GPT-2 achieving the lowest perplexity (8.21) in a Cross-modal configuration, outperforming RoBERTa (9.11) and BERT, and underscoring the importance of leveraging Cross-modal configurations for superior performance. This approach addresses critical challenges in social media, such as friend recommendations and activity predictions, offering insights into the trajectory of user behavior. By anticipating future interactions and activities, this research aims to provide early warnings about potential negative outcomes, enabling users to make informed decisions and mitigate risks in the long term.
中文: 本研究采用Llama-3-Instruct和GPT-2等开源模型的新方法,预测社交媒体用户行为的演变,可对潜在风险发出早期预警,帮助用户做出明智决策。
English: This study introduces a novel method using open-source models like Llama-3-Instruct and GPT-2 to predict the evolution of user behavior on social media, enabling early warnings about potential risks and aiding in informed decision-making.